Bachelor’s Thesis

As the final requirement of my bachelor’s degree, I completed a thesis titled Photorealistic Rendering of Training Data for Object Detection and Pose Estimation with a Physics Engine. The full thesis is available in the Papers section; this page provides a concise overview of the problem, approach, and results.


Pose Estimation

To understand the motivation behind the thesis, it is useful to consider the task of pose estimation. In computer vision, pose estimation refers to inferring the 3D pose of an object, including translation and rotation, from 2D observations such as RGB images and, in some cases, depth data.
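To make this concrete, a 6-DoF pose is usually written as a rotation R and translation t that map object coordinates into the camera frame; with known camera intrinsics K, any model point can then be projected into the image. The sketch below (plain NumPy, with illustrative intrinsics of my own choosing, not values from the thesis) shows this standard pinhole projection:

```python
import numpy as np

def project(points, R, t, K):
    """Project 3D model points into the image using a 6-DoF pose (R, t)
    and camera intrinsics K. Returns 2D pixel coordinates."""
    cam = points @ R.T + t           # model frame -> camera frame
    uvw = cam @ K.T                  # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide

# Identity rotation, object 2 m in front of the camera.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])
uv = project(np.array([[0.0, 0.0, 0.0]]), R, t, K)  # object origin
```

Since the object origin lies on the optical axis, it projects to the principal point (320, 240); pose estimation is the inverse problem of recovering R and t from such observations.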

Inferred bounding boxes using pose estimation

Modern approaches to this problem often rely on convolutional neural networks, which require large amounts of labeled training data. In this context, labeled data must include at least the 3D poses of visible objects. Acquiring such data is difficult: large real-world datasets are limited, and the generation of synthetic datasets that generalize well remains an active research problem.


Generator Pipeline

The goal of the thesis was to develop a data generation pipeline that combines the realism of captured imagery with the flexibility and scale of synthetic data generation. The core idea was to render virtual objects and blend them into existing real-world photographs.

Example outputs from the synthetic image generator

The virtual objects were based on scans from the LineMOD dataset, while the background imagery came from 3RScan, a dataset of reconstructed indoor scenes captured with RGB-D sensors. This combination made it possible to incorporate real scene context while still simulating the interaction of inserted objects with the environment.

The generator first sampled objects into a randomly selected scene and simulated their physical interaction using NVIDIA PhysX. Once the objects reached a stable configuration, the rendering stage began. Images with excessive blur were filtered out, and the remaining views were used for synthetic data generation. Rendering was implemented using Blender and Appleseed, as the project required an open-source software stack.
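One common sharpness heuristic for the blur-filtering step is the variance of the Laplacian, low values of which indicate little high-frequency detail. The thesis does not necessarily use this exact metric; the sketch below is an illustrative NumPy implementation of the idea:

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian response over a grayscale
    image; low values indicate a blurry (low-detail) view."""
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

rng = np.random.default_rng(0)
sharp_score = laplacian_variance(rng.random((64, 64)))   # high-frequency content
blurry_score = laplacian_variance(np.full((64, 64), 0.5))  # flat, no detail
```

Views scoring below a chosen threshold would be discarded before the rendering stage.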

Because Blender does not provide a native C++ embedding interface, I used embedded Python together with Blender as a Python module to generate and execute render commands. Communication between the C++ pipeline and the Python-based rendering layer was handled through a flexible custom JSON input format, which helped keep both parts of the system largely decoupled.
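To illustrate the decoupling this enables, the sketch below shows one way such a JSON interface can look. All field names and values here are hypothetical, invented for illustration, not the thesis's actual schema: the C++ side serializes a render request, and the Python layer inside Blender parses it and routes it to a handler:

```python
import json

# Hypothetical render command, serialized on the C++ side.
command = {
    "action": "render_depth",
    "scene": "scene_042",
    "camera": {"position": [0.1, 1.4, 2.0], "rotation": [0.0, 0.0, 0.0]},
    "objects": [{"mesh": "linemod_ape",
                 "pose": [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0.4]]}],
    "output": "out/0001_depth.exr",
}
msg = json.dumps(command)

# Receiving side (embedded Python): decode and dispatch by action name.
request = json.loads(msg)
handlers = {"render_depth": lambda req: f"rendering depth to {req['output']}"}
result = handlers[request["action"]](request)
```

Because both sides only agree on the JSON schema, either the C++ pipeline or the rendering layer can evolve independently, which matches the decoupling described above.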

At the time, a Blender-compatible build of Appleseed existed only for Windows. As part of the project, I created a Docker-based build pipeline that produces a Blender-compatible Linux build of Appleseed, restoring Linux support for this workflow.

Rendering Pipeline

The rendering process was structured into several stages. First, the depth of both the inserted objects and the underlying scene was rendered from a selected camera pose. This information was used to compute an object visibility mask, which determined whether the current sample contained enough visible content to justify further processing.
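The visibility test can be sketched as a per-pixel depth comparison: a pixel of an inserted object counts as visible when the object's depth is smaller than the scene's depth at that pixel. The NumPy sketch below illustrates the idea; the minimum-visibility threshold and the exact acceptance criterion are my assumptions, not values from the thesis:

```python
import numpy as np

def visibility_mask(object_depth, scene_depth, min_visible=0.4):
    """Pixels where the inserted object lies in front of the scene
    geometry. Returns the mask and whether enough of the object
    (as a fraction of its rendered pixels) remains visible."""
    rendered = np.isfinite(object_depth)               # object covers pixel
    visible = rendered & (object_depth < scene_depth)  # not occluded by scene
    total = rendered.sum()
    fraction = visible.sum() / total if total else 0.0
    return visible, fraction >= min_visible

# Toy example: object occupies the left half of a 4x4 view; foreground
# scene geometry hides its bottom rows.
obj = np.full((4, 4), np.inf)
obj[:, :2] = 1.0               # object surface at 1 m on the left half
scene = np.full((4, 4), 2.0)   # background wall at 2 m
scene[2:, :] = 0.5             # foreground geometry at 0.5 m
mask, keep = visibility_mask(obj, scene)
```

Here half of the object's pixels survive the depth test, so the sample passes and proceeds to the later rendering stages.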

The mask generation step

If a sample passed this initial visibility test, the pipeline rendered a label image encoding object identities as grayscale values. Next, ambient occlusion was rendered for the objects while still accounting for the surrounding scene indirectly. After that, the objects were rendered using a physically based shading workflow, with the scene geometry contributing indirect diffuse and specular effects without being rendered directly into the final object layer.
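The label image can be thought of as a mapping from integer object IDs to distinguishable grayscale values. The spacing scheme below is an assumption for illustration (the thesis's exact encoding may differ), but it shows the principle of keeping IDs recoverable from an 8-bit image:

```python
import numpy as np

def encode_labels(instance_ids, max_objects=15):
    """Map integer object IDs (0 = background) to evenly spaced
    grayscale values so labels survive 8-bit image storage."""
    step = 255 // max_objects
    return (instance_ids * step).astype(np.uint8)

ids = np.array([[0, 1],
                [2, 3]])       # background plus three object instances
label_img = encode_labels(ids)
```

Decoding is the inverse division by the same step, so object identities can be read back from the stored label image during annotation export.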

Comparison of objects with and without including scene context

In the final stage, the ambient occlusion and PBR object renders were combined with the occlusion mask and blended into the real image. The resulting synthetic output was then stored together with the corresponding annotations and a combined depth map. On my system, the end-to-end pipeline ran entirely on the CPU and required approximately 15 seconds per image.
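The compositing stage can be sketched as a masked alpha blend: the PBR object render, attenuated by the ambient occlusion term, replaces the background photograph wherever the visibility mask is set. This is a simplified NumPy illustration of the principle, not the thesis's exact compositing math:

```python
import numpy as np

def composite(background, object_rgb, ao, mask):
    """Blend the PBR object render, darkened by the ambient occlusion
    factor in [0, 1], over the real photograph using the visibility mask."""
    alpha = mask.astype(np.float64)[..., None]
    shaded = object_rgb * ao[..., None]
    return background * (1.0 - alpha) + shaded * alpha

# Toy 2x2 RGB example: object visible on the main diagonal only.
bg = np.full((2, 2, 3), 0.8)
obj = np.full((2, 2, 3), 0.5)
ao = np.full((2, 2), 0.6)
mask = np.array([[True, False],
                 [False, True]])
out = composite(bg, obj, ao, mask)
```

Masked pixels take the shaded object color (0.5 × 0.6 = 0.3) while the rest keep the original photograph, which is exactly what lets the inserted objects inherit real scene context.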

The PBR rendering step

Results

The generated outputs were promising and, in many cases, visually competitive with fully synthetic rendering approaches. At the same time, several limitations remained, particularly in lighting consistency, shadow integration, and clipping artifacts. These issues were identified as the main areas for future work, and the project was later extended through a follow-up thesis.

Comparison between my approach (right) and a fully synthetic rendering pipeline (left)

To evaluate the generated data, we trained a Mask R-CNN model on a dataset of 10,000 images. Training time was limited, and the resulting model often failed to detect objects consistently. However, when detections occurred, they were generally accurate, suggesting that the generated data had useful signal despite the remaining rendering limitations.

Mask R-CNN results on data generated by the pipeline

Conclusion

This thesis marked my first larger research project with a clear focus on computer vision and AI-related data generation. It combined physically based simulation, rendering, dataset generation, and downstream model evaluation into a single pipeline and produced results that were promising enough to motivate further academic work.

The project certainly outgrew the scope of a standard bachelor’s thesis and required substantial implementation effort across multiple technical domains. The source code is available on GitHub. Download links for the thesis itself and a precompiled Windows build of the generator are provided below.
