Summary
The use of maps in robotics is fundamental to the discipline itself. Current advances in the field allow autonomous robots to localize and navigate unknown environments using SLAM and planning techniques. However, the maps generated in this way encode only geometric constraints, so they cannot be used to understand what exists within that space.
We present a method for imbuing dense point maps with semantic information. Instead of operating directly in 3D space, where classifying point clouds is computationally expensive, we work in the 2D image plane, leveraging advances in neural-network image segmentation, and then transfer the resulting labels to three-dimensional space.
Introduction
Thanks to the proliferation of improvements in SLAM and planning algorithms, autonomous robots can now localize and navigate with great precision in previously unknown environments. This is possible because the robot's perception of the environment closely matches its geometry. That perception, however, lacks semantic information, which prevents autonomous agents from performing object-oriented tasks.
The absence of semantics in the perceived world implies that the environmental understanding is incomplete and, therefore, can be improved. Our work fills this information gap and lays the groundwork for enabling, for instance, object-oriented planning in unexplored environments.
In this study, we present an innovative method that transfers semantic knowledge obtained from vision to the 3D world, interpreted as a dense point cloud. To achieve this, we leverage previous work on semantic segmentation neural networks to obtain labels on 2D images. These labels are then transferred to the points in 3D space in a method-agnostic manner, regardless of the technique used for segmentation or for obtaining the camera poses and 3D points.
Method

Our method is illustrated as follows:
a) Scene Capture: A scene is captured using cameras.
b) Semantic Mask Generation: Neural networks are employed to generate a semantic mask of the scene.
c) Dense Reconstruction: A dense reconstruction of the scene is obtained.
d) Dense Map with Semantics: Finally, our algorithm combines the results from steps (b) and (c) to generate a dense map that includes semantic information.
Scene segmentation
In this work, we chose the DeepLab-v3 architecture due to its high accuracy. This component is interchangeable, because our method relies only on the masks generated by the segmentation, not on the specific architecture used to obtain them.
DeepLab-v3 is built upon two core ideas: first, feature extraction from images using the ResNet architecture, and second, the ability to segment at multiple scales via atrous convolutions combined with pyramid pooling.
ResNet. ResNet is a convolution-based architecture designed to extract image features for classification. Its novelty (and the origin of its name) lies in the introduction of residual connections between blocks (each block being a set of convolutions and activation layers). By introducing these connections, ResNet creates a direct gradient path that bypasses, or shortcuts, several layers, which mitigates the vanishing gradient problem.
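As a rough sketch, a residual block of this kind can be written in Keras as follows; the layer sizes are illustrative and are not those of the backbone used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Two 3x3 convolutions whose output is added back to the input (shortcut)."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    # The identity shortcut gives gradients a direct path around the block.
    y = layers.Add()([shortcut, y])
    return layers.ReLU()(y)

inputs = tf.keras.Input(shape=(224, 224, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
```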
Atrous Convolutions. Standard convolutions are an efficient way to extract contextual information, and their utility in image processing with neural networks has been extensively demonstrated.
The convolutions presented in Chen et al.'s work are different. They allow for explicit adjustment of the filter's field of view and control over the resolution of feature responses. This enables the capture of context at multiple scales and, when applied to segmentation, the encoding of objects at various scales.
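The effect on the field of view can be illustrated with a single layer: a 3 × 3 kernel with a dilation rate of 2 covers a 5 × 5 window while using the same nine weights (a toy example, not part of the actual network).

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 65, 65, 32))  # dummy feature map

# A standard 3x3 convolution sees a 3x3 neighbourhood per output pixel.
standard = layers.Conv2D(32, 3, padding="same")(x)

# With dilation_rate=2 the same 9 weights are spread over a 5x5 window,
# enlarging the field of view without extra parameters or loss of resolution.
atrous = layers.Conv2D(32, 3, padding="same", dilation_rate=2)(x)

print(standard.shape, atrous.shape)  # both (1, 65, 65, 32)
```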
DeepLab-v3 Architecture. The DeepLab-v3 architecture begins with an initial series of convolutions responsible for feature extraction. These features are then fed into a pyramidal pooling module that applies several atrous convolutions in parallel with different fields of view, combined with an image-pooling branch that captures global context. This design allows DeepLab-v3 to effectively segment objects at multiple scales.
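Schematically, such a pyramidal pooling module could be sketched as follows; the dilation rates and filter counts are illustrative, and the real implementation also includes batch normalization and other details omitted here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp(features, filters=256, rates=(6, 12, 18)):
    """Parallel atrous convolutions plus global image pooling, then fusion."""
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(features)]
    for r in rates:
        branches.append(
            layers.Conv2D(filters, 3, padding="same", dilation_rate=r,
                          activation="relu")(features))
    # Image-level pooling branch captures global context.
    h, w = features.shape[1], features.shape[2]
    pooled = layers.GlobalAveragePooling2D(keepdims=True)(features)
    pooled = layers.Conv2D(filters, 1, activation="relu")(pooled)
    pooled = layers.UpSampling2D(size=(h, w), interpolation="bilinear")(pooled)
    branches.append(pooled)
    fused = layers.Concatenate()(branches)
    return layers.Conv2D(filters, 1, activation="relu")(fused)

inputs = tf.keras.Input(shape=(65, 65, 2048))  # e.g. backbone output
outputs = aspp(inputs)
model = tf.keras.Model(inputs, outputs)
```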
Dense Map Labeling
For labeling the dense map, the algorithm requires the following:
- Camera poses
- Dense reconstruction in the same coordinate space as the poses
- Processed masks from the images captured by the different cameras
To ensure greater generality, the method does not use a list of which cameras observed which points (although this information is easily obtainable). This design choice complicates the solution, since point visibility must instead be estimated.
The algorithm's core idea is to reproject each point onto every camera pose and check whether it is visible. If it is, its u and v pixel coordinates are obtained and used to query the corresponding label from the mask image. These labels (one for each camera that viewed the point) are accumulated per point. A voting process then determines the point's final class; this is necessary because a single point can be seen by multiple cameras, and segmentation errors may lead to conflicting labels.
To filter out points not seen by a given camera, we check that their u and v coordinates fall between zero and the number of columns and rows of the image, respectively.
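As an illustration, this projection-and-voting step can be sketched in NumPy as follows, assuming a pinhole model with a shared intrinsic matrix K and world-to-camera poses (R, t); names and shapes are illustrative and do not correspond to our Matlab implementation.

```python
import numpy as np

def label_points(points, cameras, masks, K, num_classes):
    """Assign each 3D point the majority label over all cameras that see it.

    points  : (N, 3) array of 3D points in world coordinates.
    cameras : list of (R, t) pairs, R (3, 3) and t (3,), mapping world to camera.
    masks   : list of (H, W) integer label images, one per camera.
    K       : (3, 3) intrinsic calibration matrix shared by all cameras.
    """
    votes = np.zeros((points.shape[0], num_classes), dtype=np.int64)

    for (R, t), mask in zip(cameras, masks):
        h, w = mask.shape
        cam = points @ R.T + t                    # world -> camera coordinates
        z = cam[:, 2]
        proj = cam @ K.T                          # pinhole projection
        with np.errstate(divide="ignore", invalid="ignore"):
            u = proj[:, 0] / z
            v = proj[:, 1] / z
        # A point is considered visible if it lies in front of the camera
        # and its projection falls inside the image bounds.
        visible = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        idx = np.flatnonzero(visible)
        labels = mask[v[idx].astype(int), u[idx].astype(int)]
        votes[idx, labels] += 1                   # one vote per observing camera

    return votes.argmax(axis=1)                   # majority class per point
```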
The problem with this approach is that even if a point was occluded when the image was captured, its projection onto the image plane is still geometrically valid, and the point would therefore be labeled with the class of whatever occluded it.
To resolve this, we propose using a z-buffer to account for occlusions. The idea is to generate a map the size of the image that stores, for each pixel of the mask, the minimum depth observed in the z-direction, where z is the depth along the projection axis in the camera's coordinate system. The class at a pixel is then assigned only to the points whose depth matches this minimum. Since points are infinitesimal, the occlusion check is performed in pixel space.
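Under the same assumptions as the previous sketch, the per-camera z-buffer test can be written roughly as follows; a small depth tolerance could be added if needed.

```python
import numpy as np

def zbuffer_filter(u, v, z, height, width):
    """Return a boolean mask keeping, for each pixel, only the closest point.

    u, v, z are arrays for the points already projected inside one image.
    """
    pix = v.astype(int) * width + u.astype(int)   # flattened pixel index
    zbuf = np.full(height * width, np.inf)
    np.minimum.at(zbuf, pix, z)                   # per-pixel minimum depth
    return z <= zbuf[pix]                         # True where the point is frontmost

# Plugged into the per-camera loop of the previous sketch:
#   front = zbuffer_filter(u[idx], v[idx], z[idx], h, w)
#   idx, labels = idx[front], labels[front]
#   votes[idx, labels] += 1
```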
Implementation Details
For feature extraction, matching, camera pose estimation, and dense reconstruction, we utilized the VisualSFM software. Specifically:
- Features were extracted using a GPU implementation of SIFT.
- Camera pose adjustment was performed using parallelized Bundle Adjustment.
- Dense reconstruction employed PMVS/CMVS.
The neural network tasked with segmentation was a TensorFlow implementation of DeepLab-v3, with the segmentation process executed through its Python interface. Finally, the algorithm for labeling the dense map and its visualization were carried out in Matlab.
Results
While a wide variety of standardized datasets are available for testing such algorithms, this work requires sequences that contain objects from the specific classes the network can recognize, since the neural network used is trained to segment only a finite set of classes.
The images extracted from the dataset video have a resolution of 1920 × 1080 pixels at 96 dpi. They do not include the camera's intrinsic parameters, so these had to be estimated during the Bundle Adjustment phase. From these frames, the point clouds and camera poses summarized below were generated.
| Number of images | Number of cameras | Number of sparse points | Number of dense points |
|---|---|---|---|
| 37 | 28 | 10 324 | 184 122 |
The video sequence used for validating the method was recorded in an outdoor setting, primarily featuring a wall. This was done to maximize the number of points located by the algorithm and, as much as possible, ensure no issues arose from the localization pipeline.
(Figure: example frames from the recorded real-world sequence.)
After processing all the frames and executing the proposed algorithm, the result is a dense point cloud in which each point is assigned a semantic label. We present results for segmenting a motorbike (points assigned to this class are shaded in green), showing that partial occlusions are handled correctly.
Conclusions
In this work, we have presented a method for imbuing dense point maps with semantic information based on 2D segmentation neural networks.
As the results show, the method obtains point classes with acceptable accuracy without explicit knowledge of which cameras observed each 3D point (information that would both simplify the problem and improve precision). Lacking that information, we proposed and used a z-buffer to estimate geometric occlusions between points and cameras, thereby improving the results.
Future work
First, scaling to better segmentations is as straightforward as upgrading the underlying neural network used for predictions. More interestingly, we could improve the coverage of semantics across the point cloud.
One approach would be to diffuse the labels of points to nearby locations, which should smooth the results. Assuming that nearby regions tend to belong to the same semantic object, this could help mask errors in the reprojections or in the z-buffer tests. Additionally, geometric constraints on the points could guide this diffusion, following ideas from bilateral filtering.
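As a rough starting point, such a diffusion could be approximated by a k-nearest-neighbour majority vote, sketched here with SciPy; the neighbourhood size is arbitrary and the geometric (bilateral) weighting mentioned above is omitted.

```python
import numpy as np
from scipy.spatial import cKDTree

def smooth_labels(points, labels, k=8):
    """Re-label each point with the majority class among its k nearest neighbours."""
    tree = cKDTree(points)
    _, neighbours = tree.query(points, k=k)   # (N, k) neighbour indices
    neighbour_labels = labels[neighbours]     # (N, k) neighbour classes
    smoothed = np.empty_like(labels)
    for i, row in enumerate(neighbour_labels):
        values, counts = np.unique(row, return_counts=True)
        smoothed[i] = values[counts.argmax()] # majority vote in the neighbourhood
    return smoothed
```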