Reconstructing 3D indoor scenes from posed images is usually divided into two stages: per-image depth estimation, followed by depth fusion and surface reconstruction. Recently, several studies have proposed methods that perform reconstruction directly in a final 3D volumetric feature space. Although these methods achieve impressive reconstruction results, they rely on expensive 3D convolutional layers, which limits their use in resource-constrained environments.
Now, researchers from Niantic, UCL, and other institutions return to the traditional approach: they focus on high-quality multi-view depth prediction and then use a simple, off-the-shelf depth fusion method to obtain highly accurate 3D reconstructions.
The research presents a carefully designed 2D CNN that uses strong image priors together with a plane-sweep feature volume and geometric losses. The proposed method, SimpleRecon, leads existing methods in depth estimation by a significant margin and allows online, real-time, low-memory reconstruction.
As shown in the figure below, SimpleRecon’s reconstruction speed is very fast, taking only about 70ms per frame.
The comparison results between SimpleRecon and other methods are as follows:
Method
The depth estimation model sits at the intersection of monocular depth estimation and plane-sweep MVS. The researchers augment a depth prediction encoder-decoder architecture with a cost volume, as shown in Figure 2. An image encoder extracts matching features from the reference and source images as input to the cost volume. A 2D convolutional encoder-decoder network then processes the output of the cost volume, which is augmented with image-level features extracted by a separate pre-trained image encoder.
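To make the plane-sweep step concrete, here is a minimal NumPy sketch of how reference pixels can be projected into a source view at a set of hypothesised depths. It is an illustration under assumptions, not the authors' implementation: the function name `plane_sweep_coords`, the pinhole intrinsics, and the 4x4 reference-to-source transform `T_src_ref` are all hypothetical, and the actual feature sampling at the returned coordinates is omitted.

```python
import numpy as np

def plane_sweep_coords(K_ref, K_src, T_src_ref, depths, height, width):
    """Project every reference pixel into the source view at each depth plane.

    K_ref, K_src: (3, 3) pinhole intrinsics (assumed camera model).
    T_src_ref:    (4, 4) transform from reference to source camera coordinates.
    Returns an array of shape (D, H, W, 2) holding (x, y) source pixel coordinates.
    """
    # Homogeneous pixel grid of the reference image, shape (3, H*W).
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(height * width)]).astype(np.float64)

    rays = np.linalg.inv(K_ref) @ pix            # back-projected rays at unit depth
    R, t = T_src_ref[:3, :3], T_src_ref[:3, 3:]  # rotation (3, 3), translation (3, 1)

    coords = []
    for d in depths:
        pts_src = R @ (rays * d) + t             # 3D points in the source camera frame
        proj = K_src @ pts_src
        uv = proj[:2] / proj[2:3]                # perspective divide
        coords.append(uv.T.reshape(height, width, 2))
    return np.stack(coords)
```

Source features sampled at these coordinates (for example with bilinear interpolation) would then be compared against the reference features, one depth plane at a time.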
The key to this research is to inject readily available metadata into the cost volume alongside the usual deep image features, giving the network access to useful information such as geometry and relative camera pose. Figure 3 shows the feature volume construction in detail. By integrating this previously untapped information, the model significantly outperforms previous methods in depth prediction without expensive 4D cost volumes, complex temporal fusion, or Gaussian processes.
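The idea of injecting metadata can be sketched as follows. This is a simplified illustration, not the construction from Figure 3: the choice of metadata (plane depth and a scalar `pose_distance` per source view), the single linear layer with ReLU standing in for a learned reduction, and the sum over views are all assumptions made for this example.

```python
import numpy as np

def build_metadata_volume(dot_costs, depths, pose_distance, weights, bias):
    """Combine matching scores with per-plane and per-view metadata.

    dot_costs:     (N, D, H, W) matching score per source view and depth plane.
    depths:        (D,) hypothesised plane depths.
    pose_distance: (N,) hypothetical scalar relative-pose measure per source view.
    weights, bias: (3, C_out) and (C_out,), a stand-in for a learned reduction.
    Returns a feature volume of shape (C_out, D, H, W).
    """
    n, d, h, w = dot_costs.shape
    # Broadcast the metadata so every (view, plane, pixel) entry carries it.
    depth_feat = np.broadcast_to(depths[None, :, None, None], (n, d, h, w))
    pose_feat = np.broadcast_to(pose_distance[:, None, None, None], (n, d, h, w))

    feats = np.stack([dot_costs, depth_feat, pose_feat], axis=-1)  # (N, D, H, W, 3)
    reduced = np.maximum(feats @ weights + bias, 0.0)              # linear + ReLU
    # Sum over source views (an assumption) and move channels to the front.
    return reduced.sum(axis=0).transpose(3, 0, 1, 2)
```

The resulting volume stays four-dimensional at most per sample, so it can be processed by 2D convolutions by treating the depth planes as channels.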
The study was implemented in PyTorch and uses EfficientNetV2 S as the backbone, with a decoder similar to UNet. In addition, the first two blocks of ResNet18 are used for matching feature extraction. The optimizer is AdamW, and training took 36 hours on two 40GB A100 GPUs.
Network architecture design
The network is based on a 2D convolutional encoder-decoder architecture. In building such a network, the study found several design choices that significantly improve depth prediction accuracy, mainly the following:
Baseline cost volume fusion: Although RNN-based temporal fusion methods are often used, they significantly increase the complexity of the system. Instead, the study keeps cost volume fusion as simple as possible and finds that simply summing the dot-product matching costs between the reference view and each source view gives results that are competitive with SOTA depth estimation.
Image encoder and feature matching encoder: Previous research has shown that the image encoder is very important for depth estimation, in both monocular and multi-view settings. For example, DeepVideoMVS uses MnasNet as the image encoder, which has relatively low latency. The study recommends a small but more powerful EfficientNetV2 S encoder, which significantly improves depth estimation accuracy, although at the cost of an increased number of parameters and a 10% reduction in execution speed.
Fusing multi-scale image features into the cost volume encoder: In 2D CNN-based deep stereo and multi-view stereo, image features are usually combined with the cost volume output at a single scale. More recently, DeepVideoMVS proposed concatenating deep image features at multiple scales, adding skip connections between the image encoder and the cost volume encoder at all resolutions. This helps LSTM-based fusion networks, and the study found that it is also important for its own architecture.
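The baseline cost volume fusion described above, summing dot-product matching costs over source views, can be sketched in a few lines of NumPy. The function name and the tensor layout are assumptions for illustration, and the source features are assumed to have already been warped onto each depth plane of the reference view.

```python
import numpy as np

def dot_product_cost_volume(ref_feat, warped_src_feats):
    """Sum dot-product matching costs between the reference and each source view.

    ref_feat:         (C, H, W) matching features of the reference image.
    warped_src_feats: (N, D, C, H, W) source features warped to D depth planes.
    Returns a cost volume of shape (D, H, W).
    """
    # Dot product over the channel axis for every view, plane, and pixel.
    per_view = np.einsum("chw,ndchw->ndhw", ref_feat, warped_src_feats)
    # Fuse the views by simple addition, with no recurrent or learned fusion.
    return per_view.sum(axis=0)
```

Because the output has one channel per depth plane, it can be fed directly to a 2D convolutional encoder-decoder.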
This study trained and evaluated the proposed method on the 3D scene reconstruction dataset ScanNetv2. Table 1 below uses the metrics proposed by Eigen et al. (2014) to evaluate the depth prediction performance of several network models.
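For readers unfamiliar with these metrics, the following sketch computes the depth error measures commonly attributed to Eigen et al. (2014). Which of them appear in Table 1 is not stated here, so the selection below is an assumption; the function name and the convention of masking out pixels with zero ground-truth depth are also choices made for this example. Predicted depths are assumed to be strictly positive.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth error metrics over pixels with valid ground truth (gt > 0)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0                      # assumed convention: 0 marks missing depth
    pred, gt = pred[valid], gt[valid]

    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": np.mean(np.abs(pred - gt) / gt),
        "sq_rel": np.mean((pred - gt) ** 2 / gt),
        "rmse": np.sqrt(np.mean((pred - gt) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "delta_1.25": np.mean(ratio < 1.25),   # fraction of pixels within 25%
    }
```

Lower is better for the four error terms, while a higher `delta_1.25` indicates that more pixels fall within the accuracy threshold.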
Surprisingly, although the model proposed in this study uses no 3D convolutions, it outperforms all baseline models on the depth prediction metrics. Furthermore, baseline models that do not use metadata encoding also outperform previous methods, indicating that a well-designed and well-trained 2D network is sufficient for high-quality depth estimation. Figures 4 and 5 below show qualitative results for depth and normals.
This study used the standard protocol established by TransformerFusion for 3D reconstruction evaluation. The results are shown in Table 2 below.
For online and interactive 3D reconstruction applications, keeping latency low is critical. Table 3 below shows the total computation time per frame for each model given a new RGB frame.
To verify the effectiveness of each component of the proposed method, the researchers conducted ablation experiments; the results are shown in Table 4 below.
Interested readers can read the original text of the paper to learn more about the research details.