To meet the growing demand for 3D creative tools in the Metaverse, 3D scene generation has received considerable attention recently. At the core of 3D content creation is inverse graphics, which aims to recover 3D representations from 2D observations. Given the cost and labor required to create 3D assets, the ultimate goal of 3D content creation is to learn 3D generative models from the vast amount of 2D images on the Internet. Recent work on 3D-aware generative models has addressed this problem to some extent, with most of it leveraging 2D image data to generate object-centric content (e.g., faces, human bodies, or objects). However, the observation space of such generation tasks is bounded, and the generated targets occupy only a limited region of 3D space. This raises a question: can we learn 3D generative models of unbounded scenes from massive Internet 2D images, for example a vivid natural landscape that can cover an arbitrarily large area and expand indefinitely (as shown below)?
In this article, researchers from S-Lab of Nanyang Technological University propose SceneDreamer, a new framework for learning generative models of unbounded 3D scenes from massive unlabeled natural images. By sampling scene noise and style noise, SceneDreamer can render natural scenes in diverse styles while maintaining a high degree of 3D consistency, allowing the camera to roam freely through the scene.
To achieve such a goal, we face the following three challenges:
1) Unbounded scenes lack an efficient 3D representation: unbounded scenes can occupy an arbitrarily large Euclidean space, which makes an efficient and expressive underlying 3D representation essential.
2) Lack of content alignment: existing 3D generation work uses datasets with alignment properties (such as faces, human bodies, or common objects). The target objects in these bounded scenes usually share similar semantics and similar scales, positions, and orientations. In contrast, in massive unlabeled 2D images, different objects or scenes often have very different semantics and widely varying scales, positions, and orientations. This lack of alignment leads to instability when training generative models.
3) Lack of camera pose priors: 3D generative models rely on accurate camera poses or camera pose distributions as priors to perform the inverse rendering from images to 3D representations. However, natural images on the Internet come from different scenes and image sources, so we cannot obtain accurate camera poses or pose priors for them.
To this end, we propose SceneDreamer, a principled adversarial learning framework that learns to generate unbounded 3D scenes from massive unlabeled natural images. The framework consists of three main modules: 1) an efficient and expressive bird's-eye-view (BEV) 3D scene representation; 2) a generative neural hash grid that learns a scene representation that generalizes across scenes; 3) a style-driven volumetric renderer. The whole framework is trained directly on 2D images through adversarial learning.
The figure above shows the main architecture of SceneDreamer. At inference time, we randomly sample a simplex noise that defines the scene structure and a Gaussian noise that defines the scene style, and the model renders a large-scale 3D scene while supporting free camera movement. First, we obtain from the scene noise a BEV scene representation consisting of a height map and a semantic map. Then the BEV representation is used to explicitly construct a local 3D scene window for camera sampling, and is also encoded into scene features. We use the coordinates of the sampled points together with the scene features to query the high-dimensional space encoded by a generative neural hash grid, obtaining latent features that vary across space and across scenes. Finally, a volume renderer modulated by the style noise integrates these latent features along each camera ray to produce the rendered 2D image.
To learn unbounded 3D scene generation, the scene should be represented efficiently and with high quality. We propose to represent a large-scale 3D scene with a BEV representation consisting of a height map and a semantic map. Specifically, we obtain the bird's-eye-view height map and semantic map from the scene noise through a non-parametric map construction method. The height map records the height of the scene surface points, while the semantic map records the semantic labels of the corresponding points. This BEV representation, composed of a semantic map and a height map, can: 1) represent a 3D scene with n^2 complexity; 2) provide the semantic label of each 3D point, which addresses the content alignment problem; 3) support sliding-window synthesis of infinite scenes, avoiding the generalization issues caused by a fixed scene resolution during training.
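To make the BEV idea concrete, here is a minimal toy sketch in NumPy. It builds an n x n height map from cheap multi-octave value noise (a stand-in for the simplex noise used by the method) and derives a semantic map from simple height thresholds; the threshold rule is purely illustrative and is not the paper's non-parametric map construction. The point is that the pair of 2D maps encodes the scene footprint at n^2 cost.

```python
import numpy as np

# Toy BEV scene representation: a height map plus a semantic map, both n x n.
n = 512
rng = np.random.default_rng(0)

# Cheap multi-octave value noise as a stand-in for simplex noise.
height = np.zeros((n, n), dtype=np.float32)
for octave in range(1, 6):
    size = 2 ** octave
    coarse = rng.random((size, size)).astype(np.float32)
    # Blocky upsampling of the coarse grid to n x n.
    height += np.kron(coarse, np.ones((n // size, n // size), dtype=np.float32)) / octave
height /= height.max()

# Semantic map from simple height thresholds (illustrative rule only).
# 0 = water, 1 = sand, 2 = grass, 3 = rock, 4 = snow
semantic = np.digitize(height, bins=[0.35, 0.42, 0.65, 0.85]).astype(np.int64)

# The pair (height, semantic) covers an n x n footprint at O(n^2) memory,
# versus O(n^3) for a dense voxel grid.
print(height.shape, semantic.shape, np.bincount(semantic.ravel()))
```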
To obtain a 3D representation that generalizes across scenes, we need to encode the spatial 3D scene representation into a latent space that is amenable to adversarial training. Note that for a large-scale unbounded scene, usually only the visible surface points matter for rendering, which means its parameterization should be compact and sparse. Existing representations such as tri-planes or 3D convolutions model the space as a whole, so a large amount of model capacity is wasted on invisible points. Inspired by the success of neural hash grids on 3D reconstruction tasks, we generalize their spatial compactness and efficiency to the generative setting and propose a generative neural hash grid to model 3D spatial features across scenes. Specifically, a hash function F_theta maps the scene features f_s and the spatial point coordinates x to learnable parameters stored in multi-scale hash tables.
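The following is a minimal PyTorch sketch of that idea: the hash index depends on both the 3D point and a scene code, so a single multi-resolution table can serve many scenes. The table size, the primes, the way the scene code enters the hash, and the nearest-corner lookup (the real grid interpolates between corners) are all illustrative assumptions, not the paper's exact parameterization.

```python
import torch

class GenerativeHashGrid(torch.nn.Module):
    def __init__(self, n_levels=8, table_size=2**16, feat_dim=2,
                 base_res=16, growth=1.5):
        super().__init__()
        self.n_levels, self.table_size = n_levels, table_size
        self.res = [int(base_res * growth**l) for l in range(n_levels)]
        # One learnable feature table per resolution level.
        self.tables = torch.nn.Parameter(
            0.01 * torch.randn(n_levels, table_size, feat_dim))
        # Large primes for the spatial XOR hash (as in Instant-NGP),
        # plus one extra prime to mix in the scene code.
        self.register_buffer(
            "primes", torch.tensor([1, 2654435761, 805459861, 3674653429]))

    def forward(self, x, f_s):
        # x:   (N, 3) points in [0, 1]^3
        # f_s: (N,)   integer scene codes (a quantized stand-in for scene features)
        feats = []
        for l, res in enumerate(self.res):
            cell = (x * res).floor().long()                     # (N, 3) grid cell
            h = (cell * self.primes[:3]).unbind(-1)
            idx = h[0] ^ h[1] ^ h[2] ^ (f_s * self.primes[3])   # mix in scene code
            idx = idx % self.table_size
            feats.append(self.tables[l, idx])                   # (N, feat_dim)
        return torch.cat(feats, dim=-1)                         # (N, n_levels*feat_dim)

grid = GenerativeHashGrid()
pts = torch.rand(1024, 3)
scene_code = torch.randint(0, 1000, (1024,))
print(grid(pts, scene_code).shape)  # torch.Size([1024, 16])
```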
To ensure 3D consistency in rendering, we use a volume-rendering-based network to map 3D spatial features to 2D images. For each point on a camera ray, we query the generative hash grid for its feature f_x and use a multi-layer MLP modulated by the style noise to predict the point's color and volume density; finally, volume rendering integrates all points along the camera ray into the color of the corresponding pixel.
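The integration step itself is the standard volume-rendering quadrature used in NeRF-style methods. The sketch below shows it in isolation, with random colors and densities standing in for the outputs of the style-modulated MLP.

```python
import torch

def composite(rgb, sigma, t_vals):
    # rgb:    (R, S, 3) per-sample colors along each ray
    # sigma:  (R, S)    per-sample volume densities (non-negative)
    # t_vals: (R, S)    sample depths along each ray
    deltas = t_vals[:, 1:] - t_vals[:, :-1]                        # (R, S-1)
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                       # per-sample opacity
    # Transmittance: probability that the ray reaches each sample unoccluded.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alpha * trans                                        # (R, S)
    color = (weights.unsqueeze(-1) * rgb).sum(dim=1)               # (R, 3) pixel color
    depth = (weights * t_vals).sum(dim=1)                          # expected depth
    return color, depth

R, S = 4, 64                       # 4 rays, 64 samples per ray
t_vals = torch.linspace(2.0, 6.0, S).expand(R, S)
rgb = torch.rand(R, S, 3)          # placeholder for the MLP color output
sigma = torch.rand(R, S) * 5.0     # placeholder for the MLP density output
color, depth = composite(rgb, sigma, t_vals)
print(color.shape, depth.shape)    # torch.Size([4, 3]) torch.Size([4])
```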
The entire framework is trained end-to-end on 2D images through adversarial learning. The generator is the volume renderer described above; for the discriminator, we use a semantic-aware discriminative network that distinguishes real images from rendered ones, conditioned on the semantic map projected from the BEV representation into the camera view. Please refer to our paper for more details.
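As a rough illustration of what semantic-aware conditioning can look like, the toy sketch below concatenates a one-hot semantic map with the image channels before a small patch discriminator and applies a non-saturating GAN loss. The tiny network and the exact conditioning and loss choices are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn.functional as F

N_CLASSES = 5  # number of semantic labels in the toy example

# Tiny patch discriminator that sees image channels + one-hot semantic channels.
disc = torch.nn.Sequential(
    torch.nn.Conv2d(3 + N_CLASSES, 64, 4, stride=2, padding=1),
    torch.nn.LeakyReLU(0.2),
    torch.nn.Conv2d(64, 128, 4, stride=2, padding=1),
    torch.nn.LeakyReLU(0.2),
    torch.nn.Conv2d(128, 1, 4, stride=2, padding=1),  # patch logits
)

def d_input(image, semantic):
    # image: (B, 3, H, W); semantic: (B, H, W) integer labels projected to the view.
    onehot = F.one_hot(semantic, N_CLASSES).permute(0, 3, 1, 2).float()
    return torch.cat([image, onehot], dim=1)

B, H, W = 2, 64, 64
real_img, fake_img = torch.rand(B, 3, H, W), torch.rand(B, 3, H, W)
semantic = torch.randint(0, N_CLASSES, (B, H, W))

# Discriminator step: real -> 1, rendered -> 0.
d_real = disc(d_input(real_img, semantic))
d_fake = disc(d_input(fake_img.detach(), semantic))
d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
          F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

# Generator (volume renderer) step: fool the discriminator.
g_logits = disc(d_input(fake_img, semantic))
g_loss = F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
print(d_loss.item(), g_loss.item())
```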
After training, we can generate diverse 3D scenes by randomly sampling scene noise and style noise, with good depth information and 3D consistency, and with support for rendering along free camera trajectories:
Through the sliding-window inference mode, we can generate ultra-large unbounded 3D scenes that far exceed the spatial resolution used in training. The figure below shows a scene at 10 times the training spatial resolution, with smooth interpolation in both the scene and style dimensions:
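Conceptually, sliding-window inference slides the training-sized local window over a much larger BEV map. The toy sketch below illustrates only that tiling idea, with scaled-down sizes and a trivial placeholder in place of the actual volumetric renderer; the real method renders camera views from each local 3D window rather than stitching a top-down canvas.

```python
import numpy as np

def render_window(height_crop, semantic_crop):
    # Trivial stand-in for the actual renderer: shade the height map.
    return np.stack([height_crop] * 3, axis=-1)

big, window = 1024, 256            # BEV map much larger than the training window
height = np.random.rand(big, big).astype(np.float32)
semantic = np.random.randint(0, 5, (big, big))

canvas = np.zeros((big, big, 3), dtype=np.float32)
for y in range(0, big, window):
    for x in range(0, big, window):
        crop_h = height[y:y + window, x:x + window]
        crop_s = semantic[y:y + window, x:x + window]
        canvas[y:y + window, x:x + window] = render_window(crop_h, crop_s)
print(canvas.shape)  # the stitched result covers the full, larger BEV footprint
```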
In addition to such smooth interpolation results, our framework also supports a decoupled mode, i.e., fixing either the scene or the style and interpolating only the other, which reflects the semantic richness of the latent space:
To verify the 3D consistency of our method, we also render arbitrary scenes along circular camera trajectories and run COLMAP for 3D reconstruction. The well-formed scene point clouds and matching camera poses that are recovered show that our method can generate diverse 3D scenes while maintaining 3D consistency:
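One way to reproduce this kind of consistency check is COLMAP's standard structure-from-motion pipeline. The sketch below uses the pycolmap bindings with placeholder paths; the authors may have used the COLMAP CLI or different settings.

```python
import os
import pycolmap

image_dir = "renders/circular_trajectory"   # frames rendered along a circular path
output_dir = "colmap_out"
os.makedirs(output_dir, exist_ok=True)
database = os.path.join(output_dir, "database.db")

pycolmap.extract_features(database, image_dir)                       # SIFT features
pycolmap.match_exhaustive(database)                                  # pairwise matching
maps = pycolmap.incremental_mapping(database, image_dir, output_dir) # incremental SfM

# A reconstruction that registers most frames, recovers a dense point cloud, and
# places the poses on a smooth circle indicates 3D-consistent rendered views.
print(maps[0].summary())
```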
This work proposes SceneDreamer, a model for generating unbounded 3D scenes from massive 2D images. It can synthesize diverse large-scale 3D scenes from noise while maintaining 3D consistency and supporting free camera trajectories. We hope this work opens up a new direction for the game industry, virtual reality, and the metaverse ecosystem. Please refer to our project homepage for more details.