Large generative AI models are the focus of OpenAI's efforts. The company has already released the text-to-image models DALL-E and DALL-E 2, as well as Point·E, a model announced late last year that generates 3D point clouds from text.
Recently, the OpenAI research team upgraded its 3D generation work and released Shap·E, a conditional generative model for synthesizing 3D assets. The model weights, inference code, and samples have been open-sourced.
Let's look at the results first. As with text-to-image generation, the 3D objects Shap·E produces lean toward the unconstrained. For example, an airplane that looks like a banana:
A chair that looks like a tree:
There are also classic examples, like the avocado chair:
Of course, it can also generate 3D models of everyday objects, such as a bowl of vegetables:
Donuts:
Shap·E is a latent diffusion model over the space of 3D implicit functions, whose outputs can be rendered both as NeRFs and as textured meshes. Given the same dataset, model architecture, and training compute, Shap·E outperforms comparable explicit generative models. The researchers found that the pure text-conditional model can generate diverse and interesting objects, which demonstrates the potential of generating implicit representations.
Unlike 3D generative models that produce a single output representation, Shap·E directly generates the parameters of an implicit function. Training proceeds in two stages: first an encoder is trained to deterministically map 3D assets to the parameters of an implicit function, and then a conditional diffusion model is trained on the encoder's outputs. When trained on a large dataset of paired 3D and text data, the model can generate complex and diverse 3D assets in seconds. Compared with Point·E, an explicit generative model over point clouds, Shap·E models a higher-dimensional, multi-representation output space, converges faster, and reaches comparable or better sample quality.
Research background

The paper focuses on two implicit neural representations (INRs) for 3D:

1. NeRF, which represents a scene as a function mapping 3D coordinates and viewing directions to densities and colors;

2. STF (a signed distance function combined with a texture field), which maps coordinates to signed distances and colors and can be converted into a textured mesh.
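To make the first of these concrete, here is a minimal NeRF-style implicit function in PyTorch. It is only an illustrative sketch: the layer sizes and two-head design are assumptions, not the architecture used in the paper. The point is that the 3D asset lives entirely in the network's weights, and rendering amounts to querying the network at spatial coordinates.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Toy implicit function: maps a 3D coordinate (plus view direction)
    to an RGB color and a density. The asset is encoded entirely in the
    parameters of this network."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)        # sigma
        self.color_head = nn.Linear(hidden + 3, 3)      # RGB, conditioned on view direction

    def forward(self, xyz: torch.Tensor, view_dir: torch.Tensor):
        h = self.backbone(xyz)
        sigma = torch.relu(self.density_head(h))
        rgb = torch.sigmoid(self.color_head(torch.cat([h, view_dir], dim=-1)))
        return rgb, sigma

# Rendering a view boils down to querying the function along camera rays.
nerf = TinyNeRF()
points = torch.rand(1024, 3)      # sample points along rays
dirs = torch.rand(1024, 3)        # corresponding view directions
rgb, sigma = nerf(points, dirs)
```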
While INRs are flexible and expressive, obtaining one for every sample in a dataset is expensive. Moreover, each INR may have many numerical parameters, which can cause difficulties when training downstream generative models. One line of work addresses these problems with autoencoders that have implicit decoders, yielding smaller latent representations that can be modeled directly with existing generative techniques. An alternative is to use meta-learning to create a dataset of INRs that share most of their parameters and then train a diffusion model or normalizing flow on the free parameters of those INRs. It has also been suggested that gradient-based meta-learning may not be necessary at all; instead, a Transformer encoder can be trained directly to produce NeRF parameters conditioned on multiple views of a 3D object.
The researchers combine and extend these approaches to obtain Shap·E, a conditional generative model for diverse, complex 3D implicit representations. A Transformer-based encoder is first trained to produce INR parameters for 3D assets, and a diffusion model is then trained on the encoder's outputs. Unlike prior approaches, the generated INRs represent both NeRFs and meshes, so they can be rendered in multiple ways or imported into downstream 3D applications.
When trained on a dataset of millions of 3D assets, the model produces a wide variety of recognizable samples from text prompts. Shap·E converges faster than Point·E, a recently proposed explicit 3D generative model, and achieves comparable or better results with the same model architecture, dataset, and conditioning mechanism.
The researchers first train an encoder to produce implicit representations and then train a diffusion model on the latent representations the encoder produces. The method consists of the following two steps:
1. Train an encoder to produce the parameters of an implicit function given a dense explicit representation of a known 3D asset. The encoder outputs a latent representation of the asset, which is then linearly projected to obtain the weights of a multilayer perceptron (MLP);
2. Apply the encoder to the dataset and train a diffusion prior on the resulting latents. The model is conditioned on images or text descriptions.
We trained all models on a large dataset of 3D assets using corresponding renderings, point clouds, and text captions.
3D Encoder
The encoder architecture is shown in Figure 2 below.
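Figure 2 itself is not reproduced here, but the key mechanism, a latent sequence linearly projected into the weights of an implicit MLP, can be sketched as below. This is a deliberately shrunken toy: the real encoder ingests point clouds and rendered views, produces a 1024×1024 latent, and maps each latent row to a row of an MLP weight matrix, whereas this sketch flattens a small latent through a handful of projections for brevity.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the real model uses a 1024 x 1024 latent.
NUM_TOKENS, TOKEN_DIM, HIDDEN = 8, 64, 8

class LatentToImplicitMLP(nn.Module):
    """Projects a latent token sequence into the weights of a tiny implicit
    MLP (3D coordinate -> occupancy) and evaluates it at query points."""

    def __init__(self):
        super().__init__()
        flat_dim = NUM_TOKENS * TOKEN_DIM
        self.proj_w1 = nn.Linear(flat_dim, 3 * HIDDEN)   # first-layer weights
        self.proj_b1 = nn.Linear(flat_dim, HIDDEN)       # first-layer bias
        self.proj_w2 = nn.Linear(flat_dim, HIDDEN)       # second-layer weights
        self.proj_b2 = nn.Linear(flat_dim, 1)            # second-layer bias

    def forward(self, latents: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # latents: (B, NUM_TOKENS, TOKEN_DIM), xyz: (B, M, 3)
        flat = latents.flatten(1)
        w1 = self.proj_w1(flat).view(-1, 3, HIDDEN)
        b1 = self.proj_b1(flat).unsqueeze(1)
        w2 = self.proj_w2(flat).view(-1, HIDDEN, 1)
        b2 = self.proj_b2(flat).unsqueeze(1)
        h = torch.relu(torch.bmm(xyz, w1) + b1)          # (B, M, HIDDEN)
        return torch.sigmoid(torch.bmm(h, w2) + b2)      # (B, M, 1) occupancy

decoder = LatentToImplicitMLP()
latents = torch.randn(2, NUM_TOKENS, TOKEN_DIM)  # would come from the 3D encoder
queries = torch.rand(2, 4096, 3)                 # points at which to evaluate the asset
occupancy = decoder(latents, queries)            # (2, 4096, 1)
```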
Latent diffusion
The generative model adopts the transformer-based diffusion architecture of Point·E but replaces point clouds with sequences of latent vectors. Each latent has shape 1024×1024 and is fed to the transformer as a sequence of 1024 tokens, where each token corresponds to a different row of the MLP weight matrix. The model is therefore roughly computationally equivalent to the base Point·E model (i.e., it has the same context length and width); on top of this, input and output channels are added to generate samples in a higher-dimensional space.
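A rough, assumption-heavy sketch of one training step of such a latent diffusion prior is shown below, shrunk to toy sizes. The denoiser is a generic transformer encoder with a timestep token and a single conditioning token (e.g., a CLIP text or image embedding), and the noise schedule is simplified; none of this is Point·E's or Shap·E's exact implementation.

```python
import torch
import torch.nn as nn

# Toy sizes; the real latent is a sequence of 1024 tokens of width 1024.
NUM_TOKENS, WIDTH = 64, 128

class LatentDenoiser(nn.Module):
    """Transformer that predicts the noise added to a latent sequence,
    given a timestep token and a conditioning token."""

    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=WIDTH, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.time_embed = nn.Sequential(nn.Linear(1, WIDTH), nn.SiLU(), nn.Linear(WIDTH, WIDTH))
        self.out = nn.Linear(WIDTH, WIDTH)

    def forward(self, noisy_latents, t, cond):
        # noisy_latents: (B, NUM_TOKENS, WIDTH), t: (B, 1), cond: (B, 1, WIDTH)
        t_token = self.time_embed(t).unsqueeze(1)
        tokens = torch.cat([cond, t_token, noisy_latents], dim=1)
        hidden = self.transformer(tokens)
        return self.out(hidden[:, 2:])               # predicted noise for the latent tokens

# One epsilon-prediction training step on encoder-produced latents.
model = LatentDenoiser()
latents = torch.randn(8, NUM_TOKENS, WIDTH)          # from the frozen 3D encoder
cond = torch.randn(8, 1, WIDTH)                      # e.g. a CLIP text/image embedding
t = torch.rand(8, 1)                                 # normalized timestep in [0, 1]
alpha = torch.cos(0.5 * torch.pi * t).view(-1, 1, 1) # simplified noise schedule
noise = torch.randn_like(latents)
noisy = alpha * latents + (1 - alpha ** 2).sqrt() * noise
loss = nn.functional.mse_loss(model(noisy, t, cond), noise)
loss.backward()
```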
Encoder evaluation
The researchers tracked two rendering-based metrics throughout encoder training. The first is the peak signal-to-noise ratio (PSNR) between reconstructed renders and ground-truth renders. Additionally, to measure how well the encoder captures the semantically relevant details of a 3D asset, they encode the meshes produced by the largest Point·E model and re-evaluate the CLIP R-Precision of the reconstructed NeRF and STF renderings.
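For reference, PSNR between a reconstructed render and the ground-truth render is computed from the mean squared error; the snippet below assumes image tensors scaled to [0, 1].

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio (in dB) between two images in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Example: compare a reconstructed 128x128 RGB render against the reference render.
reconstruction = torch.rand(3, 128, 128)
reference = torch.rand(3, 128, 128)
print(f"PSNR: {psnr(reconstruction, reference):.2f} dB")
```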
Table 1 below tracks these two metrics across different training stages. Distillation hurts NeRF reconstruction quality, while fine-tuning not only restores but slightly improves NeRF quality and greatly improves STF rendering quality.
Comparison with Point·E
The proposed latent diffusion model has the same architecture, training dataset, and conditioning modes as Point·E, so comparing against Point·E helps isolate the effect of generating implicit neural representations rather than explicit ones. Figure 4 below compares the methods on sample-based evaluation metrics.
Qualitative samples are shown in Figure 5 below; the models often generate samples of varying quality for the same text prompt. Before the end of training, the text-conditional Shap·E begins to get worse on the evaluations.
The researchers found that Shap·E and Point·E tend to share similar failure cases, as shown in Figure 6(a) below. This suggests that training data, model architecture, and conditioning images have a greater impact on generated samples than the chosen representation space.
There are still some qualitative differences between the two image-conditional models. For example, in the first row of Figure 6(b) below, Point·E ignores the small gaps in the bench, while Shap·E attempts to model them. The researchers hypothesize that this particular discrepancy arises because point clouds represent thin features and gaps poorly. Table 1 also shows that the 3D encoder slightly reduces CLIP R-Precision when applied to Point·E samples.
Comparison with other methods
In Table 2 below, the researchers compare Shap·E against a wider range of 3D generation techniques on the CLIP R-Precision metric.
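CLIP R-Precision is typically computed by rendering each generated asset and checking whether CLIP ranks the asset's own prompt highest among all evaluation prompts. The snippet below is an illustrative version using the Hugging Face transformers CLIP wrapper; the prompts, dummy images, and checkpoint choice are placeholders rather than the paper's exact evaluation setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder evaluation set: one rendered view per prompt.
prompts = ["a chair that looks like an avocado", "a bowl of vegetables", "a donut"]
renders = [Image.new("RGB", (224, 224)) for _ in prompts]  # stand-ins; use real renders in practice

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    inputs = processor(text=prompts, images=renders, return_tensors="pt", padding=True)
    logits_per_image = model(**inputs).logits_per_image    # (num_images, num_prompts)

# A render "succeeds" if CLIP ranks its own prompt first among all prompts.
top1 = logits_per_image.argmax(dim=-1)
r_precision = (top1 == torch.arange(len(prompts))).float().mean()
print(f"CLIP R-Precision: {r_precision:.3f}")
```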
Limitations and prospects
Although Shap·E produces recognizable 3D assets, the results often look grainy or lack fine detail. Figure 3 below shows that the encoder sometimes loses detailed textures (such as the stripes on a cactus), suggesting that an improved encoder could recover some of the lost generation quality.
Please refer to the original paper for more technical and experimental details.