The introduction of 2D diffusion models greatly simplified image creation and brought innovation to the 2D design industry. In recent years, diffusion models have expanded into 3D creation, reducing labor costs in applications such as VR, AR, robotics, and gaming. Many studies explore the use of pre-trained 2D diffusion models together with NeRF representations optimized via a Score Distillation Sampling (SDS) loss. However, SDS-based methods usually require hours of per-asset optimization and often produce geometric artifacts such as the multi-face Janus problem.
3D diffusion models, on the other hand, avoid lengthy per-asset optimization and naturally support diverse generation. These methods usually require ground-truth 3D models or point clouds for training, and such data is difficult to obtain for real images. Moreover, current 3D diffusion methods are typically trained in two stages, which, on unfiltered and highly diverse 3D datasets, produces a latent space that is blurry and hard to denoise, making high-quality rendering an open challenge.
To address this, some researchers have proposed single-stage models, but most of them target only specific, simple categories and generalize poorly.
The goal of the researchers in this paper is therefore fast, realistic, and general 3D generation. To this end, they propose DMV3D, a new single-stage, category-agnostic diffusion model that generates 3D NeRFs directly from text or a single input image. In just 30 seconds on a single A100 GPU, DMV3D can generate a variety of high-fidelity 3D assets.
Specifically, DMV3D is a 2D multi-view image diffusion model that integrates 3D NeRF reconstruction and rendering into its denoiser and is trained end to end without direct 3D supervision. This avoids both the problems of separately training a 3D NeRF encoder for latent-space diffusion (as in two-stage models) and the tedious per-asset optimization of SDS-style methods.
At its core, the method performs 3D reconstruction within a 2D multi-view diffusion framework. The approach is inspired by RenderDiffusion, a method for 3D generation via single-view diffusion. However, RenderDiffusion requires category-specific priors and particular object poses in its training data, so it generalizes poorly and cannot generate 3D content for arbitrary objects.
In contrast, the researchers argue that a sparse set of four surrounding views is enough to describe a 3D object without occlusion. This design draws on human spatial imagination: people can construct a complete 3D object from a few planar views taken around it, and this mental reconstruction is usually quite precise and concrete.
Using such input, however, still requires solving sparse-view 3D reconstruction, a long-standing problem that is challenging even when the inputs are noise-free. The method supports 3D generation conditioned on a single image or on text. For image input, one sparse view is fixed as the noise-free conditioning view while the other views are denoised, similar to 2D image inpainting. For text-based 3D generation, the researchers use attention-based text conditioning and classifier-free guidance, as is common in 2D diffusion models.
Training uses only image-space supervision on a large dataset combining rendered images from Objaverse and real captures from MVImgNet. According to the results, DMV3D reaches SOTA in single-image 3D reconstruction, surpassing previous SDS-based methods and 3D diffusion models, and its text-to-3D generation also outperforms prior methods.
Paper address: https://arxiv.org/pdf/2311.09217.pdf
How to train and run inference with a single-stage 3D diffusion model?
The researchers first introduce a new diffusion framework that uses a reconstruction-based denoiser to denoise noisy multi-view images for 3D generation. Second, they propose a new LRM-based multi-view denoiser, conditioned on the diffusion time step, that progressively denoises multi-view images through 3D NeRF reconstruction and rendering. Finally, the diffusion model is extended to support text and image conditioning for controllable generation.
Multi-view diffusion and denoising
Multi-view diffusion. The x_0 distribution handled by a standard 2D diffusion model is the distribution of single images in a dataset. Here, the researchers instead consider the joint distribution of multi-view images I = {I_1, ..., I_N}, where each set consists of observations of the same 3D scene (asset) from viewpoints C = {c_1, ..., c_N}. The diffusion process applies the same noise schedule to each image independently, as given in equation (1) of the paper.
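Equation (1) is not reproduced in this article. As a sketch of the standard formulation implied by the text (not copied from the paper), the forward process applies the usual DDPM noising to each view independently with a shared schedule α̅_t:

```latex
q\!\left(\{I_t^{(i)}\}_{i=1}^{N} \,\middle|\, \{I_0^{(i)}\}_{i=1}^{N}\right)
  = \prod_{i=1}^{N}
    \mathcal{N}\!\left(I_t^{(i)};\; \sqrt{\bar{\alpha}_t}\, I_0^{(i)},\; (1-\bar{\alpha}_t)\,\mathbf{I}\right)
```

All N views of the same asset share the same time step t, which is what allows a single 3D reconstruction to explain them jointly.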
Reconstruction-based denoising. The reverse of the 2D diffusion process is essentially denoising. Here, the researchers propose using 3D reconstruction and rendering to denoise the 2D multi-view images while outputting a clean 3D model for 3D generation. Specifically, a 3D reconstruction module E(·) reconstructs the 3D representation S from the noisy multi-view images, and a differentiable rendering module R(·) renders the denoised images, as given in formula (2) of the paper.
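Formula (2) is also not reproduced here. Based on the definitions just given, the reconstruction-based denoiser can be sketched as reconstructing a 3D representation from the noisy views and rendering it back to the same viewpoints:

```latex
S = E\!\left(\{I_t^{(i)}\}_{i=1}^{N},\; t,\; C\right), \qquad
\hat{I}_0^{(i)} = R\!\left(S,\; c_i\right), \quad i = 1, \dots, N
```

The rendered views then play the role of the x_0 prediction in a standard diffusion sampler.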
Reconstruction-based multi-view denoiser
The researchers build the multi-view denoiser on LRM, using a large transformer to reconstruct a clean triplane NeRF from noisy sparse-view posed images; the renderings of the reconstructed triplane NeRF then serve as the denoised output.
Reconstruction and rendering. As shown in Figure 3, a Vision Transformer (DINO) converts the input images into 2D tokens, and a transformer then maps learned triplane positional embeddings to the final triplane, which represents the 3D shape and appearance of the asset. The predicted triplane is decoded into volume density and color by an MLP for differentiable volume rendering.
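To make this step concrete, below is a minimal PyTorch sketch of how a predicted triplane can be decoded into density and color and volume-rendered into a pixel. The feature dimension, MLP size, and sampling bounds are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: triplane feature lookup -> tiny MLP -> volume rendering.
import torch
import torch.nn.functional as F
from torch import nn

class TriplaneDecoder(nn.Module):
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB + density
        )

    def sample_plane(self, plane, coords2d):
        # plane: (C, H, W); coords2d: (P, 2) in [-1, 1]
        grid = coords2d.view(1, 1, -1, 2)
        feats = F.grid_sample(plane[None], grid, align_corners=True)  # (1, C, 1, P)
        return feats[0, :, 0].t()                                     # (P, C)

    def forward(self, triplane, points):
        # triplane: (3, C, H, W); points: (P, 3) in [-1, 1]^3
        xy, xz, yz = points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]
        f = torch.cat([self.sample_plane(triplane[0], xy),
                       self.sample_plane(triplane[1], xz),
                       self.sample_plane(triplane[2], yz)], dim=-1)
        out = self.mlp(f)
        rgb, sigma = torch.sigmoid(out[:, :3]), F.softplus(out[:, 3])
        return rgb, sigma

def render_ray(decoder, triplane, origin, direction, n_samples=64, near=0.5, far=2.5):
    """Alpha-composite colors along one ray (standard volume rendering)."""
    t_vals = torch.linspace(near, far, n_samples)
    points = origin + t_vals[:, None] * direction          # (S, 3) sample positions
    rgb, sigma = decoder(triplane, points)
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), 0)[:-1]
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)              # (3,) pixel color
```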
Time conditioning. Compared with CNN-based DDPMs (denoising diffusion probabilistic models), the transformer-based model requires a different time-conditioning design.
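The article does not spell out this design. One common way to condition a transformer on the diffusion time step, shown here as an assumption rather than the paper's exact choice, is adaLN-style modulation (popularized by DiT) driven by a sinusoidal time embedding:

```python
# Sketch of a common transformer time-conditioning scheme (not the paper's exact design).
import math
import torch
from torch import nn

def timestep_embedding(t, dim=256):
    # t: (B,) integer time steps -> (B, dim) sinusoidal embedding
    half = dim // 2
    freqs = torch.exp(-math.log(10_000) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class AdaLN(nn.Module):
    def __init__(self, dim=1024, t_dim=256):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mod = nn.Sequential(nn.SiLU(), nn.Linear(t_dim, 2 * dim))

    def forward(self, tokens, t_emb):
        # tokens: (B, L, dim); t_emb: (B, t_dim)
        scale, shift = self.mod(t_emb).chunk(2, dim=-1)
        return self.norm(tokens) * (1 + scale[:, None]) + shift[:, None]
```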
When training the model, the researchers note that on datasets with highly diverse camera intrinsics and extrinsics (such as MVImgNet), the input camera conditioning must be carefully designed to help the model understand the cameras and reason in 3D.
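As a hedged illustration of pixel-aligned camera conditioning (a common design in recent multi-view models; the paper's exact parameterization may differ), the sketch below computes per-pixel Plücker ray coordinates that can be concatenated to the image channels before tokenization:

```python
# Sketch: per-pixel Plücker ray coordinates as camera conditioning (illustrative).
import torch

def plucker_rays(K, c2w, height, width):
    # K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose
    ys, xs = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                            torch.arange(width, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)  # (H, W, 3)
    dirs_cam = pix @ torch.linalg.inv(K).t()          # ray directions in camera space
    dirs = dirs_cam @ c2w[:3, :3].t()                 # rotate into world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)
    moment = torch.cross(origin, dirs, dim=-1)        # Plücker moment o x d
    return torch.cat([dirs, moment], dim=-1)          # (H, W, 6) per-pixel ray map
```

The 6-channel ray map can then be concatenated with each (noisy) RGB view so that every image token carries its own camera information.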
The method above yields an unconditional generative model. The researchers further describe how a conditional denoiser can model the conditional distribution p(x | y), where y is a text prompt or an image, to achieve controllable 3D generation.
For image conditioning, the researchers propose a simple and effective strategy that requires no changes to the model architecture: as described earlier, the conditioning view is kept noise-free while the remaining views are denoised, analogous to 2D inpainting; a sketch follows below.
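A minimal sketch of this inpainting-style conditioning during sampling is shown here; `denoiser` stands in for the reconstruction-based denoiser, and the DDIM-style (eta = 0) update is a generic choice, not the paper's exact sampler.

```python
# Sketch: image-conditioned sampling by keeping view 0 clean at every step.
import torch

@torch.no_grad()
def sample_image_conditioned(denoiser, cond_view, cameras, alphas_bar, shape):
    views = torch.randn(shape)                         # (N, 3, H, W) pure noise
    T = len(alphas_bar)
    for t in reversed(range(T)):
        views[0] = cond_view                           # fix the conditioning view
        x0_pred = denoiser(views, cameras, t)          # reconstruct NeRF + render views
        if t > 0:
            a_t, a_prev = alphas_bar[t], alphas_bar[t - 1]
            eps = (views - a_t.sqrt() * x0_pred) / (1 - a_t).sqrt()
            views = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
        else:
            views = x0_pred
    views[0] = cond_view
    return views                                       # denoised multi-view images
```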
Text conditioning. To add text conditioning, the researchers adopt a strategy similar to Stable Diffusion: a CLIP text encoder produces text embeddings, which are injected into the denoiser via cross-attention.
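A sketch of this kind of text conditioning is shown below, using the Hugging Face CLIP text encoder and a cross-attention block; the dimensions and the specific checkpoint are assumptions for illustration.

```python
# Sketch: CLIP text embeddings injected into transformer tokens via cross-attention.
import torch
from torch import nn
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode_text(prompts):
    tokens = tokenizer(prompts, padding="max_length", truncation=True,
                       return_tensors="pt")
    return text_encoder(**tokens).last_hidden_state    # (B, 77, 768)

class CrossAttentionBlock(nn.Module):
    def __init__(self, dim=1024, text_dim=768, heads=16):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(text_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, text_emb):
        # tokens: (B, L, dim) image/triplane tokens; text_emb: (B, 77, text_dim)
        ctx = self.proj(text_emb)
        out, _ = self.attn(self.norm(tokens), ctx, ctx)
        return tokens + out                             # residual injection
```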
Training and inference
Training. During training, time steps t are sampled uniformly from [1, T], and noise is added according to a cosine schedule. The input images are rendered from random camera poses, and additional novel viewpoints are also randomly sampled to supervise the renderings for better quality.
Conditioned on the signal y, the model is trained by minimizing an image-space reconstruction objective on the denoised renderings.
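Putting this training description together, a hedged sketch of one training step might look as follows; the cosine schedule is the standard Nichol–Dhariwal form, and the plain L2 image loss is an assumption (the paper may use additional perceptual terms).

```python
# Sketch of one training step: noise the views, reconstruct, supervise renderings.
import math
import torch
import torch.nn.functional as F

def cosine_alphas_bar(T, s=0.008):
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    return (f[1:] / f[0]).clamp(1e-5, 1.0).float()      # (T,) cumulative alphas

def training_step(denoiser, renderer, views, cams, novel_views, novel_cams, alphas_bar):
    T = len(alphas_bar)
    t = torch.randint(0, T, (1,)).item()                # time step sampled uniformly
    a = alphas_bar[t]
    noisy = a.sqrt() * views + (1 - a).sqrt() * torch.randn_like(views)
    scene = denoiser(noisy, cams, t)                    # reconstruct 3D representation
    pred_in = renderer(scene, cams)                     # renderings at input views
    pred_nv = renderer(scene, novel_cams)               # renderings at novel views
    return F.mse_loss(pred_in, views) + F.mse_loss(pred_nv, novel_views)
```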
Inference. At inference time, the viewpoints are placed evenly on a circle around the object to ensure good coverage of the generated 3D asset, with the camera field of view fixed at 50 degrees for the four views.
In the experiments, the researchers trained the model with the AdamW optimizer at an initial learning rate of 4e-4, using 3K warm-up steps and cosine decay for the learning rate. The denoising model was trained on 256×256 input images, with 128×128 crops used for supervised rendering.
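A small sketch of the reported optimization setup (AdamW, initial learning rate 4e-4, 3K warm-up steps, cosine decay); the model and total step count are placeholders.

```python
# Sketch: AdamW with linear warm-up followed by cosine learning-rate decay.
import math
import torch

def build_optimizer(model, total_steps, warmup=3_000, base_lr=4e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr)

    def lr_lambda(step):
        if step < warmup:
            return step / max(1, warmup)                    # linear warm-up
        progress = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```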
Dataset. The model only needs multi-view posed images for training. The researchers therefore used rendered multi-view images of roughly 730K objects from the Objaverse dataset; for each object, following the LRM setup, they rendered 32 images from random viewpoints with a fixed 50-degree FOV under uniform lighting.
Single-image reconstruction. The researchers compared their image-conditioned model with previous methods such as Point-E, Shap-E, Zero-1-to-3, and Magic123 on single-image reconstruction, evaluating the novel-view rendering quality of all methods with PSNR, LPIPS, CLIP similarity, and FID.
Quantitative results on the GSO and ABO test sets are shown in Table 1. The model outperforms all baselines and sets a new SOTA on every metric on both datasets.
The generated results also show higher-quality geometry and appearance details than the baselines, as illustrated qualitatively in Figure 4.
As a 2D multi-view image diffusion model, DMV3D removes the need to optimize each asset individually: it denoises the multi-view diffusion process and directly produces a 3D NeRF. Overall, DMV3D generates 3D assets quickly and achieves the best single-image 3D reconstruction results.
The researchers also evaluated DMV3D on text-to-3D generation, comparing it with Shap-E and Point-E, which likewise support fast, category-agnostic inference. The three models were given 50 text prompts from Shap-E, and the results were evaluated with CLIP precision and average precision under two different ViT models, as shown in Table 2.
According to the table, DMV3D achieves the best precision. The qualitative results in Figure 5 also show that, compared with the other models, DMV3D's generations contain noticeably richer geometric and appearance details and look more realistic.
Other results
Regarding the number of views, Table 3 and Figure 8 compare, quantitatively and qualitatively, models trained with different numbers of input views (1, 2, 4, 6).
Regarding multi-instance generation, like other diffusion models, the proposed model can generate diverse samples from different random noise inputs, as shown in Figure 1, demonstrating the diversity of its outputs.
DMV3D is flexible and versatile, with strong potential for 3D generation applications. As shown in Figures 1 and 2, in image-editing applications the method can lift any object in a 2D photo into 3D, for example by segmenting it with a method such as SAM.
For more technical details and experimental results, please refer to the original paper.