Just enter a line of text and generate a 3D dynamic scene?
Yes, some researchers have already done it. The current results are clearly still in their infancy and can only produce some simple objects, but this "one-step" approach has nevertheless attracted the attention of a large number of researchers:
In a recent paper, researchers from Meta proposed MAV3D (Make-A-Video3D), the first method that can generate three-dimensional dynamic scenes from text descriptions.
Specifically, the method uses a 4D dynamic Neural Radiance Field (NeRF), optimizing the consistency of scene appearance, density, and motion by querying a text-to-video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera location and angle and can be composited into any 3D environment.
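To make the idea of a 4D scene representation concrete, here is a minimal sketch of what such a model's interface could look like: it maps a spatial point plus a time value to a color and a density, which a volume renderer can then integrate into pixels. The class and layer choices below are illustrative assumptions, not the paper's actual code; MAV3D itself uses a more efficient HexPlane-style factorized representation rather than a plain MLP.

```python
import torch
import torch.nn as nn

class Dynamic4DField(nn.Module):
    """Illustrative 4D radiance field: (x, y, z, t) -> (RGB color, density)."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),          # 3 color channels + 1 density
        )

    def forward(self, xyz: torch.Tensor, t: torch.Tensor):
        # xyz: (N, 3) spatial samples; t: (N, 1) time values in [0, 1]
        h = self.net(torch.cat([xyz, t], dim=-1))
        color = torch.sigmoid(h[..., :3])   # RGB in [0, 1]
        density = torch.relu(h[..., 3:])    # non-negative volume density
        return color, density
```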
MAV3D does not require any 3D or 4D data; the T2V model is trained only on text-image pairs and unlabeled videos.
Let's take a look at the 4D dynamic scenes MAV3D generates from text:
In addition, it can also go directly from an image to 4D; the effect is as follows:
The researchers demonstrated the effectiveness of the method through comprehensive quantitative and qualitative experiments, showing improvements over previously established internal baselines. It is reportedly the first method to generate 3D dynamic scenes from text descriptions.
Method
At a high level, given a text prompt p, the study fits a 4D scene representation, parameterized by θ, that models the appearance of a scene matching the prompt at any point in space and time. Without paired training data, the output of this representation cannot be supervised directly; however, given a sequence of camera poses, an image sequence can be rendered from it and stacked into a video V. The text prompt p and video V are then passed to a frozen, pre-trained T2V diffusion model, which scores the realism of the video and its alignment with the prompt, and Score Distillation Sampling (SDS) is used to compute the update direction for the scene parameters θ.
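A rough sketch of this optimization loop is given below. The renderer, the frozen T2V diffusion model, and the `sds_grad` helper are hypothetical placeholders standing in for components not spelled out here; only the overall control flow (render a clip from the scene, score it with the frozen diffusion model, and push an SDS-style gradient back into the scene parameters θ) follows the description above.

```python
import torch

def optimize_scene(scene, renderer, t2v_diffusion, prompt_embedding,
                   camera_sampler, steps=10_000, lr=1e-3):
    """Hypothetical text-to-4D loop driven by Score Distillation Sampling (SDS)."""
    opt = torch.optim.Adam(scene.parameters(), lr=lr)
    for _ in range(steps):
        cameras = camera_sampler()            # sample a random camera trajectory
        video = renderer(scene, cameras)      # (T, 3, H, W) differentiably rendered clip

        # The frozen T2V model scores the clip; SDS converts that score into a
        # gradient on the rendered pixels without back-propagating through the
        # diffusion network itself.
        noise_level = torch.randint(0, t2v_diffusion.num_timesteps, (1,))
        grad = t2v_diffusion.sds_grad(video, prompt_embedding, noise_level)

        opt.zero_grad()
        video.backward(gradient=grad)         # push the SDS gradient into θ
        opt.step()
    return scene
```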
The above pipeline can be seen as an extension of DreamFusion: it adds a temporal dimension to the scene model and uses a T2V model instead of a text-to-image (T2I) model for supervision. However, achieving high-quality text-to-4D generation requires further innovations:
See the figure below for details:
Experiments
In the experiments, the researchers evaluated MAV3D's ability to generate dynamic scenes from text descriptions. First, they evaluated the effectiveness of the method on the text-to-4D task; MAV3D is reportedly the first solution to this task, so the study developed three alternative methods as baselines. Second, they evaluated simplified versions of the model on the T2V and text-to-3D subtasks and compared them with existing baselines from the literature. Third, comprehensive ablation studies justify the design of the method. Fourth, the experiments describe the process of converting the dynamic NeRF into dynamic meshes and, finally, extend the model to the image-to-4D task.
Metrics
The study evaluates generated videos using CLIP R-Precision, which measures the consistency between the text and the generated scene. The reported metric is the accuracy of retrieving the input prompt from the rendered frames. The researchers used the ViT-B/32 variant of CLIP and extracted frames at different views and time steps. They also used four qualitative metrics, obtained by asking human raters for their preference between two generated videos: (i) video quality; (ii) fidelity to the text prompt; (iii) amount of activity; (iv) realism of motion. All baselines and ablations were evaluated on the text-prompt split.
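As a rough illustration of how CLIP R-Precision can be computed, the sketch below embeds rendered frames and the full set of candidate prompts with CLIP and counts how often the true prompt is the top retrieval for each frame. It uses the openai/CLIP Python package; the function name, arguments, and the exact retrieval protocol are assumptions for illustration and may differ from the paper's evaluation code.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

def clip_r_precision(frame_paths, prompts, true_idx, device="cuda"):
    """Fraction of rendered frames whose top-1 retrieved prompt is the true one.

    frame_paths: rendered frames for one generated scene
    prompts:     all candidate text prompts in the evaluation split
    true_idx:    index of the prompt this scene was generated from
    """
    model, preprocess = clip.load("ViT-B/32", device=device)
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize(prompts).to(device))
        text_feat /= text_feat.norm(dim=-1, keepdim=True)

        hits = 0
        for path in frame_paths:
            image = preprocess(Image.open(path)).unsqueeze(0).to(device)
            img_feat = model.encode_image(image)
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            # retrieve the prompt with the highest cosine similarity to the frame
            hits += int((img_feat @ text_feat.T).argmax().item() == true_idx)

    return hits / len(frame_paths)
```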
Figure 1 and Figure 2 are examples. For more detailed visualizations, see make-a-video3d.github.io.
Results
Table 1 shows the comparison against the baselines (R-Precision and human preference). Human evaluations are presented as the percentage of majority votes in favor of a baseline when compared against the model in a particular setting.
Table 2 shows the results of the ablation experiment:
Real-time rendering
Applications such as virtual reality and games that use traditional graphics engines require standard formats such as textured meshes. HexPlane models can easily be converted into animated meshes, as follows. First, a simple mesh is extracted from the opacity field produced at each time t using the marching cubes algorithm, followed by mesh decimation (for efficiency) and removal of small, noisy connected components. The XATLAS algorithm is used to map the mesh vertices to a texture atlas, and the texture is initialized with the average HexPlane color in a small sphere centered at each vertex. Finally, the texture is further optimized with differentiable mesh rendering to better match some example frames rendered by HexPlane. This produces a collection of textured meshes that can be played back in any off-the-shelf 3D engine.
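The export pipeline described above can be approximated with off-the-shelf tools, as in the sketch below: marching cubes from scikit-image, connected-component cleanup with trimesh, and UV unwrapping with xatlas. The `density_grid` and `color_fn` inputs are placeholders standing in for the trained HexPlane, and the final texture-refinement step with a differentiable renderer is omitted for brevity, so this is only an approximation of the procedure the article describes.

```python
import trimesh
import xatlas
from skimage import measure

def export_textured_mesh(density_grid, color_fn, density_thresh=10.0):
    """Sketch of converting one time slice of a trained 4D field to a textured mesh.

    density_grid: (D, H, W) opacity/density values sampled at a fixed time t
    color_fn:     callable mapping (N, 3) points -> (N, 3) RGB, standing in for
                  averaged HexPlane colors around each vertex
    """
    # 1. Marching cubes on the opacity field gives a raw surface.
    verts, faces, _, _ = measure.marching_cubes(density_grid, level=density_thresh)
    mesh = trimesh.Trimesh(vertices=verts, faces=faces)

    # 2. Keep the largest connected component, discarding small noisy pieces.
    #    A decimation step (e.g. quadric simplification) would normally follow
    #    here for efficiency.
    components = mesh.split(only_watertight=False)
    mesh = max(components, key=lambda m: len(m.faces))

    # 3. UV-unwrap with xatlas and sample per-vertex colors as an initial texture.
    vmapping, indices, uvs = xatlas.parametrize(mesh.vertices, mesh.faces)
    colors = color_fn(mesh.vertices[vmapping])

    return mesh.vertices[vmapping], indices, uvs, colors
```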
Image to 4D
Figures 6 and 10 show that the method can generate depth and motion from a given input image, producing 4D assets.