
It takes only a few seconds to turn an ID photo into a digital human: Microsoft achieves the first high-quality 3D diffusion model, and you can change your look with just one sentence

青灯夜游
Release: 2023-03-31 22:40:41


With a 2D ID photo, you can design a 3D game avatar in just a few seconds!

This is the latest achievement of diffusion models in the 3D field. For example, with just an old photo of the French sculptor Auguste Rodin, AI can "transport" him into a game in minutes:

△The 3D avatar generated by the RODIN model from an old photo of Rodin

You can even change the outfit and look with a single sentence. Tell the AI to generate Rodin "wearing a red sweater and glasses":


Don't like the slicked-back hair? Switch to a "braided look":


Want to try a different hair color? Here is "a fashionable trendy person with brown hair"; even the beard color matches:


(The "fashionable trendy person" in the eyes of AI is indeed a bit too trendy)

The 3D generative diffusion model behind all of this, RODIN (Roll-out Diffusion Network), comes from Microsoft Research Asia.

RODIN is also the first model to automatically generate 3D digital avatars by training a generative diffusion model directly on 3D data. The paper has been accepted by CVPR 2023.

Let's take a look.

Training the diffusion model directly on 3D data

The name of this 3D generative diffusion model, RODIN, is inspired by the French sculptor Auguste Rodin.

Previously, models for generating 3D avatars from 2D images were usually obtained by training generative adversarial networks (GANs) or variational autoencoders (VAEs) on 2D data, but the results were often unsatisfactory.

The researchers' analysis is that these methods suffer from a fundamentally ill-posed (underdetermined) problem: because of the geometric ambiguity of single-view images, it is difficult to learn a reasonable distribution of high-quality 3D avatars from large amounts of 2D data alone, which leads to poor generation results.

Therefore, this time they tried to train the diffusion model directly on 3D data, which meant solving three main problems:

  • First, how to use a diffusion model to generate multiple consistent views of a 3D model. There were previously no practical methods or precedents for applying diffusion models to 3D data.
  • Second, high-quality, large-scale 3D avatar datasets are hard to obtain and carry privacy and copyright risks, while 3D images published on the Internet cannot guarantee multi-view consistency.
  • Finally, directly extending a 2D diffusion model to 3D generation incurs enormous memory, storage, and compute overhead.
To solve these three problems, the researchers proposed the "AI sculptor" RODIN diffusion model, which surpasses the SOTA of existing models.

The RODIN model uses the neural radiance field (NeRF) method and, drawing on NVIDIA's EG3D work, compactly represents 3D space as three mutually perpendicular feature planes (triplanes), then unrolls these feature maps into a single 2D feature plane on which it performs 3D-aware diffusion.

Specifically, the 3D space is represented by 2D feature maps on three orthogonal plane views. This not only lets the RODIN model use an efficient 2D architecture for 3D-aware diffusion; reducing 3D content to 2D feature maps also greatly cuts computational complexity and cost.


△3D-aware convolution efficiently processes 3D features

On the left of the figure above, a triplane represents the 3D space: a feature point on the bottom plane corresponds to two lines, one in each of the other two feature planes. On the right, 3D-aware convolution processes the unrolled 2D feature planes while taking this inherent three-plane correspondence into account.
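To make the triplane idea concrete, here is a minimal PyTorch sketch of how a 3D query point can be projected onto the three orthogonal feature planes and its features aggregated. The tensor shapes and the summation aggregator are illustrative assumptions, not the paper's exact implementation; in a full NeRF pipeline the aggregated feature would then be decoded by a small MLP into density and color.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, points):
    """Look up triplane features for 3D query points.

    planes: (3, C, H, W) feature maps for the XY, XZ and YZ planes.
    points: (N, 3) query coordinates normalized to [-1, 1].
    Returns (N, C) aggregated features.
    """
    # Project each 3D point onto the three orthogonal planes
    xy = points[:, [0, 1]]
    xz = points[:, [0, 2]]
    yz = points[:, [1, 2]]
    feats = []
    for plane, coords in zip(planes, (xy, xz, yz)):
        # grid_sample wants a (1, H_out, W_out, 2) grid; use (1, N, 1, 2)
        grid = coords.view(1, -1, 1, 2)
        sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                mode='bilinear', align_corners=False)
        feats.append(sampled.view(plane.shape[0], -1).t())  # (N, C)
    # Aggregate the three projected features, e.g. by summation
    return sum(feats)
```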

Specifically, three key elements are needed to generate 3D avatars:

First, 3D-aware convolution ensures the intrinsic correlation of the three planes after dimensionality reduction.

The 2D convolutional neural networks (CNNs) used in conventional 2D diffusion cannot handle triplane feature maps well.

Rather than simply treating the three 2D feature planes as independent images, 3D-aware convolution accounts for their inherent three-dimensional nature: a 2D feature in one of the three planes is essentially the projection of a line in 3D space, and is therefore related to the projected features of that same line in the other two planes.

To achieve this cross-plane communication, the researchers build such 3D correlations into the convolution, efficiently synthesizing 3D details while working in 2D.
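Below is a minimal sketch of this idea for the XY plane, assuming square feature planes and mean-pooling as the cross-plane aggregator. It follows the intuition described above (the other two planes are pooled along their non-shared axis and broadcast back so the 2D convolution sees cross-plane context), not necessarily RODIN's exact layer.

```python
import torch
import torch.nn as nn

class Aware3DConv(nn.Module):
    """Sketch of a 3D-aware convolution applied to the XY plane."""

    def __init__(self, channels):
        super().__init__()
        # Input: XY plane plus pooled context from the XZ and YZ planes
        self.conv = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, xy, xz, yz):
        # xy, xz, yz: (B, C, H, W); axes are xy=(X,Y), xz=(X,Z), yz=(Y,Z).
        # Assumes square planes (H == W).
        B, C, H, W = xy.shape
        # Pool XZ over Z -> per-X context, broadcast along the Y axis
        x_ctx = xz.mean(dim=3, keepdim=True).expand(B, C, H, W)
        # Pool YZ over Z -> per-Y context, broadcast along the X axis
        y_ctx = yz.mean(dim=3, keepdim=True).transpose(2, 3).expand(B, C, H, W)
        return self.conv(torch.cat([xy, x_ctx, y_ctx], dim=1))
```

In practice the same pattern would be applied symmetrically to all three planes so each one receives context from the other two.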

Second, a latent space orchestrates the generation of the triplane 3D representation.

The researchers coordinate feature generation through a latent vector so that it is globally consistent across the whole 3D space, which yields higher-quality avatars and enables semantic editing.

At the same time, they train an additional image encoder on the images in the training set, which extracts a semantic latent vector that serves as the conditional input to the diffusion model.

In this way, the overall generative network can be viewed as an autoencoder, with the diffusion model acting as the decoder of the latent vector. For semantic editability, the researchers adopted a frozen CLIP image encoder that shares a latent space with text prompts.
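As a sketch of how such shared-space conditioning vectors can be extracted, here is the frozen-CLIP step using Hugging Face transformers. The checkpoint choice is an assumption for illustration; the article only specifies a frozen CLIP image encoder whose latent space is shared with text.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP encoder (checkpoint choice is an assumption)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def image_condition(path):
    inputs = processor(images=Image.open(path), return_tensors="pt")
    return model.get_image_features(**inputs)  # (1, 512) latent vector

@torch.no_grad()
def text_condition(prompt):
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    return model.get_text_features(**inputs)   # lives in the same space

# Because image and text embeddings share one space, either can serve
# as the conditional input to the diffusion model described above.
cond = text_condition("a man wearing a red sweater and glasses")
```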

Third, hierarchical synthesis generates high-fidelity three-dimensional details.

The researchers first use the diffusion model to generate a low-resolution triplane (64×64), and then generate a high-resolution triplane (256×256) through diffusion upsampling.

In this way, the base diffusion model focuses on generating the overall 3D structure, while the subsequent upsampling model focuses on generating details.
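Schematically, this coarse-to-fine cascade looks like the sketch below. `base_model` and `upsampler` stand in for trained diffusion models and their `sample` interface is a hypothetical placeholder, not a real API; the channel count is likewise assumed.

```python
import torch
import torch.nn.functional as F

def generate_triplanes(base_model, upsampler, cond):
    # Stage 1: base diffusion model samples coarse 64x64 triplanes
    coarse = base_model.sample(shape=(3, 32, 64, 64), condition=cond)
    # Stage 2: a diffusion upsampler refines details at 256x256,
    # conditioned on the bilinearly upsampled coarse planes
    guide = F.interpolate(coarse, scale_factor=4, mode='bilinear',
                          align_corners=False)
    fine = upsampler.sample(shape=(3, 32, 256, 256),
                            condition=cond, low_res_guide=guide)
    return fine
```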

Generating large amounts of random data with Blender

For the training dataset, the researchers used the open-source 3D rendering software Blender to randomly combine virtual 3D character assets hand-crafted by artists, sampling at random from a large pool of hairstyles, clothes, expressions, and accessories to create 100,000 synthetic individuals, and rendered 300 multi-view images at 256×256 resolution for each individual.
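The random-combination idea can be illustrated with a short sketch: sample an attribute combination per identity and 300 camera poses around the head. All asset names, pool sizes, and pose ranges here are made up for illustration; the actual rendering would go through Blender's Python API (bpy), which is omitted.

```python
import math
import random

# Hypothetical asset pools (names and counts are illustrative only)
HAIR = [f"hair_{i}" for i in range(50)]
CLOTHES = [f"outfit_{i}" for i in range(80)]
EXPRESSIONS = [f"expr_{i}" for i in range(20)]
ACCESSORIES = [f"acc_{i}" for i in range(30)]

def sample_identity(rng):
    """Randomly combine one asset from each pool into an identity."""
    return {
        "hair": rng.choice(HAIR),
        "clothes": rng.choice(CLOTHES),
        "expression": rng.choice(EXPRESSIONS),
        "accessory": rng.choice(ACCESSORIES),
    }

def camera_poses(rng, n=300, radius=2.0):
    """Random viewpoints on a sphere around the head."""
    poses = []
    for _ in range(n):
        theta = rng.uniform(0, 2 * math.pi)      # azimuth
        phi = math.acos(rng.uniform(-0.3, 0.5))  # limited elevation band
        poses.append((radius * math.sin(phi) * math.cos(theta),
                      radius * math.sin(phi) * math.sin(theta),
                      radius * math.cos(phi)))
    return poses

rng = random.Random(0)
for idx in range(100_000):
    identity = sample_identity(rng)
    for view, pose in enumerate(camera_poses(rng)):
        pass  # a renderer (e.g. Blender/bpy) would write a 256x256 image here
```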

For text-to-3D-avatar generation, the researchers used the portrait subset of the LAION-400M dataset to train a mapping from the input modality to the latent space of the 3D diffusion model, so that in the end the RODIN model can create a realistic 3D avatar from just a single 2D image or a text description.


△Generating an avatar from a given photo

You can not only generate from a photo but also describe the look in one sentence, such as "a man with curly hair and a beard wearing a black leather jacket":


Even the gender can be changed at will: "a woman in red clothes with an Afro hairstyle" (tongue firmly in cheek):


The researchers also showed an application demo: creating your own avatar takes just a few clicks:


△Editing a 3D portrait with text

For more examples, visit the project page~


△More randomly generated avatars

Now that RODIN is built, what is the team's next plan?

According to the authors at Microsoft Research Asia, RODIN's current results focus mainly on 3D half-length portraits, which reflects the fact that it is trained mostly on face data; but demand for 3D generation is not limited to human faces.

Next, the team will consider using the RODIN model to create more kinds of 3D scenes, including flowers, trees, buildings, cars, homes, and more, toward the ultimate goal of "generating everything in 3D with one model".

Paper address:

https://arxiv.org/abs/2212.06135

Project page:

https://3d-avatar-diffusion.microsoft.com




Source: 51cto.com