
It only takes a few seconds to turn an ID photo into a digital person: Microsoft achieves the first high-quality 3D diffusion model generation, and you can change your appearance with just one sentence.

Mar 31, 2023, 10:40 PM


With a 2D ID photo, you can design a 3D game avatar in just a few seconds!

This is the latest achievement of diffusion models in the 3D field. For example, with just an old photo of the French sculptor Rodin, the model can "transport" him into a game in minutes:

△The 3D image generated by the RODIN model from an old photo of Rodin

You can even modify the outfit and look with just one sentence. Tell the AI to generate Rodin "wearing a red sweater and glasses":


Don't like the slicked-back hair? Then switch to a "braided look":


Want to change the hair color too? Here is a "fashionable trendsetter with brown hair"; even the beard color is matched:


(The "fashionable trendy person" in the eyes of AI is indeed a bit too trendy)

The latest 3D generative diffusion model above, RODIN (Roll-out Diffusion Network), comes from Microsoft Research Asia.

RODIN is also the first model to use a generative diffusion model on 3D training data to automatically generate 3D digital avatars. The paper has been accepted to CVPR 2023.

Let's take a look.

Training the diffusion model directly on 3D data

The name of the 3D generative diffusion model, RODIN, is inspired by the French sculptor Auguste Rodin.

Previously, models that generate 3D images from 2D inputs were usually obtained by training generative adversarial networks (GANs) or variational autoencoders (VAEs) on 2D data, but the results were often unsatisfactory.

The researchers' analysis is that these methods suffer from a fundamentally ill-posed problem: because of the geometric ambiguity of single-view images, it is hard to learn a reasonable distribution of high-quality 3D avatars from large amounts of 2D data alone, so the generated results are poor.

Therefore, this time they tried training the diffusion model directly on 3D data, which meant solving three main problems:

  • First, how to use a diffusion model to generate multi-view renderings of a 3D model. There were previously no practical methods or precedents for applying diffusion models to 3D data.
  • Second, high-quality, large-scale 3D image datasets are hard to obtain and carry privacy and copyright risks, while multi-view consistency cannot be guaranteed for 3D images collected from the Internet.
  • Finally, naively extending a 2D diffusion model to 3D generation requires enormous memory, storage, and compute.
To solve these three problems, the researchers proposed the "AI Sculptor" RODIN diffusion model, which surpasses the SOTA of existing models.

The RODIN model uses the neural radiance field (NeRF) method and, drawing on NVIDIA's EG3D work, compactly represents the 3D space as three mutually perpendicular feature planes (triplanes), unrolling these maps into a single 2D feature plane on which 3D-aware diffusion is then performed.

Specifically, the 3D space is unrolled into 2D features along three orthogonal plane views. This not only lets the RODIN model use an efficient 2D architecture for 3D-aware diffusion; reducing 3D images to 2D in this way also greatly cuts computational complexity and cost.
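To make the triplane representation concrete, below is a minimal sketch of how a 3D point can be queried against the three feature planes: project the point onto each plane, bilinearly sample, and aggregate. This is an illustration only; the coordinate convention and the sum aggregation are assumptions, not RODIN's exact design.

```python
import torch
import torch.nn.functional as F

def query_triplane(planes: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """planes: (3, C, H, W) feature maps for the XY, XZ, YZ planes.
    points: (N, 3) coordinates in [-1, 1]^3.
    Returns (N, C) aggregated features."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # 2D projections of each 3D point onto the three orthogonal planes
    coords = torch.stack([
        torch.stack([x, y], dim=-1),   # XY plane
        torch.stack([x, z], dim=-1),   # XZ plane
        torch.stack([y, z], dim=-1),   # YZ plane
    ])                                  # (3, N, 2)
    # grid_sample expects (B, H_out, W_out, 2); use a 1-pixel-high grid
    grid = coords.unsqueeze(1)          # (3, 1, N, 2)
    feats = F.grid_sample(planes, grid, align_corners=True)  # (3, C, 1, N)
    return feats.sum(dim=0).squeeze(1).permute(1, 0)         # (N, C)

# Example: 32-channel triplane at 64x64 resolution, querying 1000 random points;
# the resulting features would feed a small MLP that predicts density and color
planes = torch.randn(3, 32, 64, 64)
points = torch.rand(1000, 3) * 2 - 1
features = query_triplane(planes, points)
```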


△3D-aware convolution efficiently processes 3D features

On the left side of the figure above, a triplane represents the 3D space; a feature point on the bottom plane corresponds to two lines, one in each of the other two feature planes. On the right side, 3D-aware convolution is introduced to process the unrolled 2D feature planes while respecting the inherent three-dimensional correspondence among the three planes.

Specifically, three key elements are needed to achieve the generation of 3D images:

First, 3D-aware convolution ensures the intrinsic correlation of the three planes after dimensionality reduction.

The 2D convolutional neural network (CNN) used in traditional 2D diffusion cannot handle Triplane feature maps well.

3D-aware convolution does not simply treat the triplane as three independent 2D feature planes; when processing this 3D representation, it considers their inherent three-dimensional nature: the 2D feature at a point in one of the three view planes is essentially the projection of a line in 3D space, and is therefore related to the features of that line's projections in the other two planes.

To achieve cross-plane communication, the researchers account for these 3D correlations in the convolution, efficiently synthesizing 3D details with 2D operations.
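As a sketch of what such cross-plane communication could look like, the layer below augments each pixel of one plane with features pooled from the corresponding lines in the sibling planes before convolving. Mean pooling along the shared axis is one simple stand-in; the paper's actual operator may differ.

```python
import torch
import torch.nn as nn

class Conv3DAware(nn.Module):
    """Updates the xy plane using line features pooled from the xz and yz planes."""
    def __init__(self, channels: int):
        super().__init__()
        # input channels: own plane + line features from the two sibling planes
        self.conv = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, xy, xz, yz):
        # All planes are (B, C, R, R); axis convention (an assumption):
        # xy[..., i, j] is (x=i, y=j); xz[..., i, k] is (x=i, z=k); yz[..., j, k] is (y=j, z=k)
        B, C, R, _ = xy.shape
        # Pool away z: pixel (x, y) sees the projection of its z-line in each sibling plane
        x_line = xz.mean(dim=3, keepdim=True)                   # (B, C, R, 1), varies with x
        y_line = yz.mean(dim=3, keepdim=True).transpose(2, 3)   # (B, C, 1, R), varies with y
        cross = torch.cat(
            [xy, x_line.expand(B, C, R, R), y_line.expand(B, C, R, R)], dim=1)
        return self.conv(cross)  # xz and yz would be updated symmetrically

# Example: 32-channel planes at 64x64 resolution
layer = Conv3DAware(32)
xy, xz, yz = (torch.randn(1, 32, 64, 64) for _ in range(3))
xy_updated = layer(xy, xz, yz)
```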

Second, a latent space orchestrates the generation of the triplane 3D representation.

The researchers coordinate feature generation through a latent vector so that it is globally consistent across the whole 3D space, which yields higher-quality avatars and enables semantic editing.

At the same time, an additional image encoder is trained on the images in the training dataset; it extracts semantic latent vectors that serve as conditional inputs to the diffusion model.

In this way, the overall generative network can be regarded as an autoencoder, with the diffusion model acting as the decoder of the latent-space vector. For semantic editability, the researchers adopted a frozen CLIP image encoder that shares its latent space with text prompts.
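Here is a hedged sketch of that conditioning setup, using a frozen CLIP model from Hugging Face transformers. The checkpoint name is an illustrative choice, and the final sampler call is a hypothetical stand-in for RODIN's decoder, which is not public in this form.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP; image and text embeddings live in one shared space
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def condition_vector(image=None, text=None) -> torch.Tensor:
    """Returns a conditioning latent from either an image or a text prompt."""
    if image is not None:
        inputs = processor(images=image, return_tensors="pt")
        z = clip.get_image_features(**inputs)
    else:
        inputs = processor(text=[text], return_tensors="pt", padding=True)
        z = clip.get_text_features(**inputs)
    return z / z.norm(dim=-1, keepdim=True)  # unit-normalize the shared latent

# Because the space is shared, a text prompt can stand in for an image at edit time:
# z = condition_vector(text="wearing a red sweater and glasses")
# triplane = rodin_diffusion.sample(cond=z)   # hypothetical sampler call
```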

Third, hierarchical synthesis generates high-fidelity three-dimensional details.

The researchers used the diffusion model to first generate a low-resolution triplane (64×64), and then produced a high-resolution triplane (256×256) through diffusion upsampling.

In this way, the basic diffusion model focuses on the overall 3D structure generation, while the subsequent upsampling model focuses on detail generation.
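The two-stage idea can be sketched as a cascade: a base model samples a coarse rolled-out triplane, and an upsampler conditions on its bicubically enlarged output. The toy denoisers and the crude update rule below are placeholders for illustration, not the paper's networks or noise schedule.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    """Stand-in noise predictor; a real model would be a UNet with timestep embedding."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.net = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, t_frac):
        return self.net(x)  # timestep t_frac ignored in this toy version

def sample(denoiser, shape, cond=None, steps=50):
    """A crude denoising loop: repeatedly predict and subtract noise."""
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        t_frac = torch.full((shape[0],), t / steps)
        inp = x if cond is None else torch.cat([x, cond], dim=1)
        x = x - denoiser(inp, t_frac) / steps  # toy update, not a real scheduler
    return x

C = 3 * 32  # three 32-channel planes rolled out into one 2D feature map
base = ToyDenoiser(C, C)           # generates the coarse 64x64 triplane
upsampler = ToyDenoiser(2 * C, C)  # sees [noisy fine, upsampled coarse]

coarse = sample(base, (1, C, 64, 64))                    # overall 3D structure
cond = F.interpolate(coarse, size=256, mode="bicubic")   # coarse conditioning
fine = sample(upsampler, (1, C, 256, 256), cond=cond)    # high-frequency detail
```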

Generating large amounts of random data with Blender

For the training dataset, the researchers used the open-source 3D rendering software Blender to randomly combine virtual 3D character models hand-crafted by artists, plus random samples from a large pool of hairstyles, clothes, expressions, and accessories, creating 100,000 synthetic individuals and rendering 300 multi-view images at a resolution of 256×256 for each individual.
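A hedged sketch of the multi-view rendering step in Blender's Python API follows (to be run inside Blender, whose bundled interpreter provides the bpy module). The asset assembly and randomization are elided, and the paths, orbit radius, and camera angles are hypothetical, not the authors' pipeline.

```python
import math
import random
import bpy

scene = bpy.context.scene
scene.render.resolution_x = 256
scene.render.resolution_y = 256

# Assume a randomized character (body, hair, clothes, accessories) has already
# been assembled at the origin, e.g. by appending assets from an artist library.
cam_data = bpy.data.cameras.new("multiview_cam")
cam = bpy.data.objects.new("multiview_cam", cam_data)
scene.collection.objects.link(cam)
scene.camera = cam

NUM_VIEWS, RADIUS, HEIGHT = 300, 2.5, 1.6  # hypothetical orbit parameters
for i in range(NUM_VIEWS):
    angle = random.uniform(0.0, 2.0 * math.pi)  # random viewpoint on a circle
    cam.location = (RADIUS * math.cos(angle), RADIUS * math.sin(angle), HEIGHT)
    # Rough Euler aim back toward the subject; a track-to constraint is more robust
    cam.rotation_euler = (math.radians(80.0), 0.0, angle + math.pi / 2.0)
    scene.render.filepath = f"//renders/individual_0001/view_{i:03d}.png"
    bpy.ops.render.render(write_still=True)
```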

For generating 3D avatars from text, the researchers used the portrait subset of the LAION-400M dataset to train a mapping from the input modality to the latent space of the 3D diffusion model, ultimately allowing the RODIN model to create a realistic 3D avatar from just a single 2D image or a text description.
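One way such a modality-to-latent mapping could be trained is with a small MLP regressed onto the diffusion model's latents. This is purely illustrative and an assumption about the architecture; random tensors stand in for real portrait pairs so the sketch runs end to end.

```python
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    """Maps a 512-d CLIP embedding into the diffusion model's latent space."""
    def __init__(self, clip_dim: int = 512, latent_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, 1024), nn.SiLU(),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, clip_embedding):
        return self.net(clip_embedding)

mapper = LatentMapper()
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-4)

# Random tensors stand in for (CLIP embedding, target avatar latent) pairs
# that would come from the LAION-400M portrait subset.
clip_emb, target_latent = torch.randn(8, 512), torch.randn(8, 256)
loss = nn.functional.mse_loss(mapper(clip_emb), target_latent)
loss.backward()
optimizer.step()
```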


△Generating an avatar from a given photo

Not only can the look be changed with one sentence, such as "a man with curly hair and a beard wearing a black leather jacket":


Even the gender can be changed at will: "a woman in red clothes with an Afro hairstyle" (tongue in cheek):


The researchers also showed an application demo; creating your own avatar takes just a few clicks:


△Editing a 3D portrait with text

For more examples, you can visit the project page~


△More randomly generated avatars

Now that RODIN is built, what are the team's next plans?

According to the authors at Microsoft Research Asia, RODIN's current work focuses mainly on 3D half-length portraits, which reflects the fact that it is trained mostly on face data; but the demand for 3D image generation is not limited to human faces.

Next, the team will consider using RODIN models to create more kinds of 3D scenes, including flowers, trees, buildings, cars, and homes, toward the ultimate goal of "generating everything in 3D with one model".

Paper address:

https://arxiv.org/abs/2212.06135

Project page:

https://3d-avatar-diffusion.microsoft.com
