Google has released a new video framework:
You only need a single photo of your face and a recording of your speech to get a lifelike video of yourself speaking.
The video length is variable; the examples shown so far run up to 10 seconds.
Both the lip movements and the facial expressions look very natural.
If the input image covers the whole upper body, it can also produce rich hand gestures:
After seeing it, netizens commented:
With this, we'll no longer need to fix our hair and get dressed for online video conferences.
Just take a portrait and record the speech audio (tongue firmly in cheek).
Use your voice to drive a portrait into video
This framework is called VLOGGER.
It is mainly based on the diffusion model and contains two parts:
One is a stochastic audio-to-3D-motion diffusion model.
The other is a new diffusion architecture that augments text-to-image models with temporal controls.
The former takes the audio waveform as input and generates the character's body motion controls, including gaze, facial expressions, gestures, and overall body posture.
The latter is a temporal image-to-image model that extends large-scale image diffusion models, using the just-predicted motions to generate the corresponding frames.
To make the results match a specific identity, VLOGGER also takes a pose map of the reference image as input.
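Since Google has not released any code, the two-stage pipeline described above can only be sketched. Everything below — the function names, tensor shapes, and the toy "rendering" step — is an assumption for illustration, not VLOGGER's actual implementation:

```python
import numpy as np

def audio_to_motion(audio_waveform, num_frames, motion_dim=64, seed=0):
    """Stage 1 (sketch): map an audio waveform to per-frame motion controls
    (gaze, expression, gestures, body pose). Stands in for the stochastic
    audio-to-3D-motion diffusion model; here it is just audio-scaled noise."""
    rng = np.random.default_rng(seed)
    # Toy conditioning: one audio-energy value per output frame.
    chunks = np.array_split(audio_waveform, num_frames)
    energy = np.array([np.abs(c).mean() for c in chunks])
    # "Stochastic" motion: random vectors scaled by per-frame audio energy.
    return rng.standard_normal((num_frames, motion_dim)) * energy[:, None]

def motion_to_frames(reference_image, motion):
    """Stage 2 (sketch): temporal image-to-image model rendering one frame
    per predicted motion vector, conditioned on the reference image."""
    frames = []
    for m in motion:
        # Placeholder "rendering": perturb the reference by motion magnitude.
        frames.append(np.clip(reference_image + np.linalg.norm(m) * 0.01,
                              0.0, 1.0))
    return np.stack(frames)

# Usage: 1 s of synthetic audio at 16 kHz, 25 output frames, 64x64 reference.
audio = np.sin(np.linspace(0, 440 * 2 * np.pi, 16_000))
reference = np.full((64, 64, 3), 0.5)
video = motion_to_frames(reference, audio_to_motion(audio, num_frames=25))
print(video.shape)  # (25, 64, 64, 3)
```

The point of the sketch is the data flow: audio conditions motion, and motion (plus the reference image) conditions frame generation — the two diffusion models are decoupled.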
VLOGGER was trained on a very large dataset named MENTOR.
How big? 2,200 hours of video covering 800,000 distinct subjects.
The test set alone is 120 hours long and covers 4,000 subjects.
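A quick sanity check on those training-set numbers — 2,200 hours spread over 800,000 videos — implies each clip is short, on the order of the 10-second demos:

```python
# Average clip length implied by the MENTOR training-set stats above.
train_hours = 2200
train_videos = 800_000
avg_seconds = train_hours * 3600 / train_videos
print(round(avg_seconds, 1))  # 9.9
```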
According to Google, the most outstanding performance of VLOGGER is its diversity:
As shown in the figure below, the darker (redder) a pixel in the final image, the richer the motion at that location.
Compared with previous methods of this kind, VLOGGER's biggest advantages are that it needs no per-person training, does not rely on face detection and cropping, and generates complete videos (covering the face and lips as well as body movements).
Specifically, as shown in the following table:
Face-reenactment methods cannot control video generation with audio or text.
Audio-to-motion methods can generate 3D facial motion from encoded audio, but their results are not realistic enough.
Lip-sync methods can handle videos of different subjects, but they only simulate mouth movements.
By comparison, the last two methods, SadTalker and StyleTalk, come closest to Google's VLOGGER, but they too fall short in that they cannot control the body or further edit the video.
Speaking of video editing, one application of VLOGGER, shown in the figure below, is exactly that: with one click it can make the subject close their mouth, close both eyes, close only the left eye, or keep both eyes open:
Another application is video translation:
For example, turning the English speech in the original video into Spanish while keeping the lip movements in sync.
Netizens complained
Finally, true to its "old habit", Google has not released the model; all that is available for now are more demo clips and the paper.
Naturally, there are plenty of complaints:
the image quality, lip sync that is still off, a look that remains rather robotic, and so on.
As a result, some did not hold back their negative reviews:
Is this the level of Google?
It hardly lives up to the name "VLOGGER".
Compared with OpenAI's Sora, these remarks are indeed not unreasonable.
What do you think?
More effects:
https://enriccorona.github.io/vlogger/
Full paper:
https://enriccorona.github.io/vlogger/paper.pdf
The above is the detailed content of "Google releases 'VLOGGER' model: a single picture generates a 10-second video". For more information, please follow other related articles on the PHP Chinese website!