In the past two years, with the release of large-scale image-text datasets such as LAION-5B, a series of strikingly effective methods has emerged in image generation, including Stable Diffusion, DALL-E 2, ControlNet, and Composer. These methods have driven major breakthroughs, and the field of image generation has advanced rapidly in just this short period.
Video generation, however, still faces major challenges. Compared with image generation, video generation must handle higher-dimensional data and account for an additional time dimension, which introduces the problem of temporal modeling. Learning temporal dynamics requires large amounts of video-text pair data, yet accurate temporal annotation of videos is very expensive, which limits the size of video-text datasets. The existing WebVid10M video dataset contains only 10.7M video-text pairs, far smaller than the LAION-5B image dataset. This severely restricts the possibility of scaling up video generation models.
To address these problems, a joint research team from Huazhong University of Science and Technology, Alibaba Group, Zhejiang University, and Ant Group recently released the TF-T2V video generation solution:
Paper address: https://arxiv.org/abs/2312.15770
Project homepage: https://tf-t2v.github.io/
Source code will be released soon: https://github.com/ali-vilab/i2vgen-xl (VGen project).
This solution takes a different approach: it trains video generation on large-scale video data without text annotations, from which rich motion dynamics can be learned.
Let’s first take a look at the video generation effect of TF-T2V:
Text-to-video task
Prompt: Generate a video of a large frost-like creature on a snow-covered land.
Prompt: Generate an animated video of a cartoon bee.
Prompt: Generate a video containing a futuristic fantasy motorcycle.
Prompt: Generate a video of a little boy smiling happily.
Prompt: Generate a video of an old man with a headache.
Compositional video generation task
Given text and a depth map, or text and sketches, TF-T2V can perform controllable video generation:
It can also perform high-resolution video synthesis:
Semi-supervised setting
In the semi-supervised setting, TF-T2V can also generate videos that follow motion-related text descriptions, such as "People run from right to left."
The core idea of TF-T2V
The model is divided into a motion branch and an appearance branch: the motion branch models motion dynamics, while the appearance branch learns visual appearance information. The two branches are trained jointly and together enable text-driven video generation.
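To make the two-branch idea concrete, here is a minimal PyTorch sketch of how per-frame spatial (appearance) layers and cross-frame temporal (motion) layers can be stacked. The module names, channel sizes, and layer choices are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class TwoBranchVideoModel(nn.Module):
    """Schematic two-branch denoiser: appearance (spatial) + motion (temporal)."""
    def __init__(self, channels=320):
        super().__init__()
        # Appearance branch: spatial layers that learn per-frame visual content,
        # trainable with text-paired image/video data.
        self.appearance = nn.Conv2d(4, channels, kernel_size=3, padding=1)
        # Motion branch: temporal layers that learn dynamics across frames,
        # trainable on large-scale video data without text annotations.
        self.motion = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.head = nn.Conv2d(channels, 4, kernel_size=3, padding=1)

    def forward(self, latents):
        # latents: (batch, frames, 4, height, width) noisy video latents
        b, f, c, h, w = latents.shape
        x = self.appearance(latents.reshape(b * f, c, h, w))       # per-frame appearance
        x = x.reshape(b, f, -1, h, w).permute(0, 2, 1, 3, 4)       # (b, ch, f, h, w)
        x = self.motion(x)                                          # temporal mixing across frames
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, -1, h, w)
        return self.head(x).reshape(b, f, c, h, w)
```

In such a layout the appearance layers can be trained (or frozen) with text supervision while the temporal layers absorb motion knowledge from unlabeled videos, which is the kind of division of labor the two-branch description implies.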
To improve the temporal consistency of the generated videos, the author team also proposes a temporal coherence loss that explicitly learns the continuity between video frames.
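The paper's exact formulation is not reproduced here; below is a minimal sketch of one plausible frame-difference form of such a loss, with assumed tensor shapes.

```python
import torch.nn.functional as F

def temporal_coherence_loss(pred, target):
    # pred, target: (batch, frames, channels, height, width)
    # Match frame-to-frame differences of the prediction to those of the
    # target, so adjacent generated frames change the way real frames do.
    pred_diff = pred[:, 1:] - pred[:, :-1]
    target_diff = target[:, 1:] - target[:, :-1]
    return F.mse_loss(pred_diff, target_diff)
```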
It is worth mentioning that TF-T2V is a general framework: it applies not only to text-to-video generation but also to compositional video generation tasks such as sketch-to-video, video inpainting, and first-frame-to-video.
For specific details and more experimental results, please refer to the original paper or the project homepage.
In addition, the author team used TF-T2V as a teacher model and applied consistency distillation to obtain the VideoLCM model:
Paper address: https://arxiv.org/abs/2312.09109
Project homepage: https://tf-t2v.github.io/
The source code will be released soon: https://github.com/ali-vilab/i2vgen-xl (VGen project).
Unlike previous video generation methods that require about 50 DDIM denoising steps, the VideoLCM method built on TF-T2V can generate high-fidelity videos with only about 4 denoising steps at inference time, greatly improving the efficiency of video generation.
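To illustrate why fewer denoising steps matter, here is a schematic iterative sampling loop in which inference cost scales linearly with the step count. The `model(x, t)` signature and the model names in the commented usage are placeholders, not the actual VideoLCM sampler.

```python
import torch

@torch.no_grad()
def sample(model, noise, num_steps):
    # Generic iterative denoising loop: each step is one forward pass, so
    # dropping from ~50 DDIM-style steps to ~4 distilled steps removes most
    # of the sampling compute.
    x = noise
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)[:-1]
    for t in timesteps:
        x = model(x, t)  # each call predicts a less-noisy sample
    return x

# video_50 = sample(teacher_model, noise, num_steps=50)  # standard multi-step sampling
# video_4  = sample(videolcm_model, noise, num_steps=4)  # distilled few-step sampling
```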
Let’s take a look at the results of VideoLCM’s 4-step denoising inference:
For specific details and more experimental results, please refer to the original VideoLCM paper or the project homepage.
In short, the TF-T2V solution brings new ideas to the field of video generation and addresses the challenges posed by dataset size and annotation difficulty. By leveraging large-scale video data without text annotations, TF-T2V can generate high-quality videos and supports a variety of video generation tasks. This work should help advance video generation technology and open up broader application scenarios across industries.