
The TF-T2V technology, jointly developed by Huazhong University of Science and Technology, Alibaba, and other partners, reduces the cost of AI video production!

WBOY
Release: 2024-01-11 16:12:20

In the past two years, with the release of large-scale image-text datasets such as LAION-5B, a series of strikingly effective methods have emerged in image generation, including Stable Diffusion, DALL-E 2, ControlNet, and Composer. These methods have driven rapid progress and major breakthroughs in the field of image generation.

However, video generation still faces huge challenges. Compared with image generation, video generation must handle higher-dimensional data and account for an additional time dimension, which introduces the problem of temporal modeling. Learning temporal dynamics requires more video-text pair data, but accurate temporal annotation of videos is very expensive, which limits the size of video-text datasets. The existing WebVid10M dataset contains only 10.7M video-text pairs, far smaller than the LAION-5B image dataset. This severely restricts the large-scale expansion of video generation models.

To solve the above problems, a joint research team from Huazhong University of Science and Technology, Alibaba Group, Zhejiang University, and Ant Group recently released the TF-T2V video generation solution:



Paper address: https://arxiv.org/abs/2312.15770

Project homepage: https://tf-t2v.github.io/

The source code will be released soon: https://github.com/ali-vilab/i2vgen-xl (VGen project).

This solution takes a different approach: it proposes video generation based on large-scale video data without text annotations, from which the model can learn rich motion dynamics.

Let’s first take a look at the video generation effect of TF-T2V:

Text-to-video task

Prompt: Generate a video of a large frost-like creature on a snow-covered land.

Prompt: Generate an animated video of a cartoon bee.

Prompt: Generate a video containing a futuristic fantasy motorcycle.

Prompt: Generate a video of a little boy smiling happily.

Prompt: Generate a video of an old man feeling a headache.

Compositional video generation task

Given text and a depth map, or text and a sketch, TF-T2V can perform controllable video generation:


It can also perform high-resolution video synthesis:


Semi-supervised setting

In the semi-supervised setting, TF-T2V can also generate videos that match motion descriptions in text, such as "people running from right to left":


Method introduction

The core idea of TF-T2V is to split the model into a motion branch and an appearance branch: the motion branch models motion dynamics, while the appearance branch learns visual appearance information. The two branches are trained jointly and together enable text-driven video generation.
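As a rough illustration of this split (a toy NumPy sketch with made-up shapes, not the paper's actual architecture): the appearance branch can operate on each frame independently, so it can learn from image data with text, while the motion branch consumes frame-to-frame differences, which require only raw, text-free video.

```python
import numpy as np

rng = np.random.default_rng(0)

def appearance_branch(frames, w_app):
    # Per-frame "appearance" features: applied to each frame independently,
    # so this part can be supervised with image-text data alone.
    return frames @ w_app  # (T, D) @ (D, H) -> (T, H)

def motion_branch(frames, w_mot):
    # "Motion" features from frame-to-frame differences; these need only
    # raw video, no text annotation.
    diffs = np.diff(frames, axis=0)  # (T-1, D)
    return diffs @ w_mot             # (T-1, H)

# Toy "video": T=4 frames, D=8 values each (hypothetical sizes).
video = rng.normal(size=(4, 8))
w_app = rng.normal(size=(8, 16))
w_mot = rng.normal(size=(8, 16))

app_feats = appearance_branch(video, w_app)  # (4, 16)
mot_feats = motion_branch(video, w_mot)      # (3, 16)
```

In the real model both branches are diffusion-network components trained jointly; the point of the sketch is only that the two supervision signals (image-text pairs and unlabeled video) feed different branches.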

To improve the temporal consistency of generated videos, the author team also proposed a temporal consistency loss that explicitly learns the continuity between video frames.
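One common form of such a loss (an assumption for illustration, not necessarily the paper's exact formulation) penalizes the mean squared difference between adjacent frames, discouraging flicker and encouraging smooth motion:

```python
import numpy as np

def temporal_consistency_loss(frames):
    # Mean squared difference between adjacent frames: small for smoothly
    # varying sequences, large for flickering ones.
    diffs = frames[1:] - frames[:-1]
    return float(np.mean(diffs ** 2))

# Slowly varying toy sequence vs. a flickering one (5 frames, 4 dims each).
smooth = np.stack([np.full(4, t * 0.1) for t in range(5)])
jumpy = np.stack([np.full(4, (t % 2) * 5.0) for t in range(5)])

assert temporal_consistency_loss(smooth) < temporal_consistency_loss(jumpy)
```

In practice the loss would be computed on model features or predictions rather than raw pixels, but the structure is the same.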


It is worth noting that TF-T2V is a general framework, suitable not only for text-to-video tasks but also for compositional video generation tasks such as sketch-to-video, video inpainting, and first-frame-to-video.

For specific details and more experimental results, please refer to the original paper or the project homepage.

In addition, the author team used TF-T2V as a teacher model and applied consistency distillation to obtain the VideoLCM model:


Paper address: https://arxiv.org/abs/2312.09109

Project homepage: https://tf-t2v.github.io/

The source code will be released soon: https://github.com/ali-vilab/i2vgen-xl (VGen project).

Unlike previous video generation methods, which require about 50 DDIM denoising steps, the VideoLCM method built on TF-T2V can generate high-fidelity videos with only about 4 denoising steps, greatly improving the efficiency of video generation.
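The structural difference is only in how many times the sampling loop calls the denoising network. The following toy sketch (with a made-up stand-in denoiser, not VideoLCM's network) shows the same loop run with a DDIM-like 50-step schedule versus an LCM-like 4-step schedule:

```python
import numpy as np

def denoiser(x, t):
    # Stand-in for a learned denoising network (hypothetical): shrinks the
    # signal toward the clean estimate a little at each timestep.
    return x * (t / (t + 1))

def sample(x_noise, num_steps):
    # Generic iterative denoising loop. A DDIM-style sampler runs ~50
    # iterations; a consistency-distilled student needs only ~4, so each
    # video costs roughly 12x fewer network evaluations.
    x = x_noise
    for t in reversed(range(1, num_steps + 1)):
        x = denoiser(x, t)
    return x

rng = np.random.default_rng(0)
noise = rng.normal(size=(4, 8))            # toy "video latent": 4 frames x 8 dims
teacher_out = sample(noise, num_steps=50)  # DDIM-like schedule
student_out = sample(noise, num_steps=4)   # LCM-like few-step schedule
```

The distillation objective trains the student so that its 4-step trajectory lands near the teacher's 50-step result, which is why quality is largely preserved despite the shorter loop.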

Let’s take a look at the results of VideoLCM’s 4-step denoising inference:


For specific details and more experimental results, please refer to the original VideoLCM paper or the project homepage.

In short, TF-T2V brings new ideas to the field of video generation and overcomes the challenges posed by dataset size and annotation difficulty. By leveraging large-scale text-free video data, TF-T2V can generate high-quality videos and be applied to a variety of video generation tasks. This innovation should promote the development of video generation technology and open up broader application scenarios across many industries.

Source: 51cto.com