
Kuaishou version of Sora 'Ke Ling' is open for testing: generates over 120s video, understands physics better, and can accurately model complex movements

WBOY
Release: 2024-06-11 09:51:48

What? Has Zootopia been brought to life by a domestic AI?


Revealed alongside the video is a new domestic large video generation model called "Keling".

Keling follows a technical route similar to Sora's and combines it with a number of self-developed innovations. The videos it produces not only feature large yet plausible motion; they also simulate characteristics of the physical world and show strong concept-combination ability and imagination.

According to the released specifications, Keling can generate videos up to 2 minutes long at 30fps, with resolution up to 1080p, and supports multiple aspect ratios.

Another important point: Keling is not a lab demo or a showcase of selected results, but a product-level application launched by Kuaishou, a leading player in short video. And it is pragmatic rather than promissory: it became usable immediately upon release, and the Keling model has already opened invitation-only testing in the Kuaiying APP. Without further ado, let's look at some of Keling's work~

Understands the laws of the world better and can accurately depict complex motion

The video at the beginning should already have given you a taste of Keling's rich imagination.

Keling is not only imaginative and unconstrained; when depicting motion, it also conforms to real-world laws of movement.

It can also accurately portray complex, large-scale spatio-temporal motion. For example, in a clip of a tiger running at high speed on a road, the footage is coherent, the scene changes plausibly with the camera angle, the tiger's limbs move in coordination, and even the sway of its torso while running is vividly rendered.

There is also a scene of astronauts running on the moon: the movements are smooth, and the gait and the motion of the shadows are plausible and well matched. It is impressive.

Beyond motion, the Keling model can also simulate characteristics of the real physical world, producing videos that better obey the laws of physics. In a clip of milk being poured, both gravity and the rising liquid level behave realistically, and the foam stays on top throughout the pour. In another clip, the cat's paws and the keys reflected on the glossy surface change in sync with the subject.

In addition, interactions with the physical world are faithfully reflected. In the generated video below, a little boy eats a hamburger: each bite leaves tooth marks that persist, and he enjoys the burger as vividly as if he were right in front of you.

Bear in mind that conforming to the laws of physics is still quite difficult for large models; even Sora cannot fully manage it.

For example, in the same burger-eating scene, Sora's video not only gives the human hand just three fingers, but the bite position also fails to match the bite marks on the burger...

Keling handles not only real-world physics and motion, but also imaginative scenes with ease.

For example, this bespectacled rabbit drinks coffee while reading the newspaper, leisurely and content.


Keling's rendering of detail is also strong: in a clip of two slowly blooming flowers, the petals and stamens are clearly visible.


Moreover, Keling not only generates realistic videos; it can produce them at up to 1080p resolution and up to 2 minutes in length (at 30fps), with free choice of aspect ratio.


That includes vertical video, a natural fit for Kuaishou's short-video ecosystem.

In one clip, a train moves forward while the scenery outside the window cycles through spring, summer, autumn and winter; the entire two-minute sequence stays coherent.


That should be enough demonstration. If you want more, head to Keling's official website (see the link at the end of the article) to watch more impressive AI videos!

(Note: The videos in this article are compressed; for high-definition and the latest results, refer to the official website.)

So what unique technologies lie behind these Keling videos?

A native text-to-video technical route

Overall, the Keling model adopts a native text-to-video technical route, rather than combining an image generation model with a temporal module. This is also the core secret behind its long durations, high frame rate, and ability to accurately handle complex motion.

Specifically, the Kuaishou large model team believes an excellent video generation model must address four core elements: model design, data assurance, computing efficiency, and capability extension.

A Sora-like model architecture with a verified scaling law

Start with model design. Two main factors matter: sufficiently strong fitting ability, and sufficient parameter capacity.

In terms of architecture, Keling's overall framework adopts a Sora-like DiT structure, using a Transformer to replace the convolution-based U-Net of traditional diffusion models.

The Transformer offers more powerful processing and generation capabilities, better scalability, and better convergence efficiency. It sidesteps U-Net's limitations on complex tasks, such as excessive redundancy and the tension between receptive field size and localization accuracy.
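The shift from a convolutional U-Net to a Transformer can be illustrated with a minimal sketch: the noisy latent is split into patch tokens, and every patch attends to every other patch. This is a toy NumPy illustration of the general DiT idea, not Keling's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dit_self_attention(tokens, Wq, Wk, Wv):
    """One self-attention step over a sequence of latent patch tokens.

    In a DiT-style model the noisy latent is cut into patches, flattened
    into a token sequence, and processed by Transformer blocks like this
    one instead of by a convolutional U-Net.
    """
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # every patch attends to every patch
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n_tokens, d = 16, 8                           # 16 latent patches, 8-dim embeddings
tokens = rng.normal(size=(n_tokens, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = dit_self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (16, 8): same sequence shape, but information is mixed globally
```

Unlike a convolution, whose receptive field grows only with depth, a single block like this already connects every position to every other, which is one reason the architecture scales well.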

On this basis, the Kuaishou team also upgraded modules such as latent-space encoding/decoding and temporal modeling.

For latent-space encoding/decoding, mainstream video generation models currently tend to use Stable Diffusion's 2D VAE for spatial compression, but for video this leaves obvious information redundancy.

The Kuaishou team therefore developed its own 3D VAE network, compressing space and time simultaneously to obtain higher reconstruction quality and a better balance between training performance and output quality.
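The difference between 2D and 3D compression comes down to whether neighbouring frames are pooled together. Here is a toy sketch using block averaging as a stand-in for learned 3D convolutions (the real encoder is not public); it mainly shows the shape arithmetic.

```python
import numpy as np

def compress_3d(video, t_stride=2, s_stride=8):
    """Toy stand-in for a 3D VAE encoder: jointly downsample time and space.

    A 2D VAE compresses each frame independently (spatial only); a 3D
    encoder also pools across neighbouring frames, removing the temporal
    redundancy between them. Real encoders use learned 3D convolutions;
    this block-average version only illustrates the idea.
    """
    T, H, W = video.shape
    Tc, Hc, Wc = T // t_stride, H // s_stride, W // s_stride
    blocks = video[:Tc * t_stride, :Hc * s_stride, :Wc * s_stride]
    blocks = blocks.reshape(Tc, t_stride, Hc, s_stride, Wc, s_stride)
    return blocks.mean(axis=(1, 3, 5))

video = np.ones((16, 64, 64))      # 16 frames of 64x64 "pixels"
latent = compress_3d(video)
print(latent.shape)                # (8, 8, 8): 128x fewer values than the input
```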

For temporal modeling, the team designed a computationally efficient full attention mechanism (3D Attention) as the spatio-temporal modeling module.

This approach models complex spatio-temporal motion more accurately while keeping computational cost in check, effectively improving the model's modeling ability.
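What "full attention" means here can be sketched in a few lines: the T x H x W latent grid is flattened into a single token sequence, so attention spans space and time jointly rather than frame by frame. The sizes below are illustrative.

```python
import numpy as np

# Full spatio-temporal (3D) attention treats every (frame, row, col) latent
# position as one token, so motion across frames is modelled directly.
# The cost is quadratic in T*H*W, which is why it runs in the compressed
# latent space rather than on raw pixels.
T, H, W, d = 4, 8, 8, 16
latent = np.random.default_rng(1).normal(size=(T, H, W, d))
tokens = latent.reshape(T * H * W, d)     # one joint sequence, not per-frame
print(tokens.shape)                        # (256, 16)
print(tokens.shape[0] ** 2)                # 65536 attention scores per head
```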

Of course, in addition to the model's own capabilities, the text prompts input by the user also have an important impact on the final generated effect.

To this end, the team designed a dedicated language model that performs high-quality expansion and optimization of the prompts users enter.
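As an illustration only (the actual model and its output format are not public), prompt expansion can be thought of as rewriting a terse prompt into a fuller, structured description before it reaches the video model. A trivial template stands in for the dedicated language model here; every field name is hypothetical.

```python
# Hypothetical sketch of prompt expansion: the real system uses a language
# model, not a fixed template, and its fields are unknown.
def expand_prompt(user_prompt: str) -> str:
    return (f"Subject: {user_prompt}. "
            "Style: photorealistic. Camera: slow push-in. "
            "Lighting: natural daylight. Motion: smooth and physically plausible.")

print(expand_prompt("a tiger running on a road"))
```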

How is the data built? A self-developed high-quality data screening scheme

With model design covered, data is equally crucial to model performance.

In fact, insufficient training-data scale and quality are thorny problems for many video generation model developers.

Online videos are generally low quality and fall short of training needs, so the Kuaishou team built a fairly complete labeling system that can filter training data at a fine granularity and adjust its distribution.

This system characterizes the quality of video data from multiple dimensions such as basic video quality, aesthetics, and naturalness, and designs a variety of customized label features for each dimension.
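A minimal sketch of such tag-based filtering, with hypothetical field names and thresholds (the article names the dimensions but not the actual scoring system):

```python
# Each clip carries scores along the dimensions mentioned above; a
# per-dimension threshold keeps only clips that are strong on every axis.
# Scores and thresholds here are illustrative, not Kuaishou's real system.
clips = [
    {"id": "a", "quality": 0.9, "aesthetics": 0.8, "naturalness": 0.7},
    {"id": "b", "quality": 0.4, "aesthetics": 0.9, "naturalness": 0.8},
    {"id": "c", "quality": 0.8, "aesthetics": 0.6, "naturalness": 0.9},
]
thresholds = {"quality": 0.6, "aesthetics": 0.5, "naturalness": 0.6}

kept = [c for c in clips
        if all(c[dim] >= t for dim, t in thresholds.items())]
print([c["id"] for c in kept])  # ['a', 'c']
```

The same tags can also reweight rather than drop data, which is what "adjusting the distribution" refers to.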

Training a video generation model requires feeding the model both the video and a corresponding text description. With video quality assured, how is that text description obtained?

The team developed its own video captioning model, which generates accurate, detailed, and structured descriptions, significantly improving the video model's responsiveness to text instructions.

Even a talented model cannot do without diligent training

With model and data in place, computing efficiency must keep up, so that training on massive data can be completed within a limited time and yield significant results.

For higher computing efficiency, Keling does not adopt the industry's current mainstream DDPM scheme; instead it uses a flow model, with its shorter transport path, as the diffusion backbone.
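The "shorter path" of a flow model can be made concrete. Instead of DDPM's curved many-step noising process, a rectified-flow style objective moves along the straight line between data and noise and trains the network to predict the constant velocity. This is a sketch of that general family, not Keling's exact recipe.

```python
import numpy as np

def flow_matching_pair(x0, noise, t):
    """Build one rectified-flow style training pair (a sketch, not Keling's recipe).

    The sample moves on the straight line between data x0 and noise; the
    network only has to predict the constant velocity (noise - x0), which
    is the 'shorter transport path' compared with DDPM's noising schedule.
    """
    x_t = (1.0 - t) * x0 + t * noise   # point on the straight path at time t
    target_velocity = noise - x0       # what the network is trained to output
    return x_t, target_velocity

rng = np.random.default_rng(2)
x0, noise = rng.normal(size=4), rng.normal(size=4)
x_half, v = flow_matching_pair(x0, noise, 0.5)
# The path is straight: the midpoint really is the average of the endpoints.
print(np.allclose(x_half, (x0 + noise) / 2))  # True
```

A straight path also means fewer integration steps at sampling time, which is where much of the efficiency gain comes from.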

At another level, the shortage of computing power is a problem many AI practitioners face; even a large-model giant like OpenAI is short on compute resources.

This problem will not be fully solved in the short term; what can be done is to maximize compute efficiency within limited overall hardware resources.

Using a distributed training cluster, the Kuaishou team greatly improved Keling's hardware utilization through operator optimization, recomputation-strategy optimization, and other means.

Rather than trying to get everything right in one step, Keling's training adopted a phased strategy that gradually increases resolution:

In the initial low-resolution stage, quantity wins: large amounts of data strengthen the model's grasp of conceptual diversity and its modeling ability;

In the subsequent high-resolution stage, data quality takes over; the priority shifts to further improving model performance and sharpening detail.

This strategy effectively combines the advantages of quantity and quality, ensuring the model improves at every stage of training.
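The staged schedule above can be sketched as a simple configuration; the stage resolutions, clip counts, and quality thresholds below are illustrative, not Kuaishou's actual numbers.

```python
# Progressive-resolution training: a large, loosely filtered corpus at low
# resolution first, then smaller high-quality subsets at high resolution.
# All numbers are made up for illustration.
stages = [
    {"resolution": 256,  "clips": 100_000_000, "min_quality": 0.3},
    {"resolution": 720,  "clips": 10_000_000,  "min_quality": 0.7},
    {"resolution": 1080, "clips": 1_000_000,   "min_quality": 0.9},
]

for stage in stages:
    print(f"train at {stage['resolution']}p: "
          f"{stage['clips']:,} clips, quality >= {stage['min_quality']}")
```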

The needs are ever-changing, and the model is adaptable

On top of the base model, the team has also extended its capabilities along several dimensions, aspect ratio among them.

For aspect ratio, Keling does not follow the mainstream practice of training at a fixed resolution.

Traditional methods usually introduce pre-processing logic when faced with real data of varying aspect ratios, which destroys the original composition and leads to poorly composed outputs.

In contrast, the Kuaishou team's solution lets the model process data of different aspect ratios directly, preserving the original composition.
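One common way to train on mixed aspect ratios without destroying composition is bucketing: route each clip to the closest-ratio bucket so batches stay uniform in shape. Whether Keling uses exactly this is not stated; the sketch only shows the general idea.

```python
# Aspect-ratio bucketing (a common technique, hypothetical here): each clip
# goes to the bucket nearest its native width/height ratio, so no clip is
# cropped or distorted to fit a single fixed training resolution.
buckets = {"16:9": 16 / 9, "9:16": 9 / 16, "1:1": 1.0}

def assign_bucket(width: int, height: int) -> str:
    ratio = width / height
    return min(buckets, key=lambda name: abs(buckets[name] - ratio))

print(assign_bucket(1920, 1080))  # '16:9'
print(assign_bucket(1080, 1920))  # '9:16'
print(assign_bucket(1000, 1100))  # '1:1'
```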

To prepare for future demand for videos several minutes long or more, the team also developed an autoregression-based temporal extension scheme with no obvious quality degradation.
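Autoregressive temporal extension can be sketched as: generate a chunk, then condition the next chunk on the tail of the previous one. `generate_chunk` below is a hypothetical stand-in for the real model.

```python
import numpy as np

def generate_chunk(context, length=8, dim=4, seed=0):
    """Pretend model call: returns `length` new frame latents.

    A real model would condition on `context` (the last frames of the
    previous chunk) to keep motion and content continuous.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(size=(length, dim))

def extend_video(n_chunks=3, overlap=2):
    frames = generate_chunk(context=None)
    for i in range(1, n_chunks):
        context = frames[-overlap:]          # condition on the tail frames
        frames = np.concatenate([frames, generate_chunk(context, seed=i)])
    return frames

print(extend_video().shape)  # (24, 4): three 8-frame chunks stitched together
```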

In addition to text input, Keling also supports various control signals, such as camera movement, frame rate, and edges/keypoints/depth, giving users rich control over content.

Not about chasing bigger models: application is what matters

The large-model race has run hot to this day, and we have witnessed plenty of technological highlight moments, but the original purpose of technical breakthroughs remains application.

Kuaishou's Keling video generation model was born at a leading short-video company and is being continuously explored for applications. Notably, Keling went live immediately upon release, with no fuss and no empty promises: its text-to-video model has officially opened for beta testing in the Kuaiying APP. The currently open version supports 720p video generation, with vertical video generation opening soon.

Beyond text-to-video, Kuaishou has also launched other applications built on the Keling model. "AI Dance King", for instance, is live in the Kuaishou and Kuaiying APPs: whether it's the viral "Subject Three" dance or a two-person routine, upload a full-body photo and the character will dance gracefully to the music within minutes; even the Terracotta Warriors can dance in the most dazzling folk style.

On top of the video generation module, the Kuaishou team also added self-developed 3D face reconstruction technology, plus background stabilization and redirection modules, to present expressions and movements more vividly.

Moreover, the newer "AI singing and dancing" feature has also debuted, letting characters open their mouths and sing while they dance.

And a spoiler: an image-to-video feature based on the Keling model will also reach users in the near future.

In fact, as a leading video company, Kuaishou moved quickly during the large-model boom, having previously launched language models and text-to-image models.

Building on these models, AI copywriting, AI image generation, AI video generation, and more creation features have gone live in the Kuaishou and Kuaiying APPs.


In video generation, Kuaishou has also partnered with universities and research institutions to release key technologies in succession: the motion-controllable video generation algorithm Direct-a-Video, the multi-modal video generation algorithm Video-LaVIT, the image-to-video algorithm I2V-Adapter, and the multi-modal aesthetic evaluation model UNIAA, building a deep technical foundation for the Keling model.

Now Kuaishou's full text-to-video feature has finally made its grand debut. As a short-video giant with unique scene advantages and broad application scenarios, Kuaishou is well placed to be the first to make video generation blossom in short-video scenarios.

If you are interested in AI video creation, head over to the Kuaiying APP and try it for yourself.

Portal:https://www.php.cn/link/1e4dc58a5c8c8908a4d317d6ef44a4d0


Source: 51cto.com