Table of Contents
Comprehensive interpretation of the Sora reproduction plan
Model Architecture Design
Training reproduction plan
Stage 1: large-scale image pre-training
Stage 2: large-scale video pre-training
Stage 3: fine-tuning on high-quality video data
Data Preprocessing
Efficient training support
The world's first Sora-like open source reproduction solution is here! Full disclosure of all training details and model weights

Mar 18, 2024, 8:25 PM
Tags: AI, Sora

The world's first open source Sora-like architecture video generation model is here!

The entire training process, including data processing, all training details and model weights, is all open.

This is the just released Open-Sora 1.0.

Here is a sample of what it can do: it generates bustling traffic in a city night scene.

It can also render an aerial view of a cliff coastline, with waves lapping against the rocks.

Or a vast starry sky captured in time-lapse.

Since Sora's release, its stunning results and the scarcity of public technical details have made demystifying and reproducing it one of the hottest topics in the developer community. For example, the Colossal-AI team earlier launched a Sora training and inference reproduction pipeline claimed to cut costs by 46%.

Just two weeks later, the team released its latest progress: a Sora-like reproduction solution, with the technical approach and detailed tutorials open-sourced for free on GitHub.

So the question is: how do you reproduce Sora?

Open-Sora open source address: https://github.com/hpcaitech/Open-Sora

Comprehensive interpretation of the Sora reproduction plan

The Sora reproduction plan covers four aspects:

  • Model architecture design
  • Training reproduction plan
  • Data preprocessing
  • Efficient training optimization strategy

Model Architecture Design

The model adopts the Diffusion Transformer (DiT), the same family of architecture that Sora is believed to use.

It is based on PixArt-α, a high-quality open-source text-to-image model built on the DiT architecture, and extends it to video data by introducing a temporal attention layer.

Specifically, the entire architecture includes a pre-trained VAE, a text encoder and a STDiT (Spatial Temporal Diffusion Transformer) model that utilizes the spatial-temporal attention mechanism.

Among them, the structure of each layer of STDiT is shown in the figure below.

It stacks a one-dimensional temporal attention module on top of a two-dimensional spatial attention module in series to model temporal relationships. After the temporal attention module, a cross-attention module aligns the output with the text semantics.

Compared with the full attention mechanism, such a structure greatly reduces training and inference overhead.
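The serial factorization described above can be sketched in a few lines. The following is a minimal NumPy illustration of the token regrouping only, not the actual Open-Sora code: spatial attention treats each frame's patches as one sequence, and temporal attention treats each patch position's frames as one sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Toy single-head self-attention over the sequence axis.
    x: (batch, seq, dim); uses x itself as Q, K and V for brevity."""
    d = x.shape[-1]
    scores = softmax(x @ x.transpose(0, 2, 1) / np.sqrt(d))
    return scores @ x

def stdit_block(x):
    """x: (B, T, S, C) latent video tokens -- T frames, S spatial patches.
    Spatial attention mixes patches within each frame; temporal attention
    then mixes the same patch position across frames (serial, as in STDiT)."""
    B, T, S, C = x.shape
    # Spatial attention: sequences of length S, one per (video, frame)
    x = self_attention(x.reshape(B * T, S, C)).reshape(B, T, S, C)
    # Temporal attention: sequences of length T, one per (video, patch)
    x = x.transpose(0, 2, 1, 3).reshape(B * S, T, C)
    x = self_attention(x).reshape(B, S, T, C).transpose(0, 2, 1, 3)
    return x

x = np.random.randn(2, 8, 16, 32)   # 2 videos, 8 frames, 16 patches, dim 32
y = stdit_block(x)
print(y.shape)                       # (2, 8, 16, 32)
```

No sequence in either pass is longer than S or T, which is where the cost saving over full attention comes from.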

Compared with Latte, another model that uses a spatial-temporal attention mechanism, STDiT can better reuse the weights of a pre-trained image DiT and continue training on video data.

△STDiT structure diagram

The training and inference process of the entire model is as follows.

In the training stage, the pre-trained Variational Autoencoder (VAE) encoder first compresses the video data; the STDiT diffusion model is then trained in the compressed latent space together with the text embeddings.

In the inference stage, Gaussian noise is randomly sampled in the VAE latent space and fed into STDiT together with the prompt embedding; the denoised features are then passed to the VAE decoder, which decodes them into the final video.
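The data flow of the two stages can be sketched as follows. This is a deliberately toy illustration: the stub functions below stand in for the real pre-trained VAE, text encoder and STDiT, and the diffusion arithmetic is simplified to show only how the pieces connect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stubs standing in for the real pre-trained modules.
def vae_encode(video):
    """Toy 2x temporal / 4x spatial compression into the latent space."""
    return video[:, ::2, ::4, ::4]

def vae_decode(z):
    """Toy upsampling back to pixel space."""
    return z.repeat(2, axis=1).repeat(4, axis=2).repeat(4, axis=3)

def stdit(z, t, text_emb):
    """Placeholder noise predictor; the real model conditions on t and text."""
    return 0.1 * z

# --- Training step: diffuse a clean latent, learn to predict the noise ---
video = rng.standard_normal((1, 16, 64, 64))    # (batch, frames, H, W)
z0 = vae_encode(video)
noise = rng.standard_normal(z0.shape)
z_t = z0 + 0.5 * noise                          # simplified forward diffusion
pred = stdit(z_t, t=500, text_emb=None)
loss = ((pred - noise) ** 2).mean()             # noise-prediction objective

# --- Inference: start from pure noise, iteratively denoise, then decode ---
z = rng.standard_normal(z0.shape)
for t in reversed(range(10)):                   # a handful of toy steps
    z = z - stdit(z, t, text_emb=None)
video_out = vae_decode(z)
print(video_out.shape)                          # (1, 16, 64, 64)
```

The key point is that both training and sampling happen entirely in the compressed latent space; pixels appear only at the VAE boundary.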

△Model training process

Training reproduction plan

For the training reproduction, Open-Sora draws on Stable Video Diffusion (SVD).

It is divided into 3 stages:

  • Large-scale image pre-training.
  • Large-scale video pre-training.
  • Fine-tuning of high-quality video data.

Each stage will continue training based on the weights of the previous stage.

Compared with single-stage training from scratch, multi-stage training achieves the goal of high-quality video generation more efficiently by gradually expanding data.

△Three phases of training plan

Stage 1: large-scale image pre-training.

The team used abundant image data from the Internet and text-to-image techniques to first train a high-quality text-to-image model, which serves as the initialization weights for the video pre-training in the next stage.

At the same time, since no high-quality spatio-temporal VAE is currently available, they use the image VAE pre-trained by Stable Diffusion.

This not only ensures the superior performance of the initial model, but also significantly reduces the overall cost of video pre-training.

Stage 2: large-scale video pre-training.

This stage mainly improves the model's generalization ability and lets it effectively capture the temporal correlations in video.

It requires training on a large amount of video data, with the diversity of the video material ensured.

The stage-2 model adds a temporal attention module on top of the stage-1 text-to-image model to learn temporal relationships in video. The remaining modules are kept identical to stage 1 and load the stage-1 weights as initialization. The output of the temporal attention module is initialized to zero, which yields more efficient and faster convergence.

The Colossal-AI team used PixArt-α's open-source weights to initialize the stage-2 STDiT model, with the T5 model as the text encoder. They pre-trained at a small 256×256 resolution, which further speeds up convergence and reduces training cost.
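The zero-initialization trick mentioned above is worth a quick illustration: if the temporal branch is residual and its output projection starts at zero, the video model initially behaves exactly like the pre-trained image model, which is what makes convergence fast. A minimal NumPy sketch (not the actual implementation; the frame-averaging below is just a stand-in for temporal attention):

```python
import numpy as np

rng = np.random.default_rng(1)

def temporal_branch(x, w_out):
    """Toy residual temporal branch: average over frames as a stand-in for
    temporal attention, project with w_out, and add back to the input."""
    mixed = np.broadcast_to(x.mean(axis=1, keepdims=True), x.shape)
    return x + mixed @ w_out

T, C = 8, 32
x = rng.standard_normal((1, T, C))

w_zero = np.zeros((C, C))                   # zero-initialized output projection
w_rand = 0.1 * rng.standard_normal((C, C))  # conventional random init

# With zero init, the new temporal module is a no-op at step 0, so the
# video model reproduces the pre-trained image model's output exactly.
assert np.allclose(temporal_branch(x, w_zero), x)
assert not np.allclose(temporal_branch(x, w_rand), x)
print("zero-init temporal branch acts as the identity at initialization")
```

Gradients still flow into the zeroed projection, so the branch switches on gradually during training instead of disrupting the pre-trained features at step 0.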

△Open-Sora generation result (prompt: a shot of the underwater world, a turtle swimming leisurely among the coral reefs)

Stage 3: fine-tuning on high-quality video data.

According to the team, this stage significantly improves generation quality. The data used is an order of magnitude smaller than in the previous stage, but the videos are longer, higher-resolution, and higher-quality.

Fine-tuning in this way efficiently scales video generation from short to long, from low to high resolution, and from low to high fidelity.

It is worth mentioning that Colossal-AI also disclosed the resource usage of each stage in detail.

For the Open-Sora reproduction, they trained on 64 H800 GPUs. Stage 2 took 2,808 GPU hours, roughly US$7,000; stage 3 took 1,920 GPU hours, roughly US$4,500. By this estimate, the whole training plan keeps the Open-Sora reproduction at around US$10,000.
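These figures are easy to sanity-check. A quick back-of-the-envelope calculation (the per-hour rates below are implied by the reported numbers, not quoted by the team):

```python
stage2_gpu_hours, stage2_cost = 2808, 7000   # reported: ~US$7,000
stage3_gpu_hours, stage3_cost = 1920, 4500   # reported: ~US$4,500

total_hours = stage2_gpu_hours + stage3_gpu_hours
print(total_hours)                           # 4728 GPU hours for stages 2+3

# Implied price per H800 GPU hour from each stage's reported cost
rate2 = stage2_cost / stage2_gpu_hours
rate3 = stage3_cost / stage3_gpu_hours
print(round(rate2, 2), round(rate3, 2))      # 2.49 2.34 -- both near $2.5/hour

# The two per-stage figures sum slightly above the quoted total, so the
# "about US$10,000" headline is a loose round-down of this sum.
print(stage2_cost + stage3_cost)             # 11500
```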

Data Preprocessing

To further lower the threshold and complexity of reproducing Sora, the Colossal-AI team also provides a convenient video data preprocessing script in the code repository, so that anyone can easily start pre-training for a Sora reproduction.

This includes downloading public video datasets, segmenting long videos into short clips based on shot continuity, and using the open-source vision-language model LLaVA to generate precise captions.

With the batch video captioning code they provide, a video can be annotated in 3 seconds on two GPUs, with quality close to GPT-4V.

The resulting video/text pairs can be used directly for training. With the open-source code on GitHub, you can easily and quickly generate the video/text pairs needed for training from your own dataset, significantly lowering the technical threshold and upfront preparation for starting a Sora reproduction project.
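The shot-continuity segmentation step can be illustrated with a toy detector: compare consecutive frames and cut wherever the difference spikes. Real pipelines use dedicated scene-detection tools on actual video files; this NumPy sketch only shows the idea on synthetic frames.

```python
import numpy as np

rng = np.random.default_rng(2)

def split_by_shots(frames, threshold=0.5):
    """Split a video (N, H, W) into clips at abrupt frame-to-frame changes.
    Returns a list of (start, end) index ranges, one per detected shot."""
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    cuts = np.flatnonzero(diffs > threshold) + 1
    bounds = [0, *cuts.tolist(), len(frames)]
    return list(zip(bounds[:-1], bounds[1:]))

def shot(level):
    """Ten 8x8 frames with a given base brightness plus a little noise."""
    return level + 0.01 * rng.standard_normal((10, 8, 8))

# Synthetic video: three "shots" with clearly different brightness
video = np.concatenate([shot(0.0), shot(2.0), shot(4.0)])

clips = split_by_shots(video)
print(clips)        # [(0, 10), (10, 20), (20, 30)]
```

Each detected range then becomes one short clip that gets captioned and stored as a video/text pair.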

Efficient training support

In addition, the Colossal-AI team also provides a training acceleration solution.

Through efficient training strategies such as operator optimization and hybrid parallelism, they achieved a 1.55× speedup when training on 64-frame, 512×512-resolution video.

At the same time, thanks to Colossal-AI's heterogeneous memory management system, a 1-minute 1080p HD video training task can run unimpeded on a single server (8×H800).

The team also found that the STDiT model architecture is remarkably efficient during training.

Compared with a DiT using a full attention mechanism, STDiT achieves up to a 5× speedup as the number of frames grows, which is particularly important in real-world tasks such as processing long video sequences.
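The scaling behavior is easy to see from the attention cost alone. Full spatio-temporal attention attends over all T·S tokens at once, so its pairwise-interaction count is (T·S)²; the factorized STDiT-style design pays only T·S² (spatial) plus S·T² (temporal). A quick check (this counts attention interactions only, which is why the ratio far exceeds the ~5× end-to-end speedup reported above, where other layers dominate):

```python
def full_attention_cost(T, S):
    return (T * S) ** 2           # every token attends to every token

def factorized_cost(T, S):
    return T * S**2 + S * T**2    # spatial within frames + temporal across frames

S = 1024                          # e.g. a 32x32 grid of latent patches per frame
for T in (4, 16, 64):
    ratio = full_attention_cost(T, S) / factorized_cost(T, S)
    print(T, round(ratio, 1))     # the advantage grows with the frame count
```

For T=64 frames the attention-interaction ratio is already around 60×, so the longer the video, the more the factorization pays off.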

Finally, the team also released more Open-Sora generation effects.

[Video, duration 00:25]

The team told QbitAI that it will keep updating and optimizing Open-Sora over the long term. Going forward, more video training data will be used to generate higher-quality, longer video content and to support multi-resolution features.

As for practical applications, the team says it will push for adoption in movies, games, advertising and other fields.

Interested developers can visit the GitHub project to learn more.

Open-Sora open source address: https://github.com/hpcaitech/Open-Sora

Reference links:

[1] Scalable Diffusion Models with Transformers: https://arxiv.org/abs/2212.09748
[2] PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis: https://arxiv.org/abs/2310.00426
[3] Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets: https://arxiv.org/abs/2311.15127
[4] Latte: Latent Diffusion Transformer for Video Generation: https://arxiv.org/abs/2401.03048
[5] Stable Diffusion VAE weights: https://huggingface.co/stabilityai/sd-vae-ft-mse-original
[6] T5 (Text-To-Text Transfer Transformer): https://github.com/google-research/text-to-text-transfer-transformer
[7] LLaVA: https://github.com/haotian-liu/LLaVA
[8] Open-Sora v1.0 blog: https://hpc-ai.com/blog/open-sora-v1.0

