In recent years, the powerful image synthesis capabilities of diffusion models have been amply demonstrated. The research community is now tackling a harder task: video generation. Recently, Lilian Weng, head of OpenAI's Safety Systems, wrote a blog post on diffusion models for video generation.
Video generation is itself a superset of image synthesis, since an image is simply a single-frame video. Video synthesis is much harder, for the following reasons: 1. Video synthesis also requires temporal consistency across frames, which naturally demands that more world knowledge be encoded in the model.
2. Compared with text or images, it is much harder to collect large amounts of high-quality, high-dimensional video data, let alone paired text-video data.
If you want to learn more about how diffusion models are applied to image generation, you can read the earlier blog post "What are Diffusion Models?" by the author of this article, Lilian Weng: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
Video generation modeling from scratch
First, let's look at how to design and train a diffusion video model from scratch, that is, without relying on a pre-trained image generator.
Parameterization and sampling
The variable definitions used here differ slightly from those of the previous article, but the mathematical form is consistent. Let $\mathbf{x} \sim q_\text{real}$ be a data point sampled from the real data distribution. Adding a small amount of Gaussian noise over time creates a sequence of noisy variants of $\mathbf{x}$, denoted $\{\mathbf{z}_t \mid t = 1, \dots, T\}$, where the noise increases as $t$ increases, and finally $q(\mathbf{z}_T) \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. This forward noising process is a Gaussian process. Let $\alpha_t$ and $\sigma_t$ define the differentiable noise schedule of this Gaussian process:

$$q(\mathbf{z}_t \mid \mathbf{x}) = \mathcal{N}(\mathbf{z}_t; \alpha_t \mathbf{x}, \sigma_t^2 \mathbf{I})$$

To represent $q(\mathbf{z}_t \mid \mathbf{z}_s)$ for $0 \le s < t \le T$, we have:

$$q(\mathbf{z}_t \mid \mathbf{z}_s) = \mathcal{N}\Big(\mathbf{z}_t; \tfrac{\alpha_t}{\alpha_s}\mathbf{z}_s, \big(\sigma_t^2 - \tfrac{\alpha_t^2}{\alpha_s^2}\sigma_s^2\big)\mathbf{I}\Big)$$

Let the log signal-to-noise ratio be $\lambda_t = \log(\alpha_t^2 / \sigma_t^2)$. The DDIM update can then be expressed as:

$$\mathbf{z}_s = \alpha_s \hat{\mathbf{x}}_\theta(\mathbf{z}_t) + \sigma_s \frac{\mathbf{z}_t - \alpha_t \hat{\mathbf{x}}_\theta(\mathbf{z}_t)}{\sigma_t}$$
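To make the notation concrete, here is a minimal sketch (in PyTorch, with the schedule values passed in as plain tensors) of the forward noising step and a single DDIM update, given a model's prediction $\hat{\mathbf{x}}_\theta(\mathbf{z}_t)$ of the clean sample; it is an illustration of the formulas above, not any paper's implementation:

```python
import torch

def q_sample(x, alpha_t, sigma_t):
    # Forward process: z_t = alpha_t * x + sigma_t * eps, with eps ~ N(0, I).
    eps = torch.randn_like(x)
    return alpha_t * x + sigma_t * eps, eps

def ddim_step(z_t, x_hat, alpha_t, sigma_t, alpha_s, sigma_s):
    # DDIM update from time t to an earlier time s (s < t), given the
    # model's prediction x_hat of the clean sample at time t.
    eps_hat = (z_t - alpha_t * x_hat) / sigma_t   # implied noise estimate
    return alpha_s * x_hat + sigma_s * eps_hat
```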
The paper "Progressive Distillation for Fast Sampling of Diffusion Models" by Salimans & Ho (2022) proposes a special prediction parameterization, $\mathbf{v} = \alpha_t \boldsymbol{\epsilon} - \sigma_t \mathbf{x}$. Research shows that, compared with the $\boldsymbol{\epsilon}$-parameterization, the $\mathbf{v}$-parameterization helps avoid color-shift problems in video generation.

The $\mathbf{v}$-parameterization is derived via a trick in angular coordinates. First, define $\phi_t = \arctan(\sigma_t / \alpha_t)$, from which we obtain $\alpha_t = \cos\phi_t$, $\sigma_t = \sin\phi_t$ and $\mathbf{z}_t = \cos\phi_t\,\mathbf{x} + \sin\phi_t\,\boldsymbol{\epsilon}$. The velocity of $\mathbf{z}_t$ can then be written as:

$$\mathbf{v}_t = \frac{d\mathbf{z}_t}{d\phi_t} = \cos\phi_t\,\boldsymbol{\epsilon} - \sin\phi_t\,\mathbf{x}$$

from which we can deduce:

$$\mathbf{x} = \cos\phi_t\,\mathbf{z}_t - \sin\phi_t\,\mathbf{v}_t, \qquad \boldsymbol{\epsilon} = \sin\phi_t\,\mathbf{z}_t + \cos\phi_t\,\mathbf{v}_t$$

The DDIM update rule can be rewritten accordingly:

$$\mathbf{z}_{\phi_s} = \cos(\phi_s - \phi_t)\,\mathbf{z}_{\phi_t} + \sin(\phi_s - \phi_t)\,\hat{\mathbf{v}}_\theta(\mathbf{z}_{\phi_t})$$

Figure 1: How the diffusion update step works, visualized in angular coordinates; with the $\mathbf{v}$-parameterization, the model predicts the velocity $\hat{\mathbf{v}}_\theta$.

For video generation tasks, extending the video length or increasing the frame rate requires the diffusion model to run multiple upsampling steps. This in turn requires the ability to sample a second video $\mathbf{x}^b$ conditioned on a first video $\mathbf{x}^a$, where $\mathbf{x}^b$ may be an autoregressive extension of $\mathbf{x}^a$ or the missing frames of a low-frame-rate video. Besides its own noisy variable, the sampling of $\mathbf{x}^b$ must also be conditioned on $\mathbf{x}^a$. The Video Diffusion Model (VDM) of Ho et al. (2022) proposes the reconstruction guidance method, which uses an adjusted denoising model so that the sampling of $\mathbf{x}^b$ can be properly conditioned on $\mathbf{x}^a$:

$$\tilde{\mathbf{x}}^b_\theta(\mathbf{z}_t) = \hat{\mathbf{x}}^b_\theta(\mathbf{z}_t) - \frac{w_r \alpha_t}{2} \nabla_{\mathbf{z}^b_t} \big\| \mathbf{x}^a - \hat{\mathbf{x}}^a_\theta(\mathbf{z}_t) \big\|^2_2$$
where $\hat{\mathbf{x}}^a_\theta(\mathbf{z}_t)$ and $\hat{\mathbf{x}}^b_\theta(\mathbf{z}_t)$ are the reconstructions of $\mathbf{x}^a$ and $\mathbf{x}^b$ provided by the denoising model, and $w_r$ is a weighting factor; a larger $w_r > 1$ is found to improve sample quality. Note that, using the same reconstruction guidance method, it is also possible to extend samples conditioned on low-resolution videos into high-resolution samples.
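As an illustration, here is a minimal sketch of the reconstruction-guidance correction, assuming a hypothetical `denoise_fn` that maps the joint noisy latent of both videos to their reconstructions (the signature and tensor layout are assumptions, not the paper's code):

```python
import torch

def reconstruction_guided_x_b(denoise_fn, z_a, z_b, x_a, alpha_t, w_r=2.0):
    # z_a, z_b: noisy latents of video a (conditioning) and video b (to sample).
    # x_a: the clean conditioning video; w_r: guidance weight (> 1 helps quality).
    z_b = z_b.detach().requires_grad_(True)
    z_joint = torch.cat([z_a, z_b], dim=1)            # concatenate along the frame axis
    x_a_hat, x_b_hat = denoise_fn(z_joint).chunk(2, dim=1)
    err = (x_a - x_a_hat).pow(2).sum()                # reconstruction error on video a
    grad_z_b = torch.autograd.grad(err, z_b)[0]
    # Adjusted reconstruction of video b, following the guidance formula above.
    return x_b_hat.detach() - 0.5 * w_r * alpha_t * grad_z_b
```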
Model architecture: 3D U-Net and DiT
Similar to text-to-image diffusion models, U-Net and Transformer remain the common architecture choices. Google has developed a series of diffusion video modeling papers based on the U-Net architecture, while OpenAI's recent Sora model uses the Transformer architecture. VDM uses the standard diffusion model setup but modifies the architecture to make it more suitable for video modeling. It extends the 2D U-Net to handle 3D data, where each feature map represents a 4D tensor: frames x height x width x channels. This 3D U-Net is factorized over space and time, meaning that each layer only operates on either the spatial or the temporal dimension, but never both at the same time.
Processing space: Each original 2D convolution layer in the 2D U-Net is expanded into a space-only 3D convolution; specifically, a 3x3 convolution becomes a 1x3x3 convolution. Each spatial attention block remains attention over space, with the first axis (frames) treated as the batch dimension.
Processing time: A temporal attention block is added after each spatial attention block. It attends over the first axis (frames) and treats the spatial axes as the batch dimension. Relative position embeddings are used to track the order of frames. This temporal attention block is what gives the model good temporal consistency.
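This factorization can be sketched roughly as follows, a minimal PyTorch illustration of the "frames as batch" / "space as batch" reshaping; the tensor layout and the use of standard multi-head attention are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeAttention(nn.Module):
    # Space-only attention treats frames as batch; time-only attention treats
    # spatial positions as batch, mirroring the factorized 3D U-Net blocks.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, F, H, W, C)
        b, f, h, w, c = x.shape
        xs = x.reshape(b * f, h * w, c)         # frames folded into the batch
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = xs.reshape(b, f, h, w, c)
        xt = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, f, c)  # space folded into the batch
        xt, _ = self.temporal_attn(xt, xt, xt)
        return xt.reshape(b, h, w, f, c).permute(0, 3, 1, 2, 4)
```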
Imagen Video (Ho et al., 2022) is built on a cascade of diffusion models, which includes: a frozen T5 text encoder providing text embeddings as conditional input; a base video diffusion model; and interleaved spatial super-resolution (SSR) and temporal super-resolution (TSR) models.
Both the SSR and TSR models are conditioned on upsampled inputs concatenated channel-wise with the noisy data $\mathbf{z}_t$. SSR upsamples by bilinear resizing, while TSR upsamples by repeating frames or filling in blank frames. Imagen Video also applies progressive distillation to speed up sampling, with each distillation iteration halving the required number of sampling steps. In experiments, they were able to distill all 7 video diffusion models down to just 8 sampling steps per model without any noticeable loss in perceptual quality.

To scale up more effectively, Sora adopts the DiT (Diffusion Transformer) architecture, which operates on spacetime patches of video and image latent codes. It represents visual input as a sequence of spacetime patches and uses these patches as Transformer input tokens. Figure 5: Sora is a diffusion Transformer model.
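A rough sketch of how a video latent might be cut into spacetime patch tokens for a DiT-style model; the patch sizes, the use of Conv3d, and the latent layout are illustrative assumptions, not Sora's actual implementation:

```python
import torch
import torch.nn as nn

class SpacetimePatchify(nn.Module):
    # Turns a latent video (B, C, T, H, W) into a sequence of patch tokens
    # by cutting non-overlapping blocks over both time and space.
    def __init__(self, in_channels=4, embed_dim=768, patch_t=2, patch_hw=2):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=(patch_t, patch_hw, patch_hw),
                              stride=(patch_t, patch_hw, patch_hw))

    def forward(self, z):                         # z: (B, C, T, H, W)
        tokens = self.proj(z)                     # (B, D, T', H', W')
        return tokens.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D) token sequence

# Example: a 16-frame, 32x32 latent with 4 channels becomes 8*16*16 = 2048 tokens.
tokens = SpacetimePatchify()(torch.randn(1, 4, 16, 32, 32))
```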
Adapting image models to generate video
In diffusion video modeling, another important approach is to "inflate" a pre-trained text-to-image diffusion model by inserting temporal layers, after which one can either fine-tune only the new layers on video data or skip additional training entirely. The new model inherits prior knowledge from text-image pairs, which helps alleviate the need for paired text-video data.
Fine-tuning on video data
Make-A-Video, proposed by Singer et al. in 2022, extends a pre-trained diffusion image model with a temporal dimension and contains three key components:
1. A base text-to-image model trained on text-image pair data.
2. Spatiotemporal convolution and attention layers that extend the network to cover the temporal dimension.
3. A frame interpolation network for high-frame-rate generation.

Figure 6: Make-A-Video workflow diagram.

The final video inference scheme can be written mathematically as:

$$\hat{\mathbf{y}}_t = \mathrm{SR}_h \circ \mathrm{SR}^t_l \circ \uparrow_F \circ D^t \circ P \circ (\hat{\mathbf{x}}, \mathrm{CLIP}_\text{text}(\mathbf{x}))$$

where:
- $\mathbf{x}$ is the input text;
- $\hat{\mathbf{x}}$ is the BPE-encoded text;
- $\mathrm{CLIP}_\text{text}(.)$ is the CLIP text encoder, producing the text embedding $\mathbf{x}_e$;
- $P(.)$ is the prior, which generates the image embedding $\mathbf{y}_e$ given the text embedding $\mathbf{x}_e$ and the BPE-encoded text $\hat{\mathbf{x}}$: $\mathbf{y}_e = P(\mathbf{x}_e, \hat{\mathbf{x}})$. This part is trained on text-image pair data and is not fine-tuned on video data;
- $D^t(.)$ is the spatiotemporal decoder, which generates a series of 16 video frames, each a low-resolution 64x64 RGB image;
- $\uparrow_F(.)$ is the frame interpolation network, which effectively increases the frame rate by interpolating between generated frames. It is a model fine-tuned for the task of predicting masked frames, used for video upsampling;
- $\mathrm{SR}^t_l(.)$ and $\mathrm{SR}_h(.)$ are the spatiotemporal and spatial super-resolution models, increasing the resolution to 256x256 and 768x768 respectively;
- $\hat{\mathbf{y}}_t$ is the final generated video.
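Reading the composition right to left, a hypothetical sketch of the pipeline as plain function composition (all component names are placeholders standing in for the trained modules, not Make-A-Video's code):

```python
# Hypothetical stand-ins for Make-A-Video's trained components, to make the
# order of the composition explicit; each would be a neural network in practice.
def make_a_video(x, bpe_encode, clip_text, prior, decoder_st, interp_f, sr_l_st, sr_h):
    x_hat = bpe_encode(x)          # BPE-encoded text
    x_e = clip_text(x)             # text embedding
    y_e = prior(x_e, x_hat)        # image embedding from the prior
    y_low = decoder_st(y_e)        # 16 frames at 64x64
    y_interp = interp_f(y_low)     # higher frame rate via interpolation
    return sr_h(sr_l_st(y_interp)) # spatiotemporal SR to 256x256, then spatial SR to 768x768
```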
The spatiotemporal super-resolution layers contain pseudo-3D convolutional layers and pseudo-3D attention layers:
Pseudo-3D convolutional layers: each spatial 2D convolutional layer (initialized from the pre-trained image model) is followed by a temporal 1D convolutional layer (initialized as the identity function). Conceptually, the 2D convolutional layer first generates multiple frames, which are then reshaped into a video.
Pseudo-3D attention layers: a temporal attention layer is stacked after each (pre-trained) spatial attention layer to approximate full spatiotemporal attention. Figure 7: How pseudo-3D convolution (left) and attention (right) layers work.
They can be expressed as:

$$\mathrm{Conv}_{P3D}(\mathbf{h}) = \mathrm{Conv}_{1D}\big(\mathrm{Conv}_{2D}(\mathbf{h}) \circ T\big) \circ T$$
$$\mathrm{Attn}_{P3D}(\mathbf{h}) = \mathrm{flatten}^{-1}\Big(\mathrm{Attn}_{1D}\big(\mathrm{Attn}_{2D}(\mathrm{flatten}(\mathbf{h})) \circ T\big) \circ T\Big)$$

where the input tensor is $\mathbf{h} \in \mathbb{R}^{B \times C \times F \times H \times W}$ (corresponding to batch size, number of channels, number of frames, height and width); $\circ T$ swaps the temporal and spatial dimensions; $\mathrm{flatten}(.)$ is a matrix operator that converts $\mathbf{h}$ into $\mathbf{h}' \in \mathbb{R}^{B \times C \times F \times HW}$, and $\mathrm{flatten}^{-1}(.)$ reverses it.
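A minimal sketch of a pseudo-3D convolution block, following the "2D conv with frames folded into the batch, then 1D temporal conv initialized as the identity" pattern described above (layer shapes and the specific identity initialization are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    # 2D spatial conv (reusable from a pre-trained image model) followed by a
    # 1D temporal conv initialized to the identity, so that at initialization
    # the block behaves exactly like the image model applied per frame.
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        nn.init.dirac_(self.temporal.weight)     # identity initialization
        nn.init.zeros_(self.temporal.bias)

    def forward(self, h):                        # h: (B, C, F, H, W)
        b, c, f, hh, ww = h.shape
        x = self.spatial(h.transpose(1, 2).reshape(b * f, c, hh, ww))
        x = x.reshape(b, f, c, hh, ww).permute(0, 3, 4, 2, 1)      # (B, H, W, C, F)
        x = self.temporal(x.reshape(b * hh * ww, c, f))
        return x.reshape(b, hh, ww, c, f).permute(0, 3, 4, 1, 2)   # back to (B, C, F, H, W)
```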
When training, the different components of the Make-A-Video workflow are trained separately.
1. The decoder $D^t$, the prior $P$ and the two super-resolution components are first trained on images alone, without paired text.
2. Next, the new temporal layers are added, initialized as the identity function, and then fine-tuned on unlabeled video data.

Tune-A-Video, proposed by Wu et al. in 2023, inflates a pre-trained image diffusion model to enable one-shot video fine-tuning: given a video containing $m$ frames, $\mathcal{V} = \{v_i \mid i = 1, \dots, m\}$, paired with a descriptive prompt $\tau$, the goal is to generate a new video $\mathcal{V}^*$ from a slightly edited and related text prompt $\tau^*$. For example, $\tau$ = "A man is skiing" can be extended to $\tau^*$ = "Spiderman is skiing on the beach". Tune-A-Video is designed for object editing, background modification, and style transfer.
In addition to inflating the 2D convolutional layers, Tune-A-Video's U-Net architecture also incorporates an ST-Attention (spatio-temporal attention) module, which achieves temporal consistency by querying relevant positions in previous frames. Given the latent features of frame $v_i$, the previous frame $v_{i-1}$ and the first frame $v_1$, which are projected into a query $\mathbf{Q}$, key $\mathbf{K}$ and value $\mathbf{V}$, ST-Attention is defined as:

$$\mathbf{Q} = \mathbf{W}^Q \mathbf{z}_{v_i}, \quad \mathbf{K} = \mathbf{W}^K [\mathbf{z}_{v_1}; \mathbf{z}_{v_{i-1}}], \quad \mathbf{V} = \mathbf{W}^V [\mathbf{z}_{v_1}; \mathbf{z}_{v_{i-1}}]$$
$$\mathbf{O} = \mathrm{softmax}\Big(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\Big)\mathbf{V}$$
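A rough PyTorch sketch of this sparse cross-frame attention pattern, with keys and values built from the concatenated first and previous frames; the projection shapes and the flattened spatial token layout are assumptions:

```python
import torch
import torch.nn as nn

class STAttention(nn.Module):
    # Query comes from the current frame; keys/values come from the first and
    # previous frames concatenated, as in Tune-A-Video's ST-Attention.
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, z_cur, z_first, z_prev):          # each: (B, N, C) spatial tokens
        q = self.to_q(z_cur)
        kv_src = torch.cat([z_first, z_prev], dim=1)     # (B, 2N, C)
        k, v = self.to_k(kv_src), self.to_v(kv_src)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v
```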
Figure 8: Overview of the Tune-A-Video architecture. It first runs a lightweight fine-tuning stage on a single video before the sampling stage. Note that the temporal self-attention (T-Attn) layers are fine-tuned in full, since they are newly added, but during fine-tuning only the query projections in ST-Attn and Cross-Attn are updated, in order to preserve the prior text-to-image knowledge. ST-Attn improves spatiotemporal consistency, while Cross-Attn improves text-video alignment.
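The parameter-selection logic can be sketched as follows, for a hypothetical U-Net whose module names contain "attn_temp" for the new temporal attention, "attn1" for ST-Attn and "attn2" for Cross-Attn; the naming convention is an assumption, not Tune-A-Video's actual code:

```python
def select_trainable_params(unet):
    # Freeze everything, then unfreeze: all newly added temporal attention
    # layers, plus only the query projections of ST-Attn / Cross-Attn.
    for p in unet.parameters():
        p.requires_grad = False
    for name, p in unet.named_parameters():
        if "attn_temp" in name:                        # new temporal self-attention: train fully
            p.requires_grad = True
        elif ("attn1" in name or "attn2" in name) and "to_q" in name:
            p.requires_grad = True                     # only query projections elsewhere
    return [p for p in unet.parameters() if p.requires_grad]
```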
The Gen-1 model (Runway), proposed by Esser et al. in 2023, targets the task of editing a given video based on text input. It treats the structure and content of the video separately as generation conditions: $p(\mathbf{x} \mid s, c)$. However, cleanly separating these two aspects is not easy.
Content $c$ refers to the appearance and semantics of the video; it is sampled from text for conditional editing. CLIP embeddings of video frames represent content well and stay largely orthogonal to structural features.
Structure $s$ describes geometric properties and dynamics, including the shapes, positions and temporal changes of objects; $s$ is sampled from the input video. Depth estimation or other task-specific auxiliary information (such as human pose or face identity for human video synthesis) can be used.
The architectural changes in Gen-1 are fairly standard: a 1D temporal convolutional layer is added after every 2D spatial convolutional layer in its residual blocks, and a 1D temporal attention block is added after every 2D spatial attention block in its attention modules. During training, the structure variable $s$ is concatenated with the diffusion latent variable $\mathbf{z}$, while the content variable $c$ is provided through the cross-attention layers. At inference time, the CLIP embedding is transformed by a prior that converts a CLIP text embedding into a CLIP image embedding. Figure 9: Overview of the training process of the Gen-1 model.
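Concretely, the conditioning scheme could look roughly like this; a hedged sketch in which `unet`, the latent shapes and the use of an encoded depth map as the structure signal are assumptions based on the description above:

```python
import torch

def gen1_style_denoise(unet, z_t, depth_latent, clip_image_emb, t):
    # Structure s (e.g. an encoded depth map) is concatenated with the noisy
    # latent along channels; content c (a CLIP image embedding) is passed to
    # the cross-attention layers as conditioning context.
    model_input = torch.cat([z_t, depth_latent], dim=1)   # (B, C_z + C_s, F, H, W)
    return unet(model_input, t, context=clip_image_emb)   # predicted denoised output
```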
Video LDM, proposed by Blattmann et al. in 2023, first trains an LDM (Latent Diffusion Model) image generator. The model is then fine-tuned to produce videos with an added temporal dimension. This fine-tuning only applies to the newly added temporal layers, on encoded image sequences. The temporal layers (see Figure 10) in Video LDM are interleaved with the existing spatial layers, and those spatial layers stay frozen during fine-tuning. In other words, only the new parameters $\phi$ are fine-tuned, not the pre-trained image backbone parameters $\theta$. The Video LDM pipeline first generates low-frame-rate keyframes and then increases the frame rate through a 2-step latent frame-interpolation process. An input sequence of length $T$ is interpreted as a batch of images (i.e. $B \cdot T$) for the base image model $\theta$ and then reshaped into video format for the temporal layers. A skip connection combines the temporal layer output $\mathbf{z}'$ with the spatial output $\mathbf{z}$ through a learned merge parameter $\alpha$. In practice, two types of temporal mixing layers are implemented: (1) temporal attention and (2) residual blocks based on 3D convolutions.
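The learned merge of the frozen spatial path and the new temporal path can be sketched like this; a minimal illustration assuming a sigmoid-constrained scalar mixing parameter per layer, with the layer internals left as placeholders:

```python
import torch
import torch.nn as nn

class TemporalMixingLayer(nn.Module):
    # Wraps a frozen spatial layer with a new temporal layer and mixes their
    # outputs via a learned parameter alpha: out = alpha * z + (1 - alpha) * z'.
    def __init__(self, spatial_layer, temporal_layer):
        super().__init__()
        self.spatial = spatial_layer                 # pre-trained, kept frozen
        self.temporal = temporal_layer               # newly added, trainable
        self.alpha = nn.Parameter(torch.tensor(3.0)) # sigmoid(3) ~ 0.95: start near the spatial path

    def forward(self, x, num_frames):
        z = self.spatial(x)                     # treats (B*T) frames as a batch of images
        z_prime = self.temporal(z, num_frames)  # reshaped internally to video format
        a = torch.sigmoid(self.alpha)           # keep the mixing weight in [0, 1]
        return a * z + (1 - a) * z_prime
```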
Figure 10: A pre-trained LDM for image synthesis extended into a video generator. $B, T, C, H, W$ are the batch size, sequence length, number of channels, height and width respectively. $\mathbf{c}_S$ is an optional conditioning/context frame.
However, LDM's pre-trained autoencoder still has a problem: it has only ever seen images, never videos. Using it directly to generate video produces flickering artifacts with poor temporal consistency. Video LDM therefore adds extra temporal layers to the decoder and fine-tunes them on video data with a patch-wise temporal discriminator built from 3D convolutions, while the encoder remains unchanged so that the pre-trained LDM can still be reused. During temporal decoder fine-tuning, the frozen encoder processes each frame of the video independently, and a video-aware discriminator enforces temporally consistent reconstructions across frames.

Figure 11: The training workflow of the autoencoder in Video LDM. The decoder is fine-tuned for temporal consistency with a new cross-frame discriminator, while the encoder stays unchanged.

Similar to Video LDM, the architecture of Stable Video Diffusion (SVD), proposed by Blattmann et al. in 2023, is also based on an LDM with temporal layers inserted after every spatial convolution and attention layer, but SVD fine-tunes the entire model. Training the video LDM proceeds in three stages: 1. Text-to-image pre-training matters a lot, helping to improve quality and prompt following. 2. Separating out a video pre-training stage is beneficial, and this should ideally happen on a larger, curated dataset. 3. High-quality video fine-tuning then uses a smaller, pre-captioned video set with high visual fidelity.

SVD specifically emphasizes the critical role of dataset curation on model performance. They applied a cut-detection pipeline to obtain more clips per video and then applied three different captioning models to them: (1) CoCa for the mid-frame, (2) V-BLIP for video captions, and (3) an LLM-based caption combining the first two. They further improved the video dataset by removing clips with little motion (filtered by a low optical flow score computed at 2 fps), cleaning up clips with excessive text (detected with optical character recognition), and removing clips that are not visually appealing enough (annotating the first, middle and last frames of each clip with CLIP embeddings and computing aesthetic scores and text-image similarity). Experiments show that training on a filtered, higher-quality dataset leads to better model quality, even when that dataset is much smaller.

For methods that first generate distant keyframes and then interpolate with temporal super-resolution, the key challenge is how to maintain high-quality temporal consistency. Lumiere, proposed by Bar-Tal et al. in 2024, instead uses a space-time U-Net (STUNet) architecture that generates the entire temporal duration of a video in a single pass, removing the need for a TSR (temporal super-resolution) component. STUNet downsamples the video in both the temporal and spatial dimensions, so the expensive computation happens in a compact space-time latent space.

Figure 12: Lumiere does not require a TSR (temporal super-resolution) model. Due to memory constraints, the inflated SSR network can only operate on short segments of the video, so the SSR model runs on a set of shorter but overlapping video segments.

STUNet inflates a pre-trained text-to-image U-Net so that it can downsample and then upsample the video in both time and space.
The convolution-based blocks consist of the pre-trained text-to-image layers followed by factorized space-time convolutions, and the attention-based blocks at the coarsest U-Net level consist of the pre-trained text-to-image blocks followed by temporal attention. Only the newly added layers need further training.

Training-free adaptation
Somewhat surprisingly, it is also possible to make a pre-trained text-to-image model output videos without any training at all. If we simply sample a sequence of latent codes at random and then build a video from the corresponding decoded images, there is no guarantee that objects and semantics will be consistent in time. Text2Video-Zero, proposed by Khachatryan et al. in 2023, enables zero-shot, training-free video generation by equipping a pre-trained image diffusion model with two key mechanisms for temporal consistency:
1. Sampling the sequence of latent codes with motion dynamics, to keep the global scene and background consistent in time.
2. Reprogramming frame-level self-attention with a new cross-frame attention (each frame attends to the first frame), to preserve the context, appearance and identity of foreground objects.

Figure 14: Schematic of the Text2Video-Zero workflow.
1. Define a direction $\delta = (\delta_x, \delta_y) \in \mathbb{R}^2$ to control the global scene and camera motion; by default, $\delta = (1, 1)$. Also define a hyperparameter $\lambda > 0$ that controls the amount of global motion.
2. First, randomly sample the latent code of the first frame, $x^1_T \sim \mathcal{N}(0, I)$.
3. Perform $\Delta t \ge 0$ DDIM backward update steps with a pre-trained image diffusion model (the paper uses Stable Diffusion (SD)) to obtain the corresponding latent code $x^1_{T'}$, where $T' = T - \Delta t$.
4. For each frame in the latent code sequence, apply a warping operation defined by the translation vector $\delta^k = \lambda(k - 1)\delta$ to obtain $\tilde{x}^k_{T'}$.
5. Finally, apply DDIM forward steps to all $\tilde{x}^k_{T'}$ to obtain $x^k_T$, as sketched in the code below.

In addition, Text2Video-Zero replaces the self-attention layers of the pre-trained SD model with a new cross-frame attention mechanism that attends to the first frame. The goal is to preserve the context, appearance and identity of foreground objects throughout the generated video. Optionally, a background mask can be used to make the video background smoother and further improve background consistency. Suppose we have obtained the corresponding foreground mask $M^k$ of frame $k$ with some existing method; background smoothing then merges the actual latent code with the latent code warped on the background:

$$\bar{x}^k_t = M^k \odot x^k_t + (1 - M^k) \odot \big(\alpha \tilde{x}^k_t + (1 - \alpha)x^k_t\big)$$

where $x^k_t$ is the actual latent code, $\tilde{x}^k_t$ is the latent code warped on the background, and $\alpha$ is a hyperparameter, set to $\alpha = 0.6$ in the paper's experiments.

Text2Video-Zero can also be combined with ControlNet: at each diffusion time step $t = T, \dots, 1$, the pre-trained ControlNet copy branch is applied to each frame $x^k_t$ ($k = 1, \dots, m$), and the ControlNet branch outputs are added to the skip connections of the main U-Net.
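Returning to step 4 above, here is a rough sketch of the motion-dynamics warping of the latent codes, using an integer shift via torch.roll as a simplification of the warping operation; the frame count, latent shape, default motion amount and the shift rounding are assumptions:

```python
import torch

def motion_dynamics_latents(x1_Tp, num_frames, delta=(1.0, 1.0), lam=8.0):
    # x1_Tp: latent of the first frame after the DDIM backward steps, (C, H, W).
    # Each subsequent frame's latent is a globally translated copy of it,
    # shifted by delta^k = lam * (k - 1) * delta (here rounded to whole latent pixels).
    frames = [x1_Tp]
    for k in range(2, num_frames + 1):
        dx = int(round(lam * (k - 1) * delta[0]))
        dy = int(round(lam * (k - 1) * delta[1]))
        frames.append(torch.roll(x1_Tp, shifts=(dy, dx), dims=(-2, -1)))
    return torch.stack(frames)     # (num_frames, C, H, W), to be pushed back to x^k_T via DDIM forward steps
```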
ControlVideo, proposed by Zhang et al. in 2023, aims to generate videos conditioned on a text prompt and a motion sequence (such as depth or edge maps). It is adapted from ControlNet, with three new mechanisms added:
1. Fully cross-frame attention: adds full cross-frame interaction in the self-attention modules. It introduces interaction between all frames by mapping the latent frames of all frames into the $Q, K, V$ matrices, unlike Text2Video-Zero, which has all frames attend only to the first frame.
2. An interleaved-frame smoother reduces flicker by applying frame interpolation on alternating frames. At each time step $t$, the smoother interpolates the even or odd frames to smooth their corresponding three-frame clips. Note that the number of frames decreases over time after the smoothing steps.
3. A hierarchical sampler ensures temporal consistency for long videos under memory constraints. A long video is split into several short videos, and a keyframe is selected for each. The model pre-generates these keyframes with full cross-frame attention for long-range consistency, and each corresponding short video is then synthesized sequentially, conditioned on these keyframes.

Original link: https://lilianweng.github.io/posts/2024-04-12-diffusion-video/