Two days ago, Turing Award winner Yann LeCun reposted the long-form comic "Go to the Moon and Explore Yourself", which sparked heated discussion among netizens.
The comic comes from the paper "StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation", in which a research team from Nankai University, ByteDance, and other institutions proposed a new method called StoryDiffusion for generating consistent images and videos that depict complex stories.
The project has already gained 1k stars on GitHub.
GitHub address: https://github.com/HVision-NKU/StoryDiffusion
According to the project demo, StoryDiffusion can generate comics in a variety of styles, telling a coherent story while keeping character appearance and clothing consistent.
StoryDiffusion can maintain the identities of multiple characters simultaneously and generate consistent characters across a series of images.
In addition, StoryDiffusion can generate high-quality videos conditioned on the generated consistent images or on user-provided images.
For diffusion-based generative models, maintaining content consistency across a series of generated images, especially images containing complex subjects and details, is a significant challenge.
To address this, the research team proposed a new way of computing self-attention, called Consistent Self-Attention, which establishes connections among the images within a batch during generation to keep characters consistent, so that subject-consistent images can be generated without any training.
To extend the method to long video generation, the team introduced a Semantic Motion Predictor, which encodes images into a semantic space and predicts motion in that space to generate videos. This is more stable than motion prediction based only on the latent space.
The two components are then integrated into a single framework: by combining Consistent Self-Attention with the Semantic Motion Predictor, StoryDiffusion can generate consistent videos that tell complex stories, and these videos are smoother and more coherent than those produced by existing methods.
Figure 1: Images and videos generated by the team's StoryDiffusion
The research team’s method can be divided into two stages, as shown in Figures 2 and 3.
In the first stage, StoryDiffusion uses Consistent Self-Attention to generate subject-consistent images in a training-free manner. These consistent images can be used directly for storytelling or serve as input to the second stage, in which StoryDiffusion creates consistent transition videos based on them.
Figure 2: Overview of the StoryDiffusion pipeline for generating subject-consistent images
Figure 3: Overview of the approach for generating transition videos from subject-consistent images
The research team first describes how to generate subject-consistent images without training. The key to this problem is how to maintain character consistency within a batch of images, which means that connections must be established among the images of a batch during generation.
After re-examining the roles of different attention mechanisms in diffusion models, they were inspired to explore the use of self-attention to maintain consistency within a batch of images, which led to Consistent Self-Attention.
The team inserts Consistent Self-Attention at the position of the original self-attention in the U-Net architecture of an existing image generation model and reuses the original self-attention weights, keeping the method training-free and plug-and-play.
Given the paired tokens, the method performs self-attention across a batch of images, promoting interaction among the features of different images. This interaction drives the model to converge on consistent characters, faces, and clothing during generation. Although Consistent Self-Attention is simple and requires no training, it effectively generates subject-consistent images.
To illustrate the method more clearly, the team provides pseudocode in Algorithm 1.
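The paper's Algorithm 1 is not reproduced here, but the core idea can be sketched as follows: for each image in the batch, tokens sampled from the other images are concatenated to its own tokens before the key/value projections, so attention mixes features across the batch while reusing the original projection weights. The function and parameter names below (`consistent_self_attention`, `sample_rate`) are illustrative assumptions, not the authors' code.

```python
import torch

def consistent_self_attention(x, w_q, w_k, w_v, sample_rate=0.5):
    """Sketch of batch-level consistent self-attention (illustrative, assumes B > 1).

    x: (B, N, C) hidden tokens for a batch of B images, N tokens each.
    w_q, w_k, w_v: (C, C) projection weights reused from the original self-attention.
    sample_rate: fraction of tokens sampled from the other images (assumed name).
    """
    B, N, C = x.shape
    num_sampled = int(N * sample_rate)

    cross_tokens = []
    for i in range(B):
        # Tokens from every image in the batch except the current one.
        others = torch.cat([x[j] for j in range(B) if j != i], dim=0)   # ((B-1)*N, C)
        idx = torch.randperm(others.shape[0])[:num_sampled]
        cross_tokens.append(others[idx])                                # (num_sampled, C)
    cross_tokens = torch.stack(cross_tokens, dim=0)                     # (B, num_sampled, C)

    # Queries come only from each image's own tokens; keys/values also
    # see the sampled cross-image tokens, linking the images in the batch.
    kv_input = torch.cat([x, cross_tokens], dim=1)                      # (B, N + num_sampled, C)
    q = x @ w_q
    k = kv_input @ w_k
    v = kv_input @ w_v

    attn = torch.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
    return attn @ v                                                     # (B, N, C)
```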
Semantic motion predictor for video generation
The research team proposed a Semantic Motion Predictor, which encodes images into an image semantic space to capture spatial information, thereby achieving more accurate motion prediction from a given start frame and end frame.
More specifically, in the Semantic Motion Predictor, a function E is first used to map RGB images into vectors in the image semantic space, where spatial information is encoded.
Rather than directly using a linear layer as the function E, the team used a pre-trained CLIP image encoder as E to take advantage of its zero-shot capability and improve performance.
Using function E, the given start frame F_s and end frame F_e are compressed into image semantic space vectors K_s and K_e.
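As a rough sketch of how such an encoder and predictor might be wired together (the checkpoint, architecture, and layer sizes below are assumptions for illustration, not the paper's specification), one could use a pre-trained CLIP vision encoder from the `transformers` library as E and a small transformer that predicts intermediate-frame embeddings conditioned on K_s and K_e:

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

# Assumed checkpoint; the article does not specify which CLIP weights are used.
encoder_name = "openai/clip-vit-large-patch14"
processor = CLIPImageProcessor.from_pretrained(encoder_name)
clip_vision = CLIPVisionModel.from_pretrained(encoder_name).eval()

@torch.no_grad()
def encode_frame(image):
    """E: map an RGB PIL image to patch-level vectors in the image semantic space."""
    pixels = processor(images=image, return_tensors="pt").pixel_values
    # last_hidden_state: (1, num_patches + 1, hidden_dim) -- used here as K_s / K_e.
    return clip_vision(pixel_values=pixels).last_hidden_state

class SemanticMotionPredictor(torch.nn.Module):
    """Minimal transformer that predicts per-frame embeddings between K_s and K_e
    (sizes are illustrative, not the paper's)."""
    def __init__(self, dim=1024, num_frames=16, depth=4, heads=8):
        super().__init__()
        self.frame_queries = torch.nn.Parameter(torch.randn(num_frames, dim))
        layer = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = torch.nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, k_s, k_e):
        # k_s, k_e: (1, T, dim) semantic tokens of the start and end frames.
        queries = self.frame_queries.unsqueeze(0)          # (1, num_frames, dim)
        seq = torch.cat([k_s, queries, k_e], dim=1)        # condition on both endpoints
        out = self.transformer(seq)
        # Keep only the predicted tokens for the intermediate frames.
        return out[:, k_s.shape[1]:k_s.shape[1] + queries.shape[1], :]
```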
For subject-consistent image generation, since the team's method is training-free and plug-and-play, they implemented it on both Stable Diffusion XL and Stable Diffusion 1.5. For a fair comparison with the other models, they ran the comparisons on the Stable Diffusion XL model with the same pre-trained weights.
For consistent video generation, the researchers implemented their method on a model based on Stable Diffusion 1.5 and integrated a pre-trained temporal module to support video generation. All compared models use a classifier-free guidance scale of 7.5 and 50-step DDIM sampling.
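The authors' code is not shown here, but the reported sampling settings (classifier-free guidance of 7.5, 50 DDIM steps, generation in batches) can be reproduced as a minimal sketch with the public `diffusers` SDXL base checkpoint, which is an assumption rather than the team's exact setup:

```python
import torch
from diffusers import StableDiffusionXLPipeline, DDIMScheduler

# Public SDXL base weights as a stand-in for the checkpoint used in the paper.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

images = pipe(
    prompt="a young adventurer in a red jacket walking on the moon, comic style",
    guidance_scale=7.5,        # classifier-free guidance score reported in the article
    num_inference_steps=50,    # 50-step DDIM sampling
    num_images_per_prompt=4,   # a batch, since consistency is enforced within a batch
).images
```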
Comparison of consistent image generation
The team compared their method with two recent identity-preservation methods, IP-Adapter and PhotoMaker, to evaluate its ability to generate subject-consistent images.
To test performance, they used GPT-4 to generate twenty character prompts and one hundred activity prompts describing specific activities.
Qualitative results are shown in Figure 4: "StoryDiffusion is able to generate highly consistent images, while other methods, such as IP-Adapter and PhotoMaker, may produce images with inconsistent clothing or reduced text controllability."
Figure 4: Comparison results with current methods on consistent image generation
The researchers present the quantitative comparison results in Table 1. The results show: "The team's StoryDiffusion achieved the best performance on both quantitative metrics, indicating that the method fits the prompt descriptions well while preserving character characteristics, and demonstrating its robustness."
Table 1: Quantitative comparison results of consistent image generation
Comparison of transition video generation
For transition video generation, the research team compared their method with two state-of-the-art methods, SparseCtrl and SEINE, to evaluate performance.
They conducted a qualitative comparison of transition video generation, with the results shown in Figure 5: "The team's StoryDiffusion is significantly better than SEINE and SparseCtrl, and the generated transition videos are smooth and physically plausible."
Figure 5: Comparison of transition video generation with current state-of-the-art methods
They also quantitatively compared the method with SEINE and SparseCtrl using four metrics, LPIPS-first, LPIPS-frames, CLIPSIM-first, and CLIPSIM-frames, with the results shown in Table 2.
Table 2: Quantitative comparison with the current state-of-the-art transition video generation model
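As a hedged illustration of how such metrics are typically computed, interpreting "-first" as comparing each frame with the first frame and "-frames" as comparing adjacent frames (an assumption, not the paper's stated definition), one could use the `lpips` package together with a CLIP image encoder:

```python
import torch
import lpips
from transformers import CLIPImageProcessor, CLIPModel

lpips_fn = lpips.LPIPS(net="alex")   # perceptual distance between frames
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_sim(img_a, img_b):
    """Cosine similarity between the CLIP image embeddings of two PIL frames."""
    feats = clip.get_image_features(
        pixel_values=clip_proc(images=[img_a, img_b], return_tensors="pt").pixel_values
    )
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()

@torch.no_grad()
def video_metrics(frames_pil, frames_tensor):
    """frames_pil: list of PIL frames; frames_tensor: (T, 3, H, W) in [-1, 1] for LPIPS."""
    T = len(frames_pil)
    lpips_first = torch.stack(
        [lpips_fn(frames_tensor[0:1], frames_tensor[i:i + 1]) for i in range(1, T)]
    ).mean()
    lpips_frames = torch.stack(
        [lpips_fn(frames_tensor[i:i + 1], frames_tensor[i + 1:i + 2]) for i in range(T - 1)]
    ).mean()
    clipsim_first = sum(clip_sim(frames_pil[0], f) for f in frames_pil[1:]) / (T - 1)
    clipsim_frames = sum(clip_sim(frames_pil[i], frames_pil[i + 1]) for i in range(T - 1)) / (T - 1)
    return lpips_first.item(), lpips_frames.item(), clipsim_first, clipsim_frames
```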
For more technical and experimental details, please refer to the original paper.