


Stable Diffusion 3 technical report is out, and the Sora-style architecture delivers again! Is the open source community crushing Midjourney and DALL·E 3?
Following the release of Stable Diffusion 3, Stability AI has published a detailed technical report.
The paper provides an in-depth look at the core technology of Stable Diffusion 3: an improved diffusion formulation and a new DiT-based text-to-image architecture.
Report address: https://www.php.cn/link/e5fb88b398b042f6cccce46bf3fa53e8
In human evaluations, Stable Diffusion 3 surpassed DALL·E 3, Midjourney v6, and Ideogram v1 in typography and in accurately following prompts.
Stability AI’s newly developed Multimodal Diffusion Transformer (MMDiT) architecture uses independent weight sets for the image and language representations. Compared with earlier versions of Stable Diffusion, this brings significant improvements in text comprehension and spelling.
Performance Evaluation
Based on human preference feedback, the technical report evaluates SD 3 in detail against a large set of open source models (SDXL, SDXL Turbo, Stable Cascade, Playground v2.5, and PixArt-α) as well as closed source models (DALL·E 3, Midjourney v6, and Ideogram v1).
Evaluators picked the best output among the models based on adherence to the given prompt, legibility of rendered text, and the overall aesthetics of the images.
The results show that Stable Diffusion 3 matches or exceeds the current state of the art in text-to-image generation in prompt following, text rendering, and visual quality.
The 8B-parameter SD 3 model, without any hardware-specific optimization, runs on a consumer RTX 4090 GPU with 24GB of VRAM; using 50 sampling steps, it takes 34 seconds to generate a 1024x1024 image.
In addition, Stable Diffusion 3 will ship in multiple versions at release, with parameters ranging from 800 million to 8 billion, further lowering the hardware barrier to entry.
Architecture details revealed
In text-to-image generation, the model must process two different types of information, text and image, at the same time. That is why the authors call the new architecture MMDiT, short for Multimodal Diffusion Transformer.
Like previous versions of Stable Diffusion, SD 3 uses pretrained models to extract suitable representations of text and images.
Specifically, it uses three different text encoders (two CLIP models and a T5) to process text, and an improved autoencoder to encode images.
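As a rough illustration of how three encoder outputs can feed one conditioning sequence, here is a minimal sketch. The token counts and channel widths below are assumptions for illustration, not values confirmed by the article:

```python
import numpy as np

def combine_text_embeddings(clip_l, clip_g, t5):
    """Sketch: concatenate two CLIP token embeddings channel-wise,
    zero-pad them to the T5 width, then join with the T5 tokens
    along the sequence axis to form one conditioning sequence."""
    clip = np.concatenate([clip_l, clip_g], axis=-1)   # join CLIP channels
    pad = np.zeros((clip.shape[0], t5.shape[-1] - clip.shape[-1]))
    clip = np.concatenate([clip, pad], axis=-1)        # pad to T5 width
    return np.concatenate([clip, t5], axis=0)          # join along sequence

# Hypothetical shapes: 77 tokens each, CLIP widths 768/1280, T5 width 4096.
context = combine_text_embeddings(
    np.ones((77, 768)), np.ones((77, 1280)), np.ones((77, 4096)))
```

The result is a single token sequence the diffusion backbone can attend over.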
The SD 3 architecture builds on the Diffusion Transformer (DiT). Because text and image information differ, SD 3 assigns an independent set of weights to each of the two modalities.
This design amounts to giving each modality its own Transformer, but the token sequences of the two modalities are merged when the attention mechanism is executed, so each stream operates independently while still referencing and integrating with the other.
Through this architecture, image and text information can flow between the two streams and interact, improving both overall comprehension of the prompt and the typography in the generated output.
Moreover, this architecture can be easily extended to other modalities including video in the future.
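The two-streams-with-joint-attention idea described above can be sketched in a few lines. This is a simplified single-head illustration under assumed dimensions, not the report's actual implementation (which includes timestep conditioning, multiple heads, and MLP blocks):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class JointAttentionSketch:
    """Modality-specific projections, one shared attention operation."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = dim
        # Independent weight sets for the text and image streams.
        self.w = {m: {k: rng.normal(0, dim ** -0.5, (dim, dim))
                      for k in ("q", "k", "v", "o")}
                  for m in ("text", "image")}

    def __call__(self, text, image):
        # Each modality is projected with its own weights...
        q = np.concatenate([text @ self.w["text"]["q"], image @ self.w["image"]["q"]])
        k = np.concatenate([text @ self.w["text"]["k"], image @ self.w["image"]["k"]])
        v = np.concatenate([text @ self.w["text"]["v"], image @ self.w["image"]["v"]])
        # ...but attention runs over the merged sequence, so text and
        # image tokens can attend to each other.
        attn = softmax(q @ k.T / np.sqrt(self.dim)) @ v
        n = len(text)
        # Split back and apply modality-specific output projections.
        return attn[:n] @ self.w["text"]["o"], attn[n:] @ self.w["image"]["o"]

block = JointAttentionSketch(dim=64)
rng = np.random.default_rng(1)
t_out, i_out = block(rng.normal(size=(7, 64)), rng.normal(size=(16, 64)))
```

The key design choice is visible here: the weights stay separate, but the attention map spans both modalities.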
Thanks to SD 3’s improvements in prompt following, the model can accurately generate images focused on a wide variety of subjects and attributes while remaining highly flexible in image style.
Improving Rectified Flow through re-weighting method
In addition to the new Diffusion Transformer architecture, SD 3 also makes significant improvements to the diffusion model itself.
SD 3 adopts the Rectified Flow (RF) strategy to connect the training data and noise along a straight trajectory.
This method makes the model’s inference path more direct, so the sample generation can be completed in fewer steps.
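Because the trajectory is a straight line, sampling reduces to integrating a learned velocity field from noise back to data. A minimal Euler sampler (a generic sketch, not SD 3's exact sampler) looks like this:

```python
import numpy as np

def euler_sample(velocity_fn, noise, n_steps):
    """Integrate from pure noise at t=1 back toward data at t=0,
    stepping along the velocity the model predicts."""
    x = noise
    for i in range(n_steps):
        t = 1.0 - i / n_steps
        x = x - velocity_fn(x, t) / n_steps  # dx/dt = v, step size 1/n_steps
    return x

# For a perfectly straight (rectified) path the velocity is constant
# (noise - data), so even a single Euler step recovers the data exactly.
x0 = np.array([1.0, 2.0, 3.0])
noise = np.array([0.5, -0.5, 0.0])
one_step = euler_sample(lambda x, t: noise - x0, noise, n_steps=1)
```

This is why straighter inference paths allow sampling in fewer steps: the straighter the path, the less the Euler discretization error matters.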
The authors introduce a novel trajectory-sampling schedule during training that up-weights the middle portion of the trajectory, where the prediction task is more challenging.
By comparing against 60 other diffusion trajectories (such as LDM, EDM, and ADM), the authors found that although the previous RF formulation performs well with few sampling steps, its relative performance slowly declines as the number of sampling steps increases.
To avoid this, the authors' re-weighted RF method improves model performance consistently across step counts.
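A rectified-flow training pair with mid-trajectory emphasis can be sketched as follows. The use of a logit-normal distribution to concentrate timesteps around the middle of the path is an assumption for illustration of the re-weighting idea, not a detail stated in this article:

```python
import numpy as np

def rf_training_example(x0, rng, scale=1.0):
    """One training example for a rectified-flow model: a point on the
    straight data->noise path and the constant velocity target along it.
    Sampling t from a logit-normal puts more weight on the middle of
    the trajectory, where prediction is hardest."""
    noise = rng.normal(size=x0.shape)
    # Sigmoid of a Gaussian concentrates t around 0.5.
    t = 1.0 / (1.0 + np.exp(-scale * rng.normal()))
    x_t = (1.0 - t) * x0 + t * noise  # straight-line interpolation
    target = noise - x0               # velocity the model should predict
    return t, x_t, target

rng = np.random.default_rng(0)
x0 = np.ones(4)
t, x_t, v = rf_training_example(x0, rng)
```

The model is then trained to regress `v` from `x_t` and `t`; at inference, integrating the predicted velocity retraces the straight path back to data.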
Scaling the RF Transformer
Stability AI trained multiple models of different sizes, from 15 blocks with 450M parameters up to 38 blocks with 8B parameters, and found that validation loss decreases smoothly with both model size and training steps.
To verify whether this meant a substantial improvement in model output, they also evaluated automatic image alignment metrics and human preference scores.
The results show that these metrics correlate strongly with validation loss, indicating that validation loss is an effective measure of overall model performance.
Moreover, this scaling trend shows no sign of saturation, making the authors optimistic about further performance gains in the future.
The authors trained models with different parameter counts for 500k steps at a resolution of 256x256 pixels with a batch size of 4096.
The figure above illustrates the effect of training larger models for longer on sample quality.
The table above shows GenEval results. With the authors' proposed training method and increased training-image resolution, the largest model performs well in most categories and surpasses DALL·E 3 in overall score.
According to the authors' comparison of different architectures, MMDiT proved highly effective, outperforming DiT, CrossDiT, and UViT.
Flexible text encoder
By removing the memory-intensive 4.7B parameter T5 text encoder during the inference phase, SD 3's memory requirements are significantly reduced with minimal performance loss.
Removing this text encoder does not affect the visual aesthetics of the images (50% win rate without T5) and only slightly reduces prompt adherence (46% win rate).
However, to get the full benefit of SD 3's text-rendering ability, the authors still recommend using the T5 encoder, because they found that without it, typography performance drops more substantially (38% win rate).
Netizen reactions
Netizens seem a little impatient with Stability AI teasing the model without letting anyone use it, and are urging the company to put it online quickly for everyone to try.
After reading the technical report, netizens said that image generation looks set to be the first arena where open source overwhelms closed source!
