Stable Diffusion 3 technical report is out, and the Sora architecture scores again! Is open source trouncing Midjourney and DALL·E 3 in image generation?

Mar 06, 2024, 04:22 PM

After announcing Stable Diffusion 3, Stability AI released a detailed technical report today.

The paper gives an in-depth account of the core technology behind Stable Diffusion 3: an improved diffusion model and a new DiT-based text-to-image architecture!

Report address:

https://www.php.cn/link/e5fb88b398b042f6cccce46bf3fa53e8

In human evaluations, Stable Diffusion 3 outperformed DALL·E 3, Midjourney v6 and Ideogram v1 in typography and in following prompts accurately.

Stability AI’s newly developed Multi-Modal Diffusion Transformer (MMDiT) architecture uses independent weight sets specifically for image and language representation. Compared to earlier versions of SD 3, MMDiT has achieved significant improvements in text comprehension and spelling.

Performance Evaluation

Based on human feedback, the technical report compares SD 3 in detail against a number of open-source models (SDXL, SDXL Turbo, Stable Cascade, Playground v2.5 and Pixart-α) and against the closed-source models DALL·E 3, Midjourney v6 and Ideogram v1.

For each comparison, evaluators picked the best output based on how closely it followed the given prompt, how clearly the text was rendered, and the overall aesthetics of the image.

The test results show that Stable Diffusion 3 matches or exceeds the current state of the art in text-to-image generation in prompt following, clear rendering of text, and visual aesthetics.

The 8B-parameter SD 3 model, with no hardware optimization at all, fits on an RTX 4090 consumer GPU with 24GB of video memory and takes 34 seconds to generate a 1024x1024 image using 50 sampling steps.

In addition, Stable Diffusion 3 will ship in multiple versions, ranging from 800 million to 8 billion parameters, which further lowers the hardware barrier to using it.

Architecture details revealed

In text-to-image generation, the model has to process two different kinds of information, text and images, at the same time. This is why the authors call the new architecture MMDiT (short for Multimodal Diffusion Transformer).

Like previous versions of Stable Diffusion, SD 3 uses pre-trained models to extract suitable representations of text and images.

Specifically, the authors use three different text encoders (two CLIP models and T5) to encode the text, and an improved autoencoding model to encode the image.

The SD 3 architecture builds on the Diffusion Transformer (DiT). Because text and image representations are different in nature, SD 3 uses a separate set of weights for each of the two modalities.

This design is equivalent to having two independent Transformers, one per modality, except that the token sequences of the two modalities are joined for the attention operation. Each representation can therefore work in its own space while still taking the other into account.
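The idea of separate weights with a shared attention step can be sketched in a few lines. This is a minimal single-head illustration using NumPy, not the report's implementation: the function name `joint_attention`, the weight dictionaries, and the toy sizes are all assumptions made for the example, and details such as multiple heads, normalization, and timestep conditioning are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(text_tokens, image_tokens, w_text, w_image):
    """Single-head attention over the merged text + image sequence.

    text_tokens: (n_text, d), image_tokens: (n_image, d).
    w_text / w_image: dicts holding "q", "k", "v" matrices of shape (d, d).
    """
    # Each modality is projected with its own set of weights.
    q = np.concatenate([text_tokens @ w_text["q"], image_tokens @ w_image["q"]])
    k = np.concatenate([text_tokens @ w_text["k"], image_tokens @ w_image["k"]])
    v = np.concatenate([text_tokens @ w_text["v"], image_tokens @ w_image["v"]])
    # Attention runs over the joined sequence, so every token sees both modalities.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    out = softmax(scores) @ v
    # Split the result so each stream can continue through its own layers.
    n_text = text_tokens.shape[0]
    return out[:n_text], out[n_text:]

rng = np.random.default_rng(0)
d = 8  # toy embedding width (hypothetical)
w_text = {name: rng.normal(size=(d, d)) for name in ("q", "k", "v")}
w_image = {name: rng.normal(size=(d, d)) for name in ("q", "k", "v")}
text_out, image_out = joint_attention(
    rng.normal(size=(3, d)), rng.normal(size=(5, d)), w_text, w_image
)
print(text_out.shape, image_out.shape)  # (3, 8) (5, 8)
```

The only point where the two streams touch is the concatenation before the attention scores are computed; everything before and after it is per-modality.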

With this architecture, information can flow between image and text tokens, which improves both the overall comprehension and the visual rendering of the generated results.

Moreover, this architecture can be easily extended to other modalities including video in the future.

Thanks to SD 3's improved prompt following, the model can accurately generate images that focus on a variety of subjects and attributes while remaining highly flexible in image style.

Improving Rectified Flow through re-weighting

In addition to the new Diffusion Transformer architecture, SD 3 also makes significant improvements to the diffusion model itself.

SD 3 adopts the Rectified Flow (RF) strategy to connect the training data and noise along a straight trajectory.

This method makes the model’s inference path more direct, so the sample generation can be completed in fewer steps.
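The straight-line connection between data and noise can be illustrated with a toy example. The sketch below assumes the common convention that t = 0 is the data sample and t = 1 is pure noise, so a point on the path is (1 - t) * x0 + t * noise and the velocity along it is noise - x0. The `oracle` function is a hypothetical stand-in for a trained network, used only to show that a perfectly straight path can be integrated in very few Euler steps.

```python
import random

def interpolate(x0, noise, t):
    """Point on the straight line between data (t=0) and noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]

def velocity_target(x0, noise):
    """Constant velocity along that line: d z_t / dt = noise - x0."""
    return [b - a for a, b in zip(x0, noise)]

def euler_sample(noise, velocity_fn, num_steps):
    """Integrate from t=1 (noise) down to t=0 (data) with fixed Euler steps."""
    z = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(z, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z

random.seed(0)
x0 = [0.5, -1.0, 2.0]  # toy "data" sample (made up for illustration)
noise = [random.gauss(0.0, 1.0) for _ in x0]
target = velocity_target(x0, noise)

# Hypothetical oracle that returns the exact velocity for this one pair.
oracle = lambda z, t: target

one_step = euler_sample(noise, oracle, num_steps=1)
ten_steps = euler_sample(noise, oracle, num_steps=10)
print(one_step, ten_steps)
```

With the exact velocity, one step and ten steps land on the same point. A real network only approximates the velocity, so its learned paths are not perfectly straight and more steps can still help.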

The authors introduce a new trajectory sampling schedule into training that gives more weight to the middle part of the trajectory, where the prediction task is more challenging.

Comparing against 60 other diffusion trajectories (such as LDM, EDM, and ADM), the authors found that although previous RF formulations perform better when sampling with few steps, their relative performance declines as the number of sampling steps increases.

By contrast, the re-weighted RF variant proposed by the authors improves performance consistently.
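The article does not say how the middle of the trajectory is emphasized. One way to do it, and one of the schemes considered in the SD 3 report, is to draw the training timestep from a logit-normal distribution instead of a uniform one. The sketch below is only an illustration under that assumption; the function name and the default `mean` and `std` values are chosen for the example.

```python
import math
import random

def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by passing a Gaussian sample through a sigmoid."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))

rng = random.Random(0)
samples = [sample_logit_normal(rng) for _ in range(10_000)]

# Share of timesteps that fall in the middle half of the trajectory.
# A uniform sampler would put about 50% of its draws in this range.
middle = sum(0.25 <= t <= 0.75 for t in samples) / len(samples)
print(f"share of samples in [0.25, 0.75]: {middle:.2f}")
```

Because the density peaks at t = 0.5 and falls off toward 0 and 1, the model sees intermediate noise levels more often during training.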

Scaling the RF Transformer model

Stability AI trained multiple models of different sizes, from 15 blocks with 450M parameters to 38 blocks with 8B parameters, and found that validation loss decreases smoothly with both model size and training steps.

To verify whether this meant a substantial improvement in model output, they also evaluated automatic image alignment metrics and human preference scores.

The results show that these metrics correlate strongly with the validation loss, indicating that the validation loss is a useful measure of overall model performance.

In addition, the scaling trend shows no sign of saturation, which makes the authors optimistic that model performance can be improved further in the future.

The authors trained models with different parameter counts for 500k steps at a resolution of 256 × 256 pixels and a batch size of 4096.

A figure in the report illustrates the effect on sample quality of training a larger model for longer.

The report's GenEval results show that, with the training method proposed by the authors and a higher training image resolution, the largest model performs well in most categories and surpasses DALL·E 3 in the overall score.

In the authors' comparison of different architectures, MMDiT proved very effective, outperforming DiT, CrossDiT, and UViT.

Flexible text encoder

By removing the memory-intensive 4.7B parameter T5 text encoder during the inference phase, SD 3's memory requirements are significantly reduced with minimal performance loss.

Removing this text encoder does not affect visual aesthetics (50% win rate without T5) and only slightly reduces prompt adherence (46% win rate).

However, to take full advantage of SD 3's ability to generate written text, the authors still recommend using the T5 encoder, because without it they observed a larger drop in typography generation (38% win rate).
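A rough back-of-the-envelope calculation shows why dropping T5 matters on a 24GB card. The sketch below assumes weights are stored at 16-bit precision (2 bytes per parameter), which the article does not state, and counts only the weights, not activations, the CLIP encoders, or the autoencoder. The helper name `weight_memory_gb` is made up for the example.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in GB (10^9 bytes).

    bytes_per_param=2 assumes 16-bit weights; this is an assumption,
    not a figure from the article.
    """
    return num_params * bytes_per_param / 1e9

t5_gb = weight_memory_gb(4.7e9)   # 4.7B-parameter T5 text encoder
mmdit_gb = weight_memory_gb(8e9)  # largest 8B-parameter SD 3 model

print(f"T5 encoder weights: ~{t5_gb:.1f} GB")   # ~9.4 GB
print(f"8B model weights:   ~{mmdit_gb:.1f} GB")  # ~16.0 GB
```

Under these assumptions the two sets of weights together would already exceed 24 GB, so skipping T5 at inference frees a large share of the memory budget.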

Netizens react

Netizens seem a little impatient with Stability AI for continuing to tease users without letting them try the model, and are urging the company to put it online soon for everyone to use.

After reading the technical report, netizens said that image generation now looks set to become the first field in which open source overtakes closed source!
