


Google optimizes diffusion models: Stable Diffusion runs on a Samsung phone and generates an image in under 12 seconds
Stable Diffusion is as well known in image generation as ChatGPT is among conversational large models. It can create realistic images from any text prompt in tens of seconds. Because Stable Diffusion has more than 1 billion parameters, and compute and memory on devices are limited, the model is mainly run in the cloud.
Without careful design and implementation, running these models on a device can lead to high latency, caused by the iterative denoising process, and to excessive memory consumption.
How to run Stable Diffusion on device has attracted a lot of research interest. Earlier, some researchers built an application that uses Stable Diffusion to generate images on the iPhone 14 Pro; it takes about one minute per image and uses approximately 2 GiB of application memory.
Apple has also published optimizations for this, generating a 512 × 512 image in about half a minute on iPhone, iPad, Mac and other devices. Qualcomm followed closely, running Stable Diffusion v1.5 on Android phones and generating 512 × 512 images in under 15 seconds.
Recently, in the Google paper "Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations", the researchers ran Stable Diffusion 1.4 on device with GPU acceleration and achieved state-of-the-art (SOTA) inference latency: on a Samsung S23 Ultra, generating a 512 × 512 image with 20 iterations takes only 11.5 seconds. Moreover, the work is not tied to one device; it is a general approach that can be applied to other large diffusion models as well.
This research opens up many possibilities for running generative AI locally on a phone, without a data connection or a cloud server. Stable Diffusion was only released last fall, and it can already run on devices, which shows how fast this field is moving.
Paper address: https://arxiv.org/pdf/2304.11267.pdf
To reach this generation speed, Google proposes several optimizations. Let's look at how they work.
Method introduction
This research proposes optimizations that speed up text-to-image generation with large diffusion models. The optimizations are designed for Stable Diffusion, but they also apply to other large diffusion models.
First, let's look at the main components of Stable Diffusion: a text embedder, noise generation, a denoising neural network, and an image decoder, as shown in Figure 1 below.
Specialized kernels: Group Norm and GELU
Group Normalization (GN) divides the channels of a feature map into smaller groups and normalizes each group independently, which makes GN less dependent on batch size and suitable for a range of batch sizes and network architectures. Instead of performing reshape, mean, variance, and normalization operations in sequence, the researchers designed a dedicated kernel in the form of a GPU shader that performs all of these operations in a single GPU command, without any intermediate tensors.
The Gaussian error linear unit (GELU), a commonly used activation function, involves many numerical computations, such as multiplication, addition, and the Gaussian error function. The study uses a dedicated shader that consolidates these computations, together with their accompanying split and multiplication operations, so that they execute in a single draw call.
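To make "all operations in one pass, no intermediate tensor" concrete, here is a minimal CPU sketch in plain Python. It is not Google's shader. It assumes a single sample laid out as `x[channel][spatial]`, omits GN's learnable scale and shift, and assumes the GELU case is the split-and-multiply pattern (often called GEGLU) used in Stable Diffusion's feed-forward layers. All function names are hypothetical.

```python
import math
from typing import List


def group_norm_fused(x: List[List[float]], num_groups: int, eps: float = 1e-5) -> List[List[float]]:
    """Group Normalization of one sample laid out as x[channel][spatial].

    Each group's mean and variance come from a single statistics sweep that
    accumulates the sum and the sum of squares; normalized values are then
    written straight to the output. No reshaped copy, mean tensor or variance
    tensor is created. Learnable scale/shift are omitted for brevity.
    """
    channels = len(x)
    if channels % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    per_group = channels // num_groups
    out: List[List[float]] = [[] for _ in range(channels)]
    for g in range(num_groups):
        group = range(g * per_group, (g + 1) * per_group)
        n, total, total_sq = 0, 0.0, 0.0
        for c in group:
            for val in x[c]:
                n += 1
                total += val
                total_sq += val * val
        mean = total / n
        var = max(total_sq / n - mean * mean, 0.0)  # clamp tiny negative rounding error
        inv_std = 1.0 / math.sqrt(var + eps)
        for c in group:
            out[c] = [(val - mean) * inv_std for val in x[c]]
    return out


def gelu(val: float) -> float:
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * val * (1.0 + math.erf(val / math.sqrt(2.0)))


def split_gelu_multiply_fused(h: List[float]) -> List[float]:
    """Split h into halves (a, b) and return a * GELU(b) in one loop,
    without materializing a, b or GELU(b) as separate lists."""
    if len(h) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(h) // 2
    return [h[i] * gelu(h[half + i]) for i in range(half)]
```

On a GPU, the same idea means one kernel reads the input once and writes the result once, instead of several kernels each writing a full intermediate tensor to memory. A production kernel might prefer Welford's algorithm over sum and sum of squares for better numerical stability.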
Improving the efficiency of the attention module
The text-to-image transformer in Stable Diffusion helps model conditional distributions, which is crucial for text-to-image generation. However, self- and cross-attention struggle with long sequences, because their memory and time costs grow quadratically with sequence length. To ease this computational bottleneck, the study proposes two optimizations. First, to avoid performing the entire softmax computation on a large matrix, the study uses a GPU shader that reduces the number of operations, which greatly reduces the memory footprint of intermediate tensors and the overall latency. The specific method is shown in Figure 2 below.
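One way to avoid storing the full softmax output is to split softmax into a reduction step (per-row max and sum of exponentials, just two scalars per row) and an element-wise step that is folded into the multiplication by V. The plain-Python sketch below illustrates that idea under this assumption; it is our reading of the technique, not the shader shown in Figure 2, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def row_softmax_stats(scores: List[float]) -> Tuple[float, float]:
    """Reduction step: return (row max, sum of exp(score - max)) for one row."""
    m = max(scores)
    z = sum(math.exp(s - m) for s in scores)
    return m, z


def attention_partially_fused(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Compute softmax(Q K^T / sqrt(d)) V without storing the softmax matrix.

    For each query row, the reduction step keeps only two scalars; the
    element-wise step (exp and divide) is merged into the multiplication by V,
    so each probability is consumed as soon as it is computed.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out: Matrix = []
    for qi in q:
        # One row of the score matrix A = Q K^T (still computed here).
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m, z = row_softmax_stats(scores)
        row = [0.0] * len(v[0])
        for s, vj in zip(scores, v):
            p = math.exp(s - m) / z  # softmax entry, used immediately, never stored
            for c, val in enumerate(vj):
                row[c] += p * val
        out.append(row)
    return out
```

Subtracting the row max before exponentiating is the standard trick for numerical stability and does not change the softmax result.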
Second, the study uses FlashAttention [7], an IO-aware exact attention algorithm that needs fewer accesses to high-bandwidth memory (HBM) than standard attention, improving overall efficiency.
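The core idea behind FlashAttention is tiling plus an "online" softmax: keys and values are processed block by block, and a running max, running denominator and output accumulator are rescaled whenever a new block raises the max. The single-threaded Python sketch below shows only that idea; it is not the FlashAttention kernel or Google's implementation, and `block_size` is a hypothetical stand-in for the tile that would fit in fast on-chip memory.

```python
import math
from typing import List

Matrix = List[List[float]]


def flash_attention(q: Matrix, k: Matrix, v: Matrix, block_size: int = 2) -> Matrix:
    """Exact attention computed block by block over K/V with an online softmax.

    Per query row, only a running max m, a running denominator l and an output
    accumulator are kept, so the full score matrix is never stored.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out: Matrix = []
    for qi in q:
        m = float("-inf")  # running max of the scores seen so far
        l = 0.0            # running sum of exp(score - m)
        acc = [0.0] * dv   # running sum of exp(score - m) * v_j
        for start in range(0, len(k), block_size):
            # On a GPU, these blocks would be loaded once into on-chip memory.
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            s = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            m_new = max(m, max(s))
            corr = math.exp(m - m_new)  # rescale earlier partial sums to the new max
            l *= corr
            acc = [a * corr for a in acc]
            for sj, vj in zip(s, v_blk):
                p = math.exp(sj - m_new)
                l += p
                for c in range(dv):
                    acc[c] += p * vj[c]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

The result matches standard attention exactly (up to floating-point rounding), which is why FlashAttention is described as exact rather than approximate; the savings come from fewer reads and writes to HBM.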
Winograd Convolution
Winograd convolution converts the convolution operation into a series of matrix multiplications. This reduces the number of multiplications and improves computational efficiency. However, it also increases memory consumption and numerical error, especially with larger tiles.
The backbone of Stable Diffusion relies heavily on 3 × 3 convolutional layers, especially in the image decoder, where they make up 90% of the layers. The study analyzes this in depth to explore the potential benefits of applying Winograd with different tile sizes to 3 × 3 convolutions. It finds that a 4 × 4 tile size is optimal, as it offers the best balance between computational efficiency and memory utilization.
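The smallest Winograd case, F(2,3), shows where the savings come from: two outputs of a 3-tap filter are computed with 4 multiplications instead of 6. The paper works with 2D tiles on 3 × 3 kernels; this 1D sketch only illustrates the principle (2D variants apply the same transforms along both axes), and the function names are hypothetical. As in CNN layers, "convolution" here means cross-correlation.

```python
from typing import List


def winograd_f23_tile(d: List[float], g: List[float]) -> List[float]:
    """Winograd F(2,3): two outputs of a 3-tap filter from a 4-element input
    tile, using 4 multiplications instead of the 6 of the direct method."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform; in a real kernel this is precomputed once per filter.
    u1 = (g0 + g1 + g2) / 2.0
    u2 = (g0 - g1 + g2) / 2.0
    # The four element-wise products.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * g2
    # Output transform: only additions and subtractions.
    return [m1 + m2 + m3, m2 - m3 - m4]


def conv1d_direct(x: List[float], g: List[float]) -> List[float]:
    """Reference: valid 1D cross-correlation with a 3-tap filter."""
    return [sum(x[i + t] * g[t] for t in range(3)) for i in range(len(x) - 2)]


def conv1d_winograd(x: List[float], g: List[float]) -> List[float]:
    """Valid 1D cross-correlation computed tile by tile with F(2,3)."""
    n_out = len(x) - 2
    y: List[float] = []
    i = 0
    while i + 1 < n_out:  # full tiles: 4 inputs -> 2 outputs, stride 2
        y.extend(winograd_f23_tile(x[i:i + 4], g))
        i += 2
    if i < n_out:  # a leftover single output is computed directly
        y.append(sum(x[i + t] * g[t] for t in range(3)))
    return y
```

Larger tiles save more multiplications per output, but their transform matrices contain larger constants and the transformed tiles take more memory, which is the trade-off behind the study's choice of a 4 × 4 tile.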
The study was benchmarked on two devices: the Samsung S23 Ultra (Adreno 740) and the iPhone 14 Pro Max (A16). The benchmark results are shown in Table 1 below:
Latency decreases step by step as each optimization is enabled, meaning images are generated faster. Compared to the baseline, latency drops by 52.2% on the Samsung S23 Ultra and by 32.9% on the iPhone 14 Pro Max. The study also measures end-to-end latency on the Samsung S23 Ultra: generating a 512 × 512 image with 20 denoising iterations takes less than 12 seconds, a SOTA result.
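A latency reduction is easier to compare as a speedup factor. Assuming "latency reduction" means (baseline - optimized) / baseline, the optimized latency is (1 - r) times the baseline, so the speedup is 1 / (1 - r). The hypothetical helper below applies this to the two figures reported above.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Convert a latency reduction in percent into a speedup factor.

    New latency = (1 - r) * old latency, so speedup = 1 / (1 - r).
    """
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


for device, pct in [("Samsung S23 Ultra", 52.2), ("iPhone 14 Pro Max", 32.9)]:
    print(f"{device}: {pct}% lower latency -> {speedup_from_reduction(pct):.2f}x faster")
```

This gives about 2.09x for the Samsung S23 Ultra and about 1.49x for the iPhone 14 Pro Max, so the optimizations roughly double generation speed on the Samsung device.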
Small devices can now run their own generative AI models. What does this mean for the future? We can expect a wave of on-device generative AI.