Stable Diffusion is as well known in image generation as ChatGPT is among large conversational models. It can create a realistic image from any input text in tens of seconds. Because Stable Diffusion has more than 1 billion parameters, and because devices have limited compute and memory resources, the model is primarily run in the cloud.
Without careful design and implementation, running these models on a device may result in increased latency due to the iterative denoising process and excessive memory consumption.
How to run Stable Diffusion on-device has attracted considerable research interest. Previously, some researchers developed an application that uses Stable Diffusion to generate images on the iPhone 14 Pro; it takes about one minute and uses approximately 2 GiB of application memory.
Apple has also optimized for this, generating a 512 × 512 image in about half a minute on iPhone, iPad, Mac and other devices. Qualcomm followed closely, running Stable Diffusion v1.5 on Android phones and generating 512 × 512 images in less than 15 seconds.
Recently, in the paper "Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations", Google researchers ran Stable Diffusion 1.4 on-device using GPU-aware optimizations and achieved SOTA inference latency (on a Samsung S23 Ultra, generating a 512 × 512 image with 20 iterations takes only 11.5 seconds). Furthermore, the study is not specific to one device; it is a general approach that is potentially applicable to all diffusion models.
This research opens up many possibilities for running generative AI locally on your phone, without a data connection or cloud server. Stable Diffusion was only released last fall, and it can already be plugged into devices and run today, which shows how fast this field is developing.
Paper address: https://arxiv.org/pdf/2304.11267.pdf
To achieve this generation speed, Google proposes several optimizations. Let's take a look at how they work.
Method introduction
This research proposes optimization methods that speed up text-to-image generation with large diffusion models. The optimizations are described for Stable Diffusion, but they are also suitable for other large diffusion models.
First, let's look at the main components of Stable Diffusion: the text embedder, noise generation, the denoising neural network, and the image decoder, as shown in Figure 1 below.
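To make the division of work concrete, here is a toy sketch of how the four components fit together. Every function below is a hypothetical placeholder invented for illustration (none of it comes from the paper or from any Stable Diffusion library); the only point is that the denoising network runs once per iteration, while the other three stages run once per image.

```python
import random


def embed_text(prompt):
    """Hypothetical text embedder: maps each word to one number."""
    return [float(len(word)) for word in prompt.split()]


def generate_noise(size, seed):
    """Noise generation: the random latent that denoising starts from."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def denoise_step(latent, embedding):
    """Placeholder for one pass of the denoising network.

    A real model runs a large neural network here; this stand-in just
    pulls each value halfway toward the mean of the text embedding.
    """
    target = sum(embedding) / len(embedding)
    return [x + 0.5 * (target - x) for x in latent]


def decode_image(latent):
    """Placeholder image decoder: rounds and clamps values to 0..255."""
    return [max(0, min(255, round(x))) for x in latent]


def generate(prompt, steps=20, latent_size=8, seed=0):
    embedding = embed_text(prompt)               # runs once
    latent = generate_noise(latent_size, seed)   # runs once
    denoiser_calls = 0
    for _ in range(steps):                       # runs `steps` times
        latent = denoise_step(latent, embedding)
        denoiser_calls += 1
    return decode_image(latent), denoiser_calls  # decoder runs once
```

With the default of 20 steps, `generate("a photo of a cat")` calls the denoiser 20 times and each of the other stages once, which is why per-iteration cost matters so much for latency.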
Then let's take a closer look at the three optimization methods proposed in this study.
Specialized kernel: Group Norm and GELU
Group Normalization (GN) divides the channels of a feature map into smaller groups and normalizes each group independently, which makes GN less dependent on batch size and better suited to a range of batch sizes and network architectures. Instead of performing the reshape, mean, variance, and normalization operations in sequence, the study designs a dedicated kernel in the form of a GPU shader that performs all of them in a single GPU command, without any intermediate tensor.
The Gaussian Error Linear Unit (GELU), a commonly used activation function, involves many numerical computations, such as multiplications, additions, and the Gaussian error function. The study uses a dedicated shader to consolidate these computations and their accompanying split and multiplication operations so that they execute in a single draw call.
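As a rough CPU-side illustration of the two operations, the sketch below computes group normalization without building a reshaped copy of each group, and evaluates GELU in its exact erf form. This is plain Python, not the GPU shader from the study; the function names, the list-of-lists layout, and the omission of the learned scale and shift parameters are all assumptions made to keep the example short.

```python
import math


def group_norm(channels, num_groups, eps=1e-5):
    """Normalize a feature map given as a list of channels (lists of floats).

    Each group's mean and variance come from one pass that accumulates
    the sum and the sum of squares, so no intermediate tensor is built.
    """
    if len(channels) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    per_group = len(channels) // num_groups
    out = []
    for start in range(0, len(channels), per_group):
        group = channels[start:start + per_group]
        count, total, total_sq = 0, 0.0, 0.0
        for channel in group:
            for x in channel:
                count += 1
                total += x
                total_sq += x * x
        mean = total / count
        # E[x^2] - E[x]^2; clamp at zero to guard against rounding error.
        var = max(total_sq / count - mean * mean, 0.0)
        scale = 1.0 / math.sqrt(var + eps)
        for channel in group:
            out.append([(x - mean) * scale for x in channel])
    return out


def gelu(x):
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

For example, `group_norm([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [10.0, 10.0]], num_groups=2)` normalizes the first two channels together and the last two together. The sum-of-squares variance is convenient for a single pass but can lose precision when values are large relative to their spread.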
Improving the efficiency of the attention module
The text-to-image transformer in Stable Diffusion helps model the conditional distribution, which is crucial for text-to-image generation. However, self- and cross-attention mechanisms struggle with long sequences because of their memory and time complexity. The study therefore proposes two optimizations to relieve this computational bottleneck. On the one hand, to avoid running the full softmax computation over a large matrix, the study uses a GPU shader that reduces the number of operations, which greatly lowers the memory footprint of the intermediate tensors and the overall latency. The specific method is shown in Figure 2 below.
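One way to picture avoiding a full softmax over a large matrix is to reduce each row of attention scores to two small values (its maximum and its sum of exponentials) and then fold the normalization into the multiplication with the value matrix. The sketch below is an illustration under that assumption, written in plain Python with hypothetical names; it is not the shader described in the paper.

```python
import math


def attention_rows_fused(scores, values):
    """Compute softmax(scores) @ values without storing softmax(scores).

    scores: L x S matrix as a list of rows; values: S x D matrix.
    Per row, only the max and the sum of exponentials are kept.
    """
    dim = len(values[0])
    out = []
    for row in scores:
        row_max = max(row)                                 # reduction 1
        row_sum = sum(math.exp(s - row_max) for s in row)  # reduction 2
        acc = [0.0] * dim
        for s, v in zip(row, values):
            # The softmax weight is recomputed and consumed immediately,
            # so the normalized L x S matrix is never materialized.
            w = math.exp(s - row_max) / row_sum
            for d in range(dim):
                acc[d] += w * v[d]
        out.append(acc)
    return out
```

Subtracting the row maximum before exponentiating keeps the computation stable even for very large scores, such as a row of `[1000.0, 1000.0]`.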
On the other hand, the study uses FlashAttention [7], an IO-aware exact attention algorithm that requires fewer accesses to high-bandwidth memory (HBM) than standard attention, improving overall efficiency.
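The core idea that lets FlashAttention stay exact while processing keys and values in blocks is an online softmax: a running maximum and running sum are updated per block, and earlier partial results are rescaled when the maximum changes. The following single-query sketch shows only that recurrence in plain Python; the function names and block size are illustrative assumptions, and nothing here models GPU memory or the actual kernel.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_naive(q, keys, values):
    """Reference: materialize all scores, then softmax, then weight values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, visiting keys/values one block at a time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        block_k = keys[start:start + block_size]
        block_v = values[start:start + block_size]
        scores = [dot(q, k) for k in block_k]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        # On the first block this is exp(-inf) = 0.0.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, block_v):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same output up to floating-point rounding, but the tiled version never holds more than `block_size` scores at once, which is the property that reduces traffic to slower memory on a GPU.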
Winograd Convolution
Winograd convolution converts the convolution operation into a series of matrix multiplications. This reduces the number of multiplications and improves computational efficiency. However, it also increases memory consumption and numerical error, especially when larger tiles are used.
The backbone of Stable Diffusion relies heavily on 3 × 3 convolutional layers, especially in the image decoder, where they account for 90% of the layers. The study analyzes the potential benefits of applying Winograd with different tile sizes to these 3 × 3 convolutions and finds that a 4 × 4 tile size is optimal, as it provides the best balance between computational efficiency and memory utilization.
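To see where the savings come from, consider the smallest textbook case, the one-dimensional Winograd F(2, 3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. This is a minimal sketch with hypothetical function names, not the 2D 4 × 4 configuration evaluated in the study.

```python
def conv3_direct(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def transform_filter(g):
    """Filter transform: computed once per filter, reused for every tile."""
    return [
        g[0],
        (g[0] + g[1] + g[2]) / 2,
        (g[0] - g[1] + g[2]) / 2,
        g[2],
    ]


def conv3_winograd(d, u):
    """Same two outputs using only 4 multiplications per tile."""
    m1 = (d[0] - d[2]) * u[0]
    m2 = (d[1] + d[2]) * u[1]
    m3 = (d[2] - d[1]) * u[2]
    m4 = (d[1] - d[3]) * u[3]
    return [m1 + m2 + m3, m2 - m3 - m4]
```

Calling `conv3_winograd(d, transform_filter(g))` matches `conv3_direct(d, g)` up to rounding. The extra additions and the transformed filter storage hint at the memory and numerical-error costs mentioned above, which grow with tile size.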
Experimentation
The study was benchmarked on two devices: Samsung S23 Ultra (Adreno 740) and iPhone 14 Pro Max (A16). The benchmark results are shown in Table 1 below:
As each optimization is enabled, latency decreases step by step (that is, images take less time to generate). Compared to the baseline, latency falls by 52.2% on the Samsung S23 Ultra and by 32.9% on the iPhone 14 Pro Max. The study also evaluates end-to-end latency on the Samsung S23 Ultra: generating a 512 × 512 pixel image with 20 denoising iteration steps takes less than 12 seconds, a SOTA result.
Small devices can run their own generative artificial intelligence models. What does this mean for the future? We can expect a wave.