Is it difficult to run Stable Diffusion on an iPhone? In the article we introduce today, the author gives the answer: it is not difficult, and the iPhone still has about 50% of its performance to spare.
As we all know, every year Apple launches a new iPhone that claims to be faster and better in every way, thanks in large part to rapid advances in vision models and image sensors. Take photography as an example: if you go back 10 years, could you take high-quality pictures with an iPhone? The answer was no, because technology develops gradually, and 10 years was enough time for mobile photography to mature.
Because of this progressive pattern of development, there are always periods when some programs are barely usable even on the best computing hardware. But these new programs, and the new scenarios they enable, attract the attention of some users, and people are willing to study them.
The author of this article is one of them. Over the past 3 weeks, he developed an application that can generate (summon) images with Stable Diffusion and then lets you edit them however you like. The app takes only about a minute to generate an image on the latest iPhone 14 Pro, uses roughly 2GiB of app memory, and requires about 2GiB of initial data to be downloaded before it can start.
App store link: https://apps.apple.com/us/app/draw-things-ai-generation/id6444050820
This result sparked a lot of discussion among netizens. Some began to worry about phone power consumption and joked: "This is cool, but it seems like a great way to drain a phone battery."
"I have never been so happy to feel the heat of my iPhone."
"In the cold winter, you can use your phone as a hand warmer."
However, while everyone jokes about the phone heating up, they also rate this work very highly.
"This is incredible. It takes about 45 seconds to generate a complete image on my iPhone SE 3, which is almost as fast as the original version running on my M1 Pro MacBook!"
How is this done? Next, let’s take a look at the author’s implementation process:
If you want to run Stable Diffusion on an iPhone and still have performance to spare, a major challenge is that the program needs about 6GiB of RAM. 6GiB sounds like a lot, but iOS will kill your app if it uses more than 2.8GiB on a 6GiB device, or more than 2GiB on a 4GiB device.
So how much memory does the Stable Diffusion model require for inference?
This starts with the structure of the model. A Stable Diffusion model usually consists of 4 parts: 1. a text encoder, which generates text feature vectors to guide image generation; 2. an optional image encoder, which encodes images into latent space (for image-to-image generation); 3. a denoiser model, which gradually removes noise from the latent representation of the image; 4. an image decoder, which decodes the image from its latent representation.
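The four stages above can be sketched as a simple data-flow pipeline. The code below is purely illustrative: the "models" are dummy stand-ins (the shapes match the common SD v1.x configuration, but none of this is the app's actual code).

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(prompt: str) -> np.ndarray:
    # Stand-in for the CLIP text encoder; real output is a (77, 768) feature matrix.
    return rng.standard_normal((77, 768)).astype(np.float16)

def denoiser(latent: np.ndarray, text_features: np.ndarray, steps: int = 4) -> np.ndarray:
    # Stand-in for the UNet, which runs once per sampling step; here we just damp noise.
    for _ in range(steps):
        latent = latent * np.float16(0.9)  # placeholder for one denoising step
    return latent

def image_decoder(latent: np.ndarray) -> np.ndarray:
    # Stand-in for the VAE decoder: latent (4, 64, 64) -> image-sized (3, 512, 512).
    return np.repeat(np.repeat(latent[:3], 8, axis=1), 8, axis=2)

latent = rng.standard_normal((4, 64, 64)).astype(np.float16)  # random starting noise
image = image_decoder(denoiser(latent, text_encoder("a cat")))
print(image.shape)  # (3, 512, 512)
```

The key property for memory planning is visible even in this sketch: the text encoder and decoder each run once, while the denoiser runs once per sampling step.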
Modules 1, 2, and 4 run only once during inference and together require at most about 1GiB. The denoiser model takes about 3.2GiB (in full precision) and must be executed many times, so the author wants to keep it resident in RAM for longer.
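The 3.2GiB figure is easy to sanity-check: the SD v1.x denoiser (UNet) has roughly 860M parameters (an approximate public figure, assumed here), and each full-precision parameter takes 4 bytes:

```python
# Back-of-envelope check of the denoiser's weight footprint.
params = 860_000_000           # ~860M parameters (approximate)
fp32_gib = params * 4 / 2**30  # full precision: 4 bytes per parameter
fp16_gib = params * 2 / 2**30  # half precision: 2 bytes per parameter
print(round(fp32_gib, 1), round(fp16_gib, 1))  # 3.2 1.6
```

Halving the bytes per parameter is what produces the 1.6GiB half-precision figure used later in the article.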
The original Stable Diffusion model required close to 10GiB to perform single-image inference. Between a single input (2x4x64x64) and output (2x4x64x64), there are many intermediate layer outputs. Not all of them can be reused immediately; some must be retained for later use (residual connections).
For some time, researchers have been optimizing PyTorch Stable Diffusion, for example by reserving temporary storage for the NVIDIA CUDNN and CUBLAS libraries that PyTorch uses, all to reduce memory usage. As a result, the Stable Diffusion model can run on GPUs with as little as 4GiB of memory.
But that still exceeded the author's budget, so he turned his attention to Apple hardware and its optimization.
At first, the author planned around the weights: 3.2GiB in full precision, or 1.6GiB in half precision. To avoid triggering Apple's OOM kill (iOS forcibly terminates an app once its memory usage exceeds the system's per-app limit), that left him only about 500MiB of headroom to work with.
The first question is, what is the size of each intermediate output?
It turns out that most of them are relatively small, under 6MiB each (2x320x64x64). The framework the author uses (s4nnc) can reasonably pack them into less than 50MiB for reuse.
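The "under 6MiB" figure checks out: a 2x320x64x64 tensor at FP16 (2 bytes per element) works out to exactly 5MiB.

```python
# Sanity-check of the intermediate-output size quoted above.
elems = 2 * 320 * 64 * 64       # one intermediate tensor's element count
size_mib = elems * 2 / 2**20    # FP16: 2 bytes per element
print(size_mib)  # 5.0
```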
It is worth mentioning that the denoiser has a self-attention mechanism that takes its own latent representation of the image as input. During the self-attention computation, there is a batched matrix of size 16x4096x4096; applying softmax to it touches about 500MiB in FP16, and this can be done "inplace", meaning it can safely overwrite its input without corrupting the result. Fortunately, both Apple's and NVIDIA's low-level libraries provide inplace softmax implementations, whereas higher-level libraries such as PyTorch do not.
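Both claims in that paragraph can be illustrated in a few lines of NumPy. The sketch below is not the app's code, just a demonstration: the size arithmetic for the 16x4096x4096 FP16 attention matrix, and a softmax whose every step writes back into the input buffer, so no second ~500MiB tensor is allocated.

```python
import numpy as np

# Size of the batched attention matrix quoted above, at FP16 (2 bytes/element).
attn_mib = 16 * 4096 * 4096 * 2 / 2**20
print(attn_mib)  # 512.0 (MiB), i.e. the "about 500MiB" figure

def softmax_inplace(x: np.ndarray) -> np.ndarray:
    # Every operation writes back into x's own buffer.
    x -= x.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    np.exp(x, out=x)                     # exponentiate in place
    x /= x.sum(axis=-1, keepdims=True)   # normalize each row in place
    return x

a = np.array([[1.0, 2.0, 3.0]])
addr = a.ctypes.data                     # address of a's data buffer
softmax_inplace(a)
assert a.ctypes.data == addr             # same buffer: no extra allocation
```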
So can inference really be done with about 1.6GiB of weights plus roughly 550MiB of working memory?
On Apple hardware, a common choice for implementing a neural network backend is to use the MPSGraph framework. So the author first tried to use MPSGraph to implement all neural network operations. The peak memory usage at FP16 precision is about 6GiB, which is obviously much more than the expected memory usage. What's going on?
The author analyzed the reasons in detail. First, he was not using MPSGraph the way it is commonly used (which resembles TensorFlow): encode the entire computation graph up front, feed in input/output tensors, let the framework handle internal allocations, and submit the whole graph for execution.
Instead, the author uses MPSGraph much like PyTorch, as an op-by-op execution engine. To perform an inference task, many compiled MPSGraphExecutables are executed on the Metal command queue, and each may hold some intermediate allocations. If submitted all at once, these commands hold their allocated memory until they finish executing.
A simple way to solve this is to throttle submission; there is no need to submit all commands at once. In fact, Metal allows at most 64 concurrent submissions per queue. The author switched to submitting 8 operations at a time, which reduced peak memory to 4GiB.
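The effect of throttled submission can be sketched abstractly. In the hypothetical code below, `run_op` stands in for committing one compiled MPSGraphExecutable; waiting at the end of each batch lets that batch's buffers be released before the next batch starts, capping the number of live allocations.

```python
def run_op(op):
    # Placeholder for one submitted command holding an intermediate buffer.
    return object()

def run_in_batches(ops, batch_size=8):
    peak_in_flight = 0
    for i in range(0, len(ops), batch_size):
        live = [run_op(op) for op in ops[i:i + batch_size]]  # buffers live during the batch
        peak_in_flight = max(peak_in_flight, len(live))
        # Here the real code would wait for the batch to complete
        # (the waitUntilCompleted equivalent); `live` buffers are then freed.
    return peak_in_flight

print(run_in_batches(list(range(20))))  # 8 — not 20 — buffers in flight at peak
```

Peak memory now scales with the batch size rather than with the total number of operations in the graph.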
However, that's still 2 GiB more than the iPhone can handle.
To compute self-attention with CUDA, the original Stable Diffusion code uses a common trick: permutation instead of transposition. This works because CUBLAS can operate directly on permuted strided tensors, avoiding the dedicated memory needed to materialize the transpose.
But MPSGraph has no strided-tensor support: a permuted tensor will be transposed internally anyway, which requires an intermediate allocation. By transposing explicitly, the allocation is handled by the higher-level layer, avoiding MPSGraph's internal inefficiency. With this trick, memory usage dropped close to 3GiB.
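The distinction between a permuted view and a materialized transpose is the same one NumPy exposes, which makes for a convenient illustration (NumPy stands in for the strided-tensor behavior here; it is not what the app uses):

```python
import numpy as np

x = np.zeros((16, 4096, 64), dtype=np.float16)
view = x.transpose(0, 2, 1)           # permutation: only strides change, shares x's buffer
copied = np.ascontiguousarray(view)   # materialized transpose: new allocation + data copy
print(view.base is x, copied.base is None)  # True True
```

A library that consumes strided tensors (as CUBLAS does) can work from `view` directly; one that cannot (as MPSGraph's kernels here) effectively forces the `copied` path, and the extra allocation with it.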
It also turns out that, starting with iOS 16.0, MPSGraph no longer makes optimal allocation decisions for softmax. Even when the input and output tensors point to the same data, MPSGraph allocates an extra output tensor and then copies the result back to the intended location.
The author found that using the Metal Performance Shaders alternative met the requirements perfectly and reduced the memory usage to 2.5GiB without any performance degradation.
On the other hand, MPSGraph's GEMM kernel requires internal transposition. Explicit transposes don't help here either, since they are not "inplace" operations at the higher level, and for that particular 500MiB tensor the extra allocation is unavoidable. By switching to Metal Performance Shaders, the author reclaimed another 500MiB at a performance cost of about 1%, finally bringing memory usage down to the target 2GiB.