Hardware requirements keep falling, and generation keeps getting faster.

As a pioneer of text-to-image generation, Stability AI not only leads the trend but also keeps making breakthroughs in model quality. This time, the breakthrough is in price/performance.
Just a few days ago, Stability AI made another move: a research preview of Stable Cascade. This text-to-image model introduces a three-stage architecture that sets new benchmarks for quality, flexibility, fine-tuning, and efficiency, with a focus on further lowering hardware barriers. Stability AI has also released training and inference code that allows further customization of the model and its outputs, and the model is available for inference through the diffusers library. It is released under a non-commercial license, permitting non-commercial use only.
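As a minimal sketch of what inference through diffusers looks like, the two-pipeline pattern below follows the research-preview integration (`StableCascadePriorPipeline` for Stage C, `StableCascadeDecoderPipeline` for Stages A and B); since this is a research preview, check the current diffusers documentation before relying on exact names or arguments, and the prompt here is just an example:

```python
# Stage C (prior) turns text into image embeddings; Stages A+B (decoder)
# turn those embeddings into pixels. Checkpoint IDs follow the release.
PRIOR_ID = "stabilityai/stable-cascade-prior"   # Stage C
DECODER_ID = "stabilityai/stable-cascade"       # Stages A + B
PROMPT = "an astronaut riding a horse, photorealistic"

def generate(prompt: str = PROMPT):
    # Imports deferred so the module loads without a GPU environment.
    import torch
    from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

    prior = StableCascadePriorPipeline.from_pretrained(
        PRIOR_ID, torch_dtype=torch.bfloat16
    ).to("cuda")
    decoder = StableCascadeDecoderPipeline.from_pretrained(
        DECODER_ID, torch_dtype=torch.float16
    ).to("cuda")

    # Stage C: text -> compact image embedding (the highly compressed latent).
    prior_out = prior(prompt=prompt, height=1024, width=1024,
                      guidance_scale=4.0, num_inference_steps=20)
    # Stages B + A: embedding -> full-resolution image.
    images = decoder(image_embeddings=prior_out.image_embeddings.to(torch.float16),
                     prompt=prompt, guidance_scale=0.0,
                     num_inference_steps=10, output_type="pil").images
    return images[0]

if __name__ == "__main__":
    generate().save("stable_cascade_sample.png")
```

Note the step split matches the speed claims above: most steps run in the small latent space (Stage C), and only a few run in the decoder.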
## Source: https://twitter.com/multimodalart/status/1757391981074903446

Stable Cascade generates images extremely quickly. X user @GozukaraFurkan posted that it requires only about 9GB of GPU memory while still maintaining good speed.
## Source: https://twitter.com/skirano/status/1757479638324883753

Users experimenting with the model found that it is noticeably better at composition and detail, and that text rendering has improved considerably: short words and phrases come out accurate most of the time, longer sentences succeed with some probability (English only), and the generated text blends well into the image.
## Picture source: https://twitter.com/ZHOZHO672070/status/1757779330443215065
User @AIWarper tried a few different artist style tests.
Prompt: Nightmare on Elm Street. The artist style references are: Makoto Shinkai (top left), Tomer Hanuka (bottom left), Raphael Kirchner (top right), Takato Yamamoto (bottom right).

When it generates faces, however, the skin detail is weak: characters look as though an aggressive beauty filter has smoothed their skin flat.
## Source: https://twitter.com/vitor_dlucca/status/1757511080287355093
## Technical Details

Stable Cascade differs from the Stable Diffusion family in that it is built on a pipeline of three distinct models: Stages A, B, and C. This architecture compresses images hierarchically, exploiting a highly compressed latent space to achieve superior output. How do these parts fit together?
The latent generator stage (Stage C) converts the user's input into a compact 24x24 latent representation, which is then passed to the latent decoder stages (Stages A and B). These decode the latent back into a full image, playing a role similar to the VAE in Stable Diffusion but achieving a much higher compression ratio.
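To put that compression in perspective, a quick back-of-the-envelope comparison (assuming a 1024x1024 input; the 8x-per-side figure for Stable Diffusion's VAE is its standard downsampling factor):

```python
# Per-side compression: Stage C's 24x24 latent vs. Stable Diffusion's VAE.
image_side = 1024
cascade_latent_side = 24              # Stage C's 24x24 latent grid
sd_latent_side = image_side // 8      # SD's VAE downsamples 8x -> 128

cascade_factor = image_side / cascade_latent_side  # ~42.7x per side
sd_factor = image_side / sd_latent_side            # 8x per side
print(f"Stable Cascade: ~{cascade_factor:.1f}x, SD VAE: {sd_factor:.0f}x")
```

That roughly 42x-per-side latent is why diffusion steps in Stage C are so much cheaper than steps taken at SDXL's latent resolution.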
By decoupling text-conditional generation (Stage C) from decoding into high-resolution pixel space (Stages A and B), additional training or fine-tuning, including ControlNets and LoRAs, can be done on Stage C alone, at as little as one sixteenth of the cost of training a similarly sized Stable Diffusion model. Stages A and B can optionally be fine-tuned for extra control, but this is comparable to fine-tuning the VAE in a Stable Diffusion model, and in most cases the benefit is minimal. For most purposes, Stability AI therefore recommends training only Stage C and using Stages A and B as released.
Stages C and B each come in two sizes: 1B and 3.6B parameters for Stage C, and 700M and 1.5B parameters for Stage B. The 3.6B model is recommended for Stage C, as it produces the highest-quality output, but the 1B-parameter version is available for those who want minimal hardware requirements. For Stage B, both versions achieve good results, though the 1.5B-parameter one reconstructs fine detail better. Thanks to Stable Cascade's modular design, the expected VRAM requirement for inference can be kept to about 20GB; this can be reduced further by using the smaller variants, with the caveat that final output quality may also drop.
## Comparison
In the evaluation, Stable Cascade came out best in prompt alignment and aesthetic quality against almost all of the models it was compared with. The figure below shows the results of a human evaluation using a mix of parti-prompts and aesthetic prompts:
Stable Cascade (30 inference steps) compared with Playground v2 (50 inference steps), SDXL (50 inference steps), SDXL Turbo (1 inference step), and Würstchen v2 (30 inference steps).
Inference speed comparison between Stable Cascade, SDXL, Playground V2, and SDXL Turbo. The decoupled architecture and more highly compressed latent space pay off here: even though Stable Cascade's largest model has 1.4B more parameters than Stable Diffusion XL, it still achieves faster inference times.
## Additional Features

Beyond standard text-to-image generation, Stable Cascade can also produce image variations and perform image-to-image generation. For image variations, CLIP is used to extract an embedding from a given image, which is then fed back to the model. The image below shows sample output: the original is on the left, and the four images to its right are the generated variants.
Image-to-image works by simply adding noise to a given image and then using it as the starting point for generation. Below is an example where noise is added to the image on the left, and generation proceeds from there.

## Code for training, fine-tuning, ControlNet and LoRA
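The noise-then-generate idea behind image-to-image can be illustrated with a toy NumPy sketch. This is a plain linear blend for intuition only, not the actual diffusion forward process (which scales image and noise by schedule-dependent coefficients), and `noise_image` is a name made up for this example:

```python
import numpy as np

def noise_image(image: np.ndarray, strength: float, rng=None) -> np.ndarray:
    """Blend an image with Gaussian noise; the sampler would then start
    generation from the result. strength in [0, 1]: higher means noisier,
    so generation departs further from the source image."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal(image.shape).astype(image.dtype)
    return (1.0 - strength) * image + strength * noise

# Stand-in for a normalized image (or a Stage C latent).
img = np.zeros((24, 24, 3), dtype=np.float32)
noised = noise_image(img, strength=0.8)
print(noised.shape)  # (24, 24, 3)
```

With strength=0 the source image is returned unchanged; with strength near 1 the starting point is almost pure noise, approaching plain text-to-image generation.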
With the release of Stable Cascade, Stability AI is also releasing all of the code for training, fine-tuning, ControlNet, and LoRA, to lower the barrier to further experimentation with this architecture. Here are some of the ControlNets released alongside the model:

Inpainting/Outpainting: Input an image plus a mask paired with a text prompt. The model then fills in the masked portion of the image based on the provided text prompt.
Canny Edge: Generates a new image by following the edges of an existing image fed to the model. According to Stability AI's testing, this also works with sketches.

Above is the sketch given to the model as input; below are the output results.
2x super-resolution: Upscales an image to twice its side length, e.g. turning a 1024 x 1024 image into a 2048 x 2048 output; it can also be applied to the latent representation produced by Stage C.

How do you like this price/performance ratio?