Diffusion models have recently surpassed GANs and autoregressive models to become the mainstream choice for generative modeling. Diffusion-based text-to-image models such as SD, SDXL, Midjourney, and Imagen have demonstrated a striking ability to generate high-quality images. These models are typically trained at a specific resolution so that training remains efficient and accurate on existing hardware.
Figure 1: Comparison of different methods used to generate 2048×2048 images under SDXL 1.0. [1]
When these diffusion models generate images beyond their training resolution, pattern duplication and severe artifacts often occur, as shown on the far left of Figure 1.
In a recent paper, researchers from the Chinese University of Hong Kong SenseTime Joint Laboratory and other institutions conducted an in-depth study of the convolutional layers of the UNet structure commonly used in diffusion models and, from the perspective of frequency-domain analysis, proposed FouriScale, as shown in Figure 2.
Figure 2: Schematic diagram of FouriScale's process (orange line) to ensure consistency across resolutions.
FouriScale replaces the original convolutional layers of the pre-trained diffusion model with a dilated convolution operation and a low-pass filtering operation, which achieves structure and scale consistency across resolutions. Combined with a "fill then crop" strategy, the method can flexibly generate images of different sizes and aspect ratios. Furthermore, when FouriScale is used as guidance, the method preserves complete image structure and high image quality when generating high-resolution images of any size. FouriScale does not require any offline pre-computation and has good compatibility and scalability.
Quantitative and qualitative experimental results demonstrate that FouriScale achieves significant improvements in generating high-resolution images using pre-trained diffusion models.
1. Atrous convolution ensures structural consistency across resolutions
The denoising network of a diffusion model is usually trained on images or latent representations at a specific resolution, and it usually adopts a U-Net structure. The authors aim to reuse the parameters of this network at inference time to generate higher-resolution images without retraining. To avoid structural distortion at the inference resolution, they try to establish structural consistency between the default and the high resolution. For a convolutional layer in the U-Net, structural consistency can be expressed as:
where k is the original convolution kernel and k' is a new convolution kernel customized for the larger resolution. The frequency-domain representation of spatial downsampling is as follows:
Formula (3) can be written as:
This formula shows that the Fourier spectrum of the ideal convolution kernel k' should be an s×s tiling of the Fourier spectrum of the kernel k. In other words, the Fourier spectrum of k' should repeat periodically, and the repeating pattern is the Fourier spectrum of k.
The widely used dilated (atrous) convolution meets exactly this requirement. The frequency-domain periodicity of atrous convolution can be expressed by the following formula:
When a pre-trained diffusion model with training resolution (h, w) is used to generate a high-resolution image of size (H, W), the atrous convolution reuses the weights of the original convolution kernel with a dilation factor of (H/h, W/w), which yields the ideal convolution kernel k'.
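One way to write this argument out is sketched below. The notation is assumed here rather than taken from the source: $F$ is a feature map of size $sh \times sw$, $\mathrm{Down}_s$ keeps every $s$-th sample along each axis, $\circledast$ is circular convolution, $\hat{k}$ is the $h \times w$ discrete Fourier transform of the zero-padded kernel $k$, and $\hat{k}'$ is the $sh \times sw$ transform of $k'$.

```latex
% Structural consistency between the default and the high resolution
\mathrm{Down}_s(F) \circledast k = \mathrm{Down}_s(F \circledast k')

% Downsampling by s folds (aliases) the spectrum, for 0 <= u < h and 0 <= v < w
\widehat{\mathrm{Down}_s(F)}(u, v)
  = \frac{1}{s^2} \sum_{i=0}^{s-1} \sum_{j=0}^{s-1} \hat{F}(u + ih,\; v + jw)

% Convolution theorem applied to both sides of the first line (1/s^2 cancels)
\hat{k}(u, v) \sum_{i,j} \hat{F}(u + ih,\; v + jw)
  = \sum_{i,j} \hat{F}(u + ih,\; v + jw)\, \hat{k}'(u + ih,\; v + jw)

% This holds for every F when the spectrum of k' is an s x s tiling of that of k
\hat{k}'(u + ih,\; v + jw) = \hat{k}(u, v), \qquad 0 \le i, j < s
```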
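The two properties described in this section can be checked numerically. The sketch below is a toy NumPy experiment, not the authors' implementation: it assumes circular convolution, a single-channel feature map, and an equal dilation factor `s` on both axes, and the helper names are made up for illustration.

```python
import numpy as np

def dilate_kernel(k, s):
    """Insert s-1 zeros between kernel taps (dilation factor s)."""
    kh, kw = k.shape
    out = np.zeros(((kh - 1) * s + 1, (kw - 1) * s + 1), dtype=k.dtype)
    out[::s, ::s] = k
    return out

def circular_conv2d(x, k):
    """Circular 2-D convolution via the FFT (kernel zero-padded to x.shape)."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k, s=x.shape)))

def downsample(x, s):
    """Keep every s-th sample along each axis."""
    return x[::s, ::s]

rng = np.random.default_rng(0)
s, h, w = 2, 8, 8                          # dilation factor and low resolution
F = rng.standard_normal((s * h, s * w))    # high-resolution feature map
k = rng.standard_normal((3, 3))            # "pre-trained" kernel
k_dil = dilate_kernel(k, s)                # candidate for the ideal kernel k'

# Structural consistency: Down(F) conv k == Down(F conv k')
lhs = circular_conv2d(downsample(F, s), k)
rhs = downsample(circular_conv2d(F, k_dil), s)
structure_consistent = np.allclose(lhs, rhs)

# Spectrum of the dilated kernel is an s x s tiling of the spectrum of k
spec_k = np.fft.fft2(k, s=(h, w))
spec_k_dil = np.fft.fft2(k_dil, s=(s * h, s * w))
spectrum_is_tiled = np.allclose(spec_k_dil, np.tile(spec_k, (s, s)))

print(structure_consistent, spectrum_is_tiled)
```

In a real U-Net the convolutions are zero-padded rather than circular, so the equality would be expected to hold away from the borders rather than everywhere.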
2. Low-pass filtering ensures scale consistency across resolutions
However, atrous convolution alone cannot solve the problem perfectly. As shown in the upper left corner of Figure 3, using only atrous convolution still leaves pattern repetition in the details. The authors attribute this to the frequency aliasing caused by spatial downsampling, which changes the frequency-domain components and leads to different frequency-domain distributions at different resolutions. To ensure scale consistency across resolutions, they introduce low-pass filtering to remove the high-frequency components and thereby eliminate the aliasing that follows spatial downsampling. As the comparison curves on the right side of Figure 3 show, with low-pass filtering the frequency distributions at high and low resolution are closer, which ensures a consistent scale. As the lower left corner of Figure 3 shows, low-pass filtering significantly reduces the pattern repetition in the details.
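The effect of low-pass filtering on aliasing can be illustrated with a small NumPy experiment. This is a sketch under assumptions: it uses an ideal (brick-wall) frequency mask, whereas the source only says "low-pass filtering", and the function name `ideal_low_pass` and the test signal are made up for illustration.

```python
import numpy as np

def ideal_low_pass(x, s):
    """Zero every DFT coefficient at or above 1/s of the Nyquist frequency."""
    H, W = x.shape
    fy = np.abs(np.fft.fftfreq(H))[:, None]   # cycles per sample, rows
    fx = np.abs(np.fft.fftfreq(W))[None, :]   # cycles per sample, columns
    mask = (fy < 0.5 / s) & (fx < 0.5 / s)
    return np.real(np.fft.ifft2(np.fft.fft2(x) * mask))

s, size = 2, 16
n = np.arange(size)
# One low-frequency component (1 cycle) and one high-frequency component (6 cycles)
row = np.cos(2 * np.pi * 1 * n / size) + np.cos(2 * np.pi * 6 * n / size)
x = np.tile(row, (size, 1))

spec_raw = np.fft.fft2(x[::s, ::s])                     # downsample directly
spec_lp = np.fft.fft2(ideal_low_pass(x, s)[::s, ::s])   # filter, then downsample

# Without filtering, the 6-cycle component folds onto bin 2 of the small spectrum
aliased_without_filter = abs(spec_raw[0, 2])
aliased_with_filter = abs(spec_lp[0, 2])
print(aliased_without_filter, aliased_with_filter)
```

The 1-cycle component survives in both cases, while the spurious energy at bin 2 appears only when the signal is downsampled without filtering.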
Figure 3: (a) Visual comparison with or without low-pass filtering. (b) Fourier relative logarithmic amplitude curve without low-pass filtering. (c) Fourier relative logarithmic amplitude curve with low-pass filtering.
3. Suitable for image generation of any size
The method above applies only when the aspect ratio of the generated resolution matches that of the default inference resolution. To adapt FouriScale to image generation of any size, the authors adopt a "fill then crop" strategy. Algorithm 1 gives the pseudocode of FouriScale combined with this strategy.
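The "fill then crop" step can be sketched as follows. This is an illustration under stated assumptions, not the authors' pseudocode: it assumes a single dilation factor `r` taken as the larger of the two per-axis ratios (rounded up), zero padding on the bottom and right, and a hypothetical callable `op` that stands in for the low-pass filter plus dilated convolution and returns an array of the same shape as its input.

```python
import math
import numpy as np

def fill_then_crop(feature, base_hw, op):
    """Pad to a multiple of the base size, apply op, crop back to the input size."""
    H, W = feature.shape
    h, w = base_hw
    # Assumption: one shared factor covering both axes
    r = max(math.ceil(H / h), math.ceil(W / w))
    # Zero-pad so the padded map has the same aspect ratio as the base resolution
    padded = np.pad(feature, ((0, r * h - H), (0, r * w - W)))
    out = op(padded, r)
    return out[:H, :W], r, padded.shape

# Usage with an identity op: a 96x160 feature map and a 64x64 base resolution
feature = np.ones((96, 160))
out, r, padded_shape = fill_then_crop(feature, (64, 64), lambda x, r: x)
print(r, padded_shape, out.shape)
```

With these numbers the per-axis ratios are 1.5 and 2.5, so the shared factor is 3 and the map is padded to 192×192 before being cropped back to 96×160.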
4. FouriScale guidance
The frequency-domain operations in FouriScale inevitably cause some loss of detail and undesirable artifacts in the generated images. To solve this problem, as shown in Figure 4, the authors propose using FouriScale as guidance. Specifically, in addition to the original conditional and unconditional generation estimates, they introduce an additional conditional generation estimate. This additional estimate is also generated with atrous convolution, but with a gentler low-pass filter so that details are not lost. At the same time, they replace the attention scores in this additional estimate with the attention scores from the conditional estimate produced by FouriScale. Since the attention scores carry the structural information of the generated image, this operation introduces the correct image structure while preserving image quality.
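The attention-score replacement can be illustrated with a toy single-head example in NumPy. This is only a sketch: the names `q_struct`, `k_struct`, and `v_detail` are hypothetical stand-ins for tensors from the FouriScale branch and from the gentler-filter branch, and a real U-Net would use multi-head attention inside its transformer blocks.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_scores(q, k):
    """Scaled dot-product attention scores for a single head."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def attend(scores, v):
    """Apply (possibly borrowed) attention scores to a value matrix."""
    return scores @ v

rng = np.random.default_rng(0)
tokens, dim = 6, 4
# Hypothetical queries/keys from the structure-preserving (FouriScale) branch
q_struct, k_struct = rng.standard_normal((2, tokens, dim))
# Hypothetical queries/keys/values from the detail-preserving branch
q_detail, k_detail, v_detail = rng.standard_normal((3, tokens, dim))

borrowed = attention_scores(q_struct, k_struct)   # structure information
own = attention_scores(q_detail, k_detail)        # what would be used otherwise

out_swapped = attend(borrowed, v_detail)   # detail values, borrowed structure
out_original = attend(own, v_detail)
print(out_swapped.shape)
```

The values still come from the detail-preserving branch, so only the spatial mixing pattern is taken from the other branch.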
Figure 4: (a) Diagram of FouriScale guidance. (b) An image generated without FouriScale guidance, showing obvious artifacts and detail errors. (c) An image generated with FouriScale guidance.
1. Quantitative test results
The authors followed the method of [1] and tested three text-to-image models (SD 1.5, SD 2.1, and SDXL 1.0) at four higher resolutions: 4x, 6.25x, 8x, and 16x the number of pixels of each model's training resolution. The results of randomly sampling 30000/10000 image-text pairs from LAION-5B are shown in Table 1:
Table 1: Comparison of quantitative results of different training-free methods
Their method achieved the best results for each pre-trained model and at each resolution.
2. Qualitative test results
As shown in Figure 5, for each pre-trained model their method maintains image generation quality and a consistent structure at different resolutions.
Figure 5: Comparison of images generated by different training-free methods
This paper proposes FouriScale to enhance the ability of pre-trained diffusion models to generate high-resolution images. Starting from a frequency-domain analysis, FouriScale improves structure and scale consistency across resolutions through atrous convolution and low-pass filtering, addressing key problems such as repeated patterns and structural distortion. The "fill then crop" strategy adapts the method to different aspect ratios, and using FouriScale as guidance improves the flexibility and quality of text-to-image generation. Quantitative and qualitative comparisons show that FouriScale achieves higher image generation quality across different pre-trained models and resolutions.