Image generation is one of the most popular directions in the current AIGC field. Recently released image generation models such as DALL·E 2, Imagen, and Stable Diffusion have ushered in a new era of image generation, achieving unprecedented image quality and model flexibility, and diffusion models have become the dominant paradigm. However, diffusion models rely on iterative inference, which is a double-edged sword: iterative methods enable stable training with simple objectives, but inference is computationally expensive.
Before diffusion models, generative adversarial networks (GANs) were the common backbone of image generation models. Compared with diffusion models, GANs generate an image in a single forward pass and are therefore inherently more efficient, but because of training instability, scaling GANs requires careful tuning of the network architecture and training choices. As a result, GANs excel at modeling single or multiple object classes but are extremely difficult to scale to complex datasets (let alone open-world images), and very large models, datasets, and compute budgets are now devoted to diffusion and autoregressive models instead.
Still, because GANs are such an efficient way to generate images, many researchers have not abandoned them. For example, NVIDIA recently proposed the StyleGAN-T model, and researchers from The Chinese University of Hong Kong used GAN-based methods to generate smooth videos. These are further attempts by computer vision researchers to push GANs forward.
Now, in a CVPR 2023 paper, researchers from POSTECH, Carnegie Mellon University, and Adobe Research jointly explored several important questions about GANs, in particular whether they can continue to scale up to large models and datasets.
It is worth noting that Jun-Yan Zhu, the lead author of CycleGAN and winner of the 2018 ACM SIGGRAPH Outstanding Doctoral Dissertation Award, is the second author of this CVPR paper.
The study first ran experiments with StyleGAN2 and observed that naively scaling up the backbone leads to unstable training. Based on this, the researchers identified several key issues and proposed techniques that stabilize training while increasing model capacity.
First, the study effectively expands the capacity of the generator by retaining a bank of filters and taking a sample-specific linear combination of them. The study also adopts several techniques that are common in the diffusion-model context and confirms that they bring similar benefits to GANs; for example, interleaving self-attention (image only) and cross-attention (image-text) with the convolutional layers improves model performance.
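For intuition, here is a minimal PyTorch sketch of sample-specific filter selection. The bank size, the softmax weighting, and the grouped-convolution trick are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveKernelConv(nn.Module):
    """Conv layer whose kernel is a per-sample mixture of a learned filter bank."""
    def __init__(self, in_ch, out_ch, k=3, bank_size=8, style_dim=512):
        super().__init__()
        # Bank of candidate filters; each sample selects its own soft combination.
        self.bank = nn.Parameter(torch.randn(bank_size, out_ch, in_ch, k, k) * 0.02)
        self.to_logits = nn.Linear(style_dim, bank_size)  # style code -> mixing logits
        self.pad = k // 2

    def forward(self, x, style):
        b, c, h, w = x.shape
        mix = F.softmax(self.to_logits(style), dim=-1)            # (B, bank_size)
        kernel = torch.einsum('bn,noihw->boihw', mix, self.bank)  # per-sample kernels
        # Grouped-conv trick: fold the batch into the channel dimension so each
        # sample is convolved with its own kernel in a single conv2d call.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       kernel.reshape(-1, c, *kernel.shape[-2:]),
                       groups=b, padding=self.pad)
        return out.reshape(b, -1, h, w)
```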
The research also reintroduces multi-scale training and proposes a new scheme that improves image-text alignment and the low-frequency detail of the outputs. Multi-scale training lets the GAN-based generator use the parameters in its low-resolution blocks more efficiently, leading to better image-text alignment and image quality. After careful tuning, the study arrives at GigaGAN, a one-billion-parameter model that trains stably and scalably on large datasets such as LAION2B-en. The experimental results are shown in Figure 1 below.
In addition, the study adopts a multi-stage approach [14, 104]: an image is first generated at a low 64 × 64 resolution and then upsampled to 512 × 512. Both networks are modular and powerful enough to be used in a plug-and-play manner.
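In pseudocode, the two-stage pipeline amounts to something like the following; `base_generator` and `upsampler` are hypothetical handles for the two networks, not the released interfaces.

```python
import torch

def sample(base_generator, upsampler, text_cond, n=4):
    """Illustrative two-stage sampling: low-res synthesis followed by upsampling."""
    z = torch.randn(n, 128)                    # latent code z ~ N(0, 1)
    low_res = base_generator(z, text_cond)     # (n, 3, 64, 64)
    high_res = upsampler(low_res, text_cond)   # (n, 3, 512, 512)
    return high_res
```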
The study also shows that the text-conditioned GAN upsampling network can serve as an efficient, higher-quality upsampler for base diffusion models, as shown in Figures 2 and 3 below.
These improvements put GigaGAN far beyond previous GANs: it is 36 times larger than StyleGAN2 and 6 times larger than StyleGAN-XL and XMC-GAN. While GigaGAN's parameter count of one billion (1B) is still lower than that of recent large synthesis models such as Imagen (3.0B), DALL·E 2 (5.5B), and Parti (20B), the researchers say they have not yet observed quality saturation with respect to model size.
GigaGAN achieves a zero-shot FID of 9.09 on the COCO2014 dataset, which is lower than that of DALL·E 2, Parti-750M, and Stable Diffusion.
In addition, compared with diffusion and autoregressive models, GigaGAN has three major practical advantages. First, it is dozens of times faster, producing a 512 × 512 image in 0.13 seconds (Figure 1). Second, it can synthesize ultra-high-resolution 4K images in just 3.66 seconds. Third, it has a controllable latent vector space that supports well-studied controllable image synthesis applications such as style mixing (Figure 6), prompt interpolation (Figure 7), and prompt mixing (Figure 8); a sketch of prompt interpolation follows below.
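To make the latent-space advantage concrete, here is a hedged sketch of prompt interpolation; `mapping_network` and `generator` are illustrative stand-ins for the real model interfaces, and the linear blend of text embeddings is an assumption.

```python
import torch

def prompt_interpolation(generator, mapping_network, text_emb_a, text_emb_b, steps=8):
    """Decode images along a linear path between two text conditions.

    `generator` and `mapping_network` are hypothetical handles to the synthesis
    network and the style mapping network; the real interfaces may differ.
    """
    z = torch.randn(1, 128)                      # keep the latent code fixed
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        text_emb = (1 - t) * text_emb_a + t * text_emb_b
        w = mapping_network(z, text_emb)         # style code for this blend
        frames.append(generator(w))
    return torch.cat(frames, dim=0)
```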
This study successfully trained GigaGAN, a GAN-based billion-parameter model, on billions of real-world images. The result suggests that GANs remain a viable option for text-to-image synthesis and that researchers should consider them for aggressive future scaling.
Method overview

The researchers train a generator G(z, c) that, given a latent code z ∼ N(0, 1) ∈ R^128 and a text conditioning signal c, predicts an image x ∈ R^(H×W×3). They use a discriminator D(x, c) to judge the realism of the generated images relative to samples from the training dataset D, which consists of image-text pairs.
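For readers unfamiliar with the setup, a minimal text-conditional GAN training step might look like the sketch below, using the common non-saturating loss; the paper's full objective includes additional terms that are omitted here.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, images, text_cond, opt_g, opt_d):
    """Minimal sketch of one training step for G(z, c) and D(x, c) with the
    non-saturating GAN loss; the paper's full objective adds further terms."""
    z = torch.randn(images.shape[0], 128, device=images.device)

    # Discriminator update: real image-text pairs vs. generated images.
    fake = G(z, text_cond).detach()
    d_loss = (F.softplus(-D(images, text_cond)).mean()
              + F.softplus(D(fake, text_cond)).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: try to fool the discriminator.
    g_loss = F.softplus(-D(G(z, text_cond), text_cond)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```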
Although GANs can successfully generate realistic images on single- and multi-class datasets, open-vocabulary text-conditional synthesis on Internet images remains challenging. The researchers hypothesize that the current limitation stems from the reliance on convolutional layers: the same convolution filters have to model a universal image synthesis function for all text conditions at all locations in the image. In view of this, the researchers inject more expressiveness into the parameterization by dynamically selecting convolution filters based on the input condition and by capturing long-range dependencies through attention mechanisms.
GigaGAN's high-capacity text-to-image generator is shown in Figure 4 below. First, a pre-trained CLIP model and a learned encoder T extract text embeddings. The local text descriptors are fed to the generator via cross-attention, while the global text descriptor, together with the latent code z, is fed into the style mapping network M to produce the style code w. The style code modulates the main generator using the paper's adaptive kernel selection, shown on the right of the figure.
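Conceptually, the conditioning path could be sketched as follows; the module sizes and the choice of taking the last token as the global descriptor are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TextConditioning(nn.Module):
    """Sketch of the conditioning path: frozen CLIP features -> learned encoder T;
    global token + z -> mapping network M -> style code w;
    local tokens -> cross-attention inside the generator."""
    def __init__(self, clip_dim=768, style_dim=512, z_dim=128, n_layers=4):
        super().__init__()
        self.encoder_T = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=clip_dim, nhead=8, batch_first=True),
            num_layers=n_layers)
        self.mapping_M = nn.Sequential(
            nn.Linear(clip_dim + z_dim, style_dim), nn.LeakyReLU(0.2),
            nn.Linear(style_dim, style_dim))

    def forward(self, clip_tokens, z):
        tokens = self.encoder_T(clip_tokens)                 # (B, L, clip_dim)
        # Assumption: last token acts as the global descriptor, the rest are local.
        t_local, t_global = tokens[:, :-1], tokens[:, -1]
        w = self.mapping_M(torch.cat([t_global, z], dim=-1))  # style code w
        return t_local, w    # cross-attention keys/values, style code
```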
The generator outputs an image pyramid by converting intermediate features into RGB images. To achieve higher capacity, multiple attention and convolutional layers are used at each scale (Appendix A2). The researchers also use a separate upsampler model, which is not shown in this figure.
The discriminator consists of two branches that process the image and the text conditioning t_D. The text branch handles text similarly to the generator (Figure 4). The image branch receives an image pyramid and makes independent predictions at each image scale; in addition, predictions are made at all subsequent scales of the downsampling layers, making it a multi-scale input, multi-scale output (MS-I/O) discriminator.
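A rough sketch of such an MS-I/O discriminator is given below; the layer widths and the projection-style text conditioning are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSIODiscriminator(nn.Module):
    """Sketch of a multi-scale input / multi-scale output discriminator:
    every pyramid level is injected, and a text-conditioned prediction is made
    after every downsampling block. Widths and depths are illustrative."""
    def __init__(self, widths=(64, 128, 256, 512), text_dim=512):
        super().__init__()
        n = len(widths) - 1
        self.from_rgb = nn.ModuleList([nn.Conv2d(3, widths[i], 1) for i in range(n)])
        self.down = nn.ModuleList(
            [nn.Conv2d(widths[i], widths[i + 1], 3, stride=2, padding=1) for i in range(n)])
        self.heads = nn.ModuleList([nn.Conv2d(widths[i + 1], 1, 1) for i in range(n)])
        self.text_proj = nn.ModuleList([nn.Linear(text_dim, widths[i + 1]) for i in range(n)])

    def forward(self, pyramid, t_D):
        # `pyramid` holds images from high to low resolution (e.g. 64, 32, 16 px).
        logits, feat = [], 0.0
        for i, img in enumerate(pyramid):
            x = self.from_rgb[i](img) + feat          # inject this scale, fuse features
            feat = F.leaky_relu(self.down[i](x), 0.2)
            cond = self.text_proj[i](t_D)[:, :, None, None]
            # Projection-style conditioning: unconditional logit + text-feature inner product.
            logits.append(self.heads[i](feat) + (cond * feat).sum(1, keepdim=True))
        return logits
```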
Experimental results
In the paper, the authors report five different experiments.
In the first experiment, they demonstrate the effectiveness of the proposed method through an ablation that incorporates each technical component one by one.
In the second experiment, they test the model's text-to-image generation capability; the results show that GigaGAN achieves an FID comparable to Stable Diffusion (SD-v1.5) while producing results much faster than diffusion or autoregressive models.
In the third experiment, they compare GigaGAN with distillation-based diffusion models; the results show that GigaGAN can synthesize higher-quality images faster than distilled diffusion models.
In the fourth experiment, they verify the advantages of GigaGAN's upsampler over other upsamplers in both conditional and unconditional super-resolution tasks.
Finally, they show that their large-scale GAN model still enjoys the continuous and disentangled latent-space manipulations of GANs, enabling new image-editing modes; see Figures 6 and 8 above for examples.