The recent DALL-E 2 released by OpenAI and Imagen released by Google have achieved stunning text-to-image generation results, attracting widespread attention and spawning many interesting applications. Text-to-image generation is a typical task in the field of multi-modal image synthesis and editing. Recently, researchers from the Max Planck Institute, Nanyang Technological University, and other institutions conducted a detailed survey and analysis of the current research status and future development of the broad field of multi-modal image synthesis and editing.
In Chapter 1, the review describes the significance and overall development of multi-modal image synthesis and editing tasks, as well as the contributions and overall structure of the paper.
In Chapter 2, based on the data modalities that guide image synthesis and editing, the review introduces the most commonly used forms of guidance: visual guidance (such as semantic maps, keypoint maps, and edge maps), text guidance, speech guidance, and scene-graph guidance, together with the corresponding data processing methods and a unified representation framework.
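To make the idea of a unified representation concrete, the sketch below shows one plausible way to map heterogeneous guidance modalities into a shared conditioning vector. This is an illustrative assumption, not the survey's specific framework; the module `UnifiedCondition`, the toy encoders, and all dimensions are hypothetical choices for demonstration.

```python
# Illustrative sketch (not the survey's exact framework): mapping heterogeneous
# guidance modalities into one shared conditioning space.
import torch
import torch.nn as nn

class UnifiedCondition(nn.Module):
    def __init__(self, vocab_size=10000, cond_dim=256):
        super().__init__()
        # Visual guidance (semantic/edge/keypoint map): a small CNN encoder.
        self.visual_enc = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, cond_dim),
        )
        # Text guidance: token embeddings mean-pooled into a single vector.
        self.text_emb = nn.Embedding(vocab_size, cond_dim)

    def forward(self, semantic_map=None, text_tokens=None):
        parts = []
        if semantic_map is not None:            # (B, 1, H, W)
            parts.append(self.visual_enc(semantic_map))
        if text_tokens is not None:             # (B, T) integer token ids
            parts.append(self.text_emb(text_tokens).mean(dim=1))
        # Average the available modalities into one conditioning vector.
        return torch.stack(parts).mean(dim=0)

cond = UnifiedCondition()
c = cond(semantic_map=torch.randn(2, 1, 64, 64),
         text_tokens=torch.randint(0, 10000, (2, 16)))
print(c.shape)  # torch.Size([2, 256])
```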
In Chapter 3, according to the model framework used for image synthesis and editing, the paper classifies current methods into GAN-based methods, autoregressive methods, diffusion-model-based methods, and Neural Radiance Field (NeRF)-based methods.
Since GAN-based methods generally rely on either conditional GANs or inversion of unconditional GANs, the paper further divides this category into intra-modal conditions (e.g., semantic maps and edge maps), cross-modal conditions (e.g., text and speech), and GAN inversion (which unifies modalities), and describes each in detail.
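As a concrete illustration of the GAN inversion idea, the minimal sketch below optimizes a latent code so that a pretrained generator reproduces a target image; the recovered latent can then be edited and decoded back through the generator. The generator `G`, latent dimension, and hyperparameters here are placeholders rather than any specific model from the survey; practical systems typically add perceptual losses (e.g., LPIPS) on top of the pixel loss.

```python
# Minimal optimization-based GAN-inversion sketch, assuming a pretrained
# generator G mapping latents to images; all sizes are illustrative.
import torch

def invert(G, target, latent_dim=128, steps=500, lr=0.05):
    z = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Pixel-space reconstruction loss between G(z) and the target image.
        loss = torch.nn.functional.mse_loss(G(z), target)
        loss.backward()
        opt.step()
    return z.detach()  # edit this latent, then decode with G

# Toy stand-in generator for demonstration only.
G = torch.nn.Sequential(torch.nn.Linear(128, 3 * 32 * 32),
                        torch.nn.Unflatten(1, (3, 32, 32)))
z_hat = invert(G, target=torch.rand(1, 3, 32, 32))
```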
Compared with GAN-based methods, autoregressive methods can handle multi-modal data more naturally and can leverage the currently popular Transformer architecture. Autoregressive methods generally first learn a vector-quantization encoder to represent images discretely as token sequences, and then autoregressively model the distribution of those tokens. Since data such as text and speech can also be represented as tokens and used as conditions for autoregressive modeling, various multi-modal image synthesis and editing tasks can be unified within a single framework.
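A schematic sketch of this two-stage pipeline is given below: image tokens (as would be produced by a pretrained VQ encoder) are concatenated after the text tokens, and a causally masked Transformer predicts the next token over the joint vocabulary. All vocabulary sizes and model dimensions are hypothetical, and a real system would use a pretrained VQ-VAE/VQGAN encoder rather than random tokens.

```python
# Schematic sketch of the autoregressive text-to-image pipeline described
# above; vocabularies, dimensions, and depths are illustrative only.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB, DIM = 1000, 1024, 256

class TextToImageAR(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared vocabulary: text token ids first, image ids shifted after.
        self.emb = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, text_tokens, image_tokens):
        # Condition on text by concatenating the two token streams.
        seq = torch.cat([text_tokens, image_tokens + TEXT_VOCAB], dim=1)
        T = seq.size(1)
        # Causal mask so each position attends only to earlier tokens.
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.transformer(self.emb(seq), mask=causal)
        return self.head(h)  # next-token logits over the joint vocabulary

model = TextToImageAR()
logits = model(torch.randint(0, TEXT_VOCAB, (2, 16)),    # text tokens
               torch.randint(0, IMAGE_VOCAB, (2, 64)))   # VQ image tokens
print(logits.shape)  # torch.Size([2, 80, 2024])
```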
The above methods mainly focus on multi-modal synthesis and editing of 2D images. With the recent rapid development of Neural Radiance Fields (NeRF), 3D-aware multi-modal synthesis and editing has attracted increasing attention. It is a more challenging task because multi-view consistency must be taken into account. The paper classifies and summarizes existing work along three lines: single-scene optimization NeRF, generative NeRF, and NeRF inversion.
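For intuition about single-scene optimization, the sketch below shows the common pattern of optimizing scene parameters against a CLIP similarity loss for a text prompt. The "renderer" here is a differentiable toy stand-in (a real pipeline would use volumetric NeRF rendering from sampled camera views), and the example assumes OpenAI's `clip` package is installed.

```python
# Sketch of CLIP-guided single-scene optimization. The renderer is a toy
# placeholder for differentiable NeRF rendering; hyperparameters are arbitrary.
import torch
import clip

model, _ = clip.load("ViT-B/32", device="cpu")
for p in model.parameters():          # CLIP stays frozen; only the scene moves
    p.requires_grad_(False)
text = clip.tokenize(["a red chair"])

# Hypothetical scene parameters and toy "render" (placeholder for NeRF).
scene = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([scene], lr=0.01)

for _ in range(50):
    opt.zero_grad()
    image = torch.sigmoid(scene)      # toy render: parameters -> image
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    # Maximize cosine similarity between the rendered view and the prompt.
    loss = 1 - torch.nn.functional.cosine_similarity(img_feat, txt_feat).mean()
    loss.backward()
    opt.step()
```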
Subsequently, the review compares and discusses the four categories of methods above. Overall, current state-of-the-art models tend to favor autoregressive and diffusion models over GANs, while the application of NeRF to multi-modal synthesis and editing tasks opens a new window for research in this field.
In Chapter 4, the review collects the popular datasets in the field of multi-modal synthesis and editing, together with their corresponding modality annotations, and quantitatively compares current methods on typical tasks for each modality (semantic image synthesis, text-to-image synthesis, and speech-guided image editing).
In Chapter 5, the review discusses and analyzes the current challenges and future directions in this field, including large-scale multi-modal datasets, accurate and reliable evaluation metrics, efficient network architectures, and the development of 3D-aware synthesis.
In Chapters 6 and 7, the review elaborates on the potential social impact of this field and summarizes the content and contributions of the article, respectively.