Table of Contents
Image-to-image translation
5 Most Promising AI Models for Image Translation
Pix2Pix
Unsupervised Image to Image Translation (UNIT)
Palette
Vision Transformers (ViT)
TransGAN

Five promising AI models for image translation

Apr 23, 2023, 10:55 AM

Image-to-image translation

According to the definition provided by Solanki, Nayyar, and Naved in their paper, image-to-image translation is the process of converting images from one domain to another, where the goal is to learn a mapping between input images and output images.

In other words, we want the model to transform an image a into another image b by learning a mapping function f.
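
As a minimal formalization (our notation, not taken from the paper): given two image domains A and B, the model searches for

```latex
f : A \to B, \qquad f^{*} = \arg\min_{f}\; \mathbb{E}_{(a,\,b)}\left[\,\mathcal{L}\big(f(a),\, b\big)\,\right]
```

where the loss L may be a reconstruction term, an adversarial term, or a combination, depending on the model.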


Some people may wonder what use these models have and how relevant they are in the world of artificial intelligence. The applications are many, and they are not limited to art or graphic design. For example, being able to take an image and convert it into another image to create synthetic data (such as a segmented image) is very useful for training self-driving car models. Another tested application is map design, where the model can perform the transformation in both directions (satellite view to map and vice versa). Image translation can also be applied to architecture, with models making recommendations on how to complete unfinished projects.

One of the most compelling applications of image translation is transforming a simple drawing into a beautiful landscape or painting.

5 Most Promising AI Models for Image Translation

Over the past few years, several methods have been developed to solve the problem of image-to-image translation by leveraging generative models. The most commonly used methods are based on the following architectures:

  • Generative Adversarial Network (GAN)
  • Variational Autoencoder (VAE)
  • Diffusion models
  • Transformers

Pix2Pix

Pix2Pix is a conditional-GAN-based model. This means that its architecture is composed of a generator network (G) and a discriminator (D). The two networks are trained in an adversarial game, where G's goal is to generate new images that resemble those in the dataset, and D has to decide whether an image is generated (fake) or comes from the dataset (real).

The main differences between Pix2Pix and other GAN models are: (1) the generator takes an image as input to start the generation process, whereas an ordinary GAN uses random noise; (2) Pix2Pix is a fully supervised model, meaning the dataset consists of paired images from the two domains.

The architecture described in the paper is defined by a U-Net for the generator and a Markovian Discriminator or Patch Discriminator for the discriminator:

  • U-Net: composed of two modules (downsampling and upsampling). The input image is reduced to a set of smaller feature maps using convolutional layers, which are then upsampled via transposed convolutions until the original input dimensions are recovered. Skip connections link the downsampling and upsampling paths.
  • Patch discriminator: a convolutional network whose output is a matrix in which each element is the evaluation of one part (patch) of the image. The training objective also includes the L1 distance between the generated and real images, which pushes the generator to learn the correct mapping for a given input. It is called Markovian because it relies on the assumption that pixels from different patches are independent. (A sketch of both ideas follows this list.)
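
A minimal PyTorch-style sketch of a patch discriminator and the combined generator objective. Layer sizes, the normalization choice, and the lambda weight are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Outputs a matrix of real/fake scores, one element per image patch."""
    def __init__(self, in_channels=6):                # input + candidate, concatenated
        super().__init__()
        def block(c_in, c_out, stride):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 4, stride, 1),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.2))
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            block(64, 128, 2), block(128, 256, 2), block(256, 512, 1),
            nn.Conv2d(512, 1, 4, 1, 1))               # a score grid, not a single scalar

    def forward(self, input_img, candidate_img):
        return self.net(torch.cat([input_img, candidate_img], dim=1))

# Generator objective: adversarial term + lambda * L1 distance to the ground truth.
adv = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
lam = 100.0  # illustrative weight

def generator_loss(disc, input_img, fake_img, real_img):
    pred = disc(input_img, fake_img)                  # per-patch scores on the fake
    return adv(pred, torch.ones_like(pred)) + lam * l1(fake_img, real_img)
```

Because the output is a grid of scores rather than one number, each score only "sees" a local patch of the image, which is exactly the Markovian assumption described above.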


Pix2Pix results

Unsupervised Image to Image Translation (UNIT)

In Pix2Pix, the training process is fully supervised (i.e., we need paired input images). The purpose of the UNIT method is to learn a function that maps an image from domain A to domain B without training on paired images.

The model starts from the assumption that the two domains (A and B) share a common latent space (Z). Intuitively, we can think of this latent space as an intermediate stage between image domains A and B. So, using the painting-to-image example, we can use the same latent space to go backwards to the painting or forwards to the stunning realistic image (see Figure X).

In the figure: (a) the shared latent space. (b) The UNIT architecture: X1 and X2 are images, E1 and E2 encoders, G1 and G2 generators, and D1 and D2 discriminators. Dashed lines represent shared layers between the networks.

The UNIT model is built on a pair of VAE-GAN architectures (see above), where the last layer of the encoders (E1, E2) and the first layer of the generators (G1, G2) are shared. A sketch of this weight sharing follows.
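
A minimal sketch of the weight-sharing idea (layer shapes are illustrative, and the VAE sampling and both discriminators are omitted): the very same layer objects serve as the last encoder stage and the first generator stage for both domains.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Domain-specific layers followed by a layer shared across domains."""
    def __init__(self, shared):
        super().__init__()
        self.private = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 256, 4, 2, 1), nn.ReLU())
        self.shared = shared                     # last encoder layer, tied

    def forward(self, x):
        return self.shared(self.private(x))      # -> latent code z in the shared space

class Generator(nn.Module):
    def __init__(self, shared):
        super().__init__()
        self.shared = shared                     # first generator layer, tied
        self.private = nn.Sequential(
            nn.ConvTranspose2d(256, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())

    def forward(self, z):
        return self.private(self.shared(z))

# Both domains reuse the same module instances, so their weights are tied.
shared_enc = nn.Sequential(nn.Conv2d(256, 256, 3, 1, 1), nn.ReLU())
shared_gen = nn.Sequential(nn.Conv2d(256, 256, 3, 1, 1), nn.ReLU())

E1, E2 = Encoder(shared_enc), Encoder(shared_enc)
G1, G2 = Generator(shared_gen), Generator(shared_gen)

x_a = torch.randn(1, 3, 64, 64)                  # an image from domain A
x_ab = G2(E1(x_a))                               # translate A -> B via the shared latent space
```

Translation in the other direction is symmetric: G1(E2(x_b)) maps a domain-B image back to domain A through the same shared latent space.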


UNIT results

Palette

Palette is a conditional diffusion model developed by a Google Research team in Canada. The model is trained to perform four different image-to-image tasks, achieving high-quality results:

(i) Colorization: adding color to grayscale images

(ii) Inpainting: filling a user-specified image region with realistic content

(iii) Uncropping: extending the image beyond its original frame

(iv) JPEG restoration: recovering images degraded by JPEG compression

In the paper, the authors compare a multi-task general model against multiple task-specialized models, all trained for one million iterations. The model's architecture is based on the class-conditional U-Net of Dhariwal and Nichol (2021), trained with a batch size of 1024 images for 1M steps. The noise schedule is treated as a tunable hyperparameter, and different schedules are used for training and inference. A simplified training step is sketched below.
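
A simplified sketch of one conditional-diffusion training step in the spirit of Palette. The denoiser interface and the specific schedule values are our assumptions; the actual model uses the U-Net of Dhariwal and Nichol (2021) with tuned schedules:

```python
import torch
import torch.nn.functional as F

def palette_training_step(denoiser, x_cond, x_target, T=1000):
    """One training step: the denoiser sees the conditioning image concatenated
    with a noisy version of the target and learns to predict the added noise."""
    b = x_target.size(0)
    t = torch.randint(0, T, (b,))                       # random timestep per sample
    # A simple linear noise schedule; Palette tunes schedules as hyperparameters.
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1)
    noise = torch.randn_like(x_target)
    x_noisy = alpha_bar.sqrt() * x_target + (1.0 - alpha_bar).sqrt() * noise
    pred = denoiser(torch.cat([x_cond, x_noisy], dim=1), t)  # conditioned by concatenation
    return F.mse_loss(pred, noise)
```

The conditioning image (grayscale input, masked image, cropped image, or compressed JPEG, depending on the task) stays clean; only the target is noised, which is what makes the same training loop serve all four tasks.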


Palette results

Vision Transformers (ViT)

Please note that although the following two models are not specifically designed for image translation, they are a clear step forward in bringing powerful models such as Transformers into the field of computer vision.

The Vision Transformer (ViT) is a modification of the Transformer architecture (Vaswani et al., 2017) developed for image classification. The model takes an image as input and outputs the probability of it belonging to each defined class.

The main problem is that Transformers are designed to take one-dimensional sequences as input, not two-dimensional matrices. To work around this, the authors propose splitting the image into small patches, treating the image as a sequence (a sentence, in NLP terms) and the patches as tokens (words).

Briefly, we can divide the whole process into three stages (a sketch follows the list):

1) Embedding: split the image into patches and flatten them → apply a linear transformation → prepend a class token (this token acts as a summary of the image and is used for classification) → add position embeddings.

2) Transformer encoder blocks: the embedded patches are passed through a series of transformer encoder blocks, where the attention mechanism learns which parts of the image to focus on.

3) Classification MLP head: the class token is passed through an MLP head, which outputs the final probability of the image belonging to each class.
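
A compact sketch of the three stages. Dimensions follow the common ViT-Base configuration (16✕16 patches, width 768, 12 layers), but this is an illustration of the pipeline, not the paper's exact code:

```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    """Stage 1: split into patches, project linearly, prepend the class token,
    and add position embeddings."""
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A strided convolution performs "split + flatten + linear" in one op.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                                   # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

# Stage 2: transformer encoder blocks; stage 3: MLP head on the class token.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12)
head = nn.Linear(768, 1000)

x = torch.randn(2, 3, 224, 224)
logits = head(encoder(ViTEmbedding()(x))[:, 0])             # class-token logits
```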

Advantage of using ViT: permutation invariance. Unlike a CNN, the Transformer is not affected by a translation (a change in the position of elements) within the image.

Disadvantage: a large amount of labeled data is required for training (at least 14M images).

TransGAN

TransGAN is a transformer-based GAN designed for image generation that does not use any convolutional layers. Instead, the generator and discriminator are composed of a series of Transformers connected by upsampling and downsampling blocks.

The generator's forward pass takes a one-dimensional array of random noise samples and passes it through an MLP. Intuitively, we can think of the array as a sentence and the pixel values as words (note that an array of 64 elements can be reshaped into an 8✕8 single-channel image). The authors then apply a series of transformer blocks, each followed by an upsampling layer that doubles the side of the array (image); a sketch follows.
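
A sketch of this pipeline. The stage count, dimensions, and nearest-neighbor upsampling are illustrative assumptions; the paper's exact upsampling and block configuration differ:

```python
import torch
import torch.nn as nn

class TinyTransGANGenerator(nn.Module):
    """Noise -> MLP -> transformer stages, each followed by an upsampling step
    that doubles the side of the token map."""
    def __init__(self, dim=64, noise_dim=128):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Linear(noise_dim, 8 * 8 * dim)        # seed an 8x8 token map
        self.stages = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(3))
        self.to_rgb = nn.Linear(dim, 3)

    def forward(self, z):                                   # z: (B, noise_dim)
        b, side = z.size(0), 8
        x = self.mlp(z).view(b, side * side, self.dim)
        for stage in self.stages:
            x = stage(x)                                    # tokens attend to each other
            # Reshape tokens into a 2D map and double its side.
            img = x.transpose(1, 2).view(b, self.dim, side, side)
            img = nn.functional.interpolate(img, scale_factor=2)
            side *= 2
            x = img.flatten(2).transpose(1, 2)
        return self.to_rgb(x).transpose(1, 2).view(b, 3, side, side)

img = TinyTransGANGenerator()(torch.randn(2, 128))          # -> (2, 3, 64, 64)
```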

A key feature of TransGAN is grid self-attention. For high-resolution images (i.e., very long arrays: 32✕32 = 1024 tokens), applying the transformer makes the cost of the self-attention mechanism explode, since every one of the 1024 tokens must be compared with every other token, a cost that grows quadratically with image size. So, instead of computing the correspondence between a given token and all other tokens, grid self-attention divides the full-size feature map into several non-overlapping grids and computes the token interactions within each local grid. A rough sketch follows.
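
A rough sketch of the grid idea (grid size and shapes are illustrative): rather than one global attention over all N tokens, attention runs independently inside each non-overlapping grid cell.

```python
import torch
import torch.nn as nn

def grid_self_attention(tokens, attn, h, w, grid=4):
    """tokens: (B, h*w, dim). Attention runs only inside each grid x grid cell,
    so cost per cell is (grid*grid)^2 instead of (h*w)^2 globally."""
    B, N, D = tokens.shape
    x = tokens.view(B, h, w, D)
    # Carve the feature map into non-overlapping grid cells.
    x = x.view(B, h // grid, grid, w // grid, grid, D)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, grid * grid, D)
    out, _ = attn(x, x, x)                          # local attention per cell
    # Undo the carving and restore the original token order.
    out = out.view(B, h // grid, w // grid, grid, grid, D)
    out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, N, D)
    return out

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
tokens = torch.randn(2, 32 * 32, 64)                # 1024 tokens from a 32x32 map
out = grid_self_attention(tokens, attn, h=32, w=32)
```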

The discriminator's architecture is very similar to that of the ViT described earlier.


TransGAN results on different datasets

