There are many methods for high-quality image editing, but it remains difficult for them to accurately express the real physical world. Hence the idea: edit the world.
Researchers from Peking University, Tiamat AI, Tiangong AI, and Mila proposed EditWorld, which introduces a new editing task: world-instructed image editing. The task defines and categorizes instructions based on various world scenarios.
Supported by a set of pre-trained models such as GPT-3.5, Video-LLaVA, and SDXL, the team built a multimodal dataset of world instructions.
A diffusion-based image editing model, EditWorld, was trained on this dataset, and its performance on the new task is significantly better than that of existing editing methods, achieving SOTA.
Existing methods achieve high-quality image editing in a variety of ways, including but not limited to text control, dragging operations, and inpainting. Among them, instruction-based editing has received widespread attention due to its ease of use.
Although image editing methods can produce high-quality results, they still struggle to handle world dynamics, that is, edits that convey the genuine visual dynamics of the physical world.
As shown in Figure 1, neither InstructPix2Pix nor MagicBrush can generate reasonable editing results for such instructions.
To solve this problem, the team introduced a new task, world-instructed image editing, to enable image editing to reflect "world dynamics" in the real physical world and in virtual media.
Specifically, they defined and classified various world-dynamics instructions and, based on them, created a new multimodal training dataset containing a large number of input-instruction-output triples.
Finally, the team trained a text-guided diffusion model using a carefully crafted dataset and proposed a zero-shot image manipulation strategy to achieve world-instructed image editing.
Based on task scenarios in the real world and virtual media, world-instructed image editing is divided into seven categories; each category is defined, explained, and illustrated with a data sample.
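For concreteness, a single training sample might look roughly like this (field names and values are hypothetical illustrations, not the paper's released schema):

```python
# Illustrative only: keys and values are made up to show the triple structure.
sample_triple = {
    "input_image": "candle_lit.png",                   # source image
    "instruction": "The wind blows out the candle",    # world-dynamics instruction
    "output_image": "candle_extinguished.png",         # edited target image
    "category": "physical state change",               # one of the seven categories (name assumed)
}
```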
The team then designed two branches to build the dataset: text-to-image generation and video storyboard extraction.
The text-to-image generation branch serves to enrich the diversity of data scenes. In this branch, the team first uses GPT to generate text quadruples (input image description, instruction, output image description, and keywords), then generates images corresponding to the input and output descriptions, and uses the attention map associated with the keywords to locate the editing position and obtain the edit mask. To keep the key features of the two images consistent, the team introduces IP-Adapter, an image prompt adaptation method. Finally, combining IP-Adapter and ControlNet with the canny map of the output image and the image prompt features of the input image, the team uses image inpainting to adjust the output image and obtain more effective editing data, as sketched below.
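The article does not include the pipeline code, so the refinement step can only be sketched. Below is a minimal sketch using diffusers, assuming an SD 1.5 canny-ControlNet inpainting pipeline rather than the SDXL models the team actually used; model names, thresholds, and the helper `refine_output` are illustrative assumptions:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline

# Canny ControlNet conditions generation on the output image's edge structure.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# IP-Adapter injects image-prompt features taken from the *input* image.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

def refine_output(input_img: Image.Image, raw_output: Image.Image,
                  edit_mask: Image.Image, output_caption: str) -> Image.Image:
    # Edge map of the generated output preserves its overall layout.
    edges = cv2.Canny(np.array(raw_output.convert("RGB")), 100, 200)
    canny = Image.fromarray(np.stack([edges] * 3, axis=-1))
    # Inpaint only inside the edit mask, guided by the edges and input-image features.
    return pipe(
        prompt=output_caption,
        image=raw_output,
        mask_image=edit_mask,
        control_image=canny,
        ip_adapter_image=input_img,
    ).images[0]
```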
After obtaining scene-rich data from the text-to-image branch, the team added real data to the dataset by extracting high-quality editing data from video keyframes. Specifically, they selected two frames from a video storyboard that are strongly correlated yet structurally quite different to serve as the first and last frames, cut out a new storyboard between them, and used a large multimodal model to describe the change in that storyboard. The first and last frames then serve as the input and output images, and the resulting description serves as the instruction, yielding the required editing data.
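The exact frame-pair selection criteria are not spelled out in the article. One plausible way to score candidate pairs as "semantically correlated but structurally different" is sketched below; the CLIP model choice, the pixel-difference metric, and both thresholds are assumptions, not the paper's method:

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_embed(img: Image.Image) -> torch.Tensor:
    # L2-normalized CLIP image embedding, used as a semantic descriptor.
    inputs = processor(images=img, return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)
    return feat / feat.norm(dim=-1, keepdim=True)

def is_good_pair(first: Image.Image, last: Image.Image,
                 sim_thresh: float = 0.8, diff_thresh: float = 0.25) -> bool:
    # Strong correlation: high cosine similarity between CLIP embeddings.
    semantic_sim = float(clip_embed(first) @ clip_embed(last).T)
    # Large structural difference: mean absolute pixel change at low resolution.
    a = np.asarray(first.convert("RGB").resize((256, 256)), dtype=np.float32) / 255
    b = np.asarray(last.convert("RGB").resize((256, 256)), dtype=np.float32) / 255
    structural_diff = float(np.abs(a - b).mean())
    return semantic_sim > sim_thresh and structural_diff > diff_thresh
```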
Going a step further, the team manually rechecks the generated data to further improve its quality.
The team used the dataset to fine-tune InstructPix2Pix. To protect non-edited regions and achieve more precise edits, it also proposed a post-edit strategy.
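The post-edit strategy itself is not detailed in this article. A common way to protect non-edited regions is to composite the model's output back onto the original image outside the edit mask, roughly as follows (a generic sketch under that assumption, not the paper's exact method):

```python
import numpy as np
from PIL import Image

def protect_non_edit_region(original: Image.Image, edited: Image.Image,
                            edit_mask: Image.Image) -> Image.Image:
    """Keep original pixels outside the edit mask and edited pixels inside it."""
    orig = np.asarray(original.convert("RGB"), dtype=np.float32)
    edit = np.asarray(edited.convert("RGB").resize(original.size), dtype=np.float32)
    mask = np.asarray(edit_mask.convert("L").resize(original.size), dtype=np.float32) / 255
    mask = mask[..., None]  # broadcast the single-channel mask over RGB
    blended = mask * edit + (1 - mask) * orig
    return Image.fromarray(blended.astype(np.uint8))
```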
Finally, it can be seen that the team's approach works well for world-instructed image editing.
Paper link:
https://www.php.cn/link/154d7da9e669c75ee317d46614381dd8
Code link:
https://www.php.cn/link/e6da32eef072f987685b6eddca072d4f