Recently, Peking University, Stanford, and the popular Pika Labs jointly published a study that takes large-model text-to-image generation to a new level.
Paper address: https://arxiv.org/pdf/2401.11708.pdf
Code address: https://github.com/YangLing0818/RPG-DiffusionMaster
The paper's authors propose an innovative method that uses the reasoning capabilities of multimodal large language models (MLLMs) to improve text-to-image generation and editing.
In other words, the method aims to improve how text-to-image models handle complex prompts containing multiple attributes, relationships, and objects.
Without further ado, here’s the picture:
A green twintail girl in orange dress is sitting on the sofa while a messy desk under a big window on the left, a lively aquarium is on the top right of the sofa, realistic style.
Faced with multiple objects in complex relationships, the model lays out the overall composition and the relationships between people and objects very sensibly, which is genuinely eye-catching.
And for the same prompt, let's take a look at how the current state-of-the-art SDXL and DALL·E 3 perform:
Now let's see how the new framework performs when binding multiple attributes to multiple objects:
From left to right, a blonde ponytail European girl in white shirt, a brown curly hair African girl in blue shirt printed with a bird, an Asian young man with black short hair in suit are walking in the campus happily.
The researchers named this framework RPG (Recaption, Plan and Generate): it uses the MLLM as a global planner to decompose the complex image generation process into multiple simpler generation tasks over sub-regions.
The paper proposes complementary regional diffusion for region-wise compositional generation, and integrates text-guided image generation and editing into the RPG framework in a closed-loop fashion, enhancing its ability to generalize.
Experiments show that the proposed RPG framework outperforms the current state-of-the-art text-to-image diffusion models, including DALL·E 3 and SDXL, especially in multi-category object composition and text-image semantic alignment.
It is worth noting that the RPG framework is widely compatible with various MLLM architectures (such as MiniGPT-4) and diffusion backbone networks (such as ControlNet).
RPG
Current text-to-image models mainly suffer from two problems: 1. layout-based or attention-based methods can only provide coarse spatial guidance and struggle with overlapping objects; 2. feedback-based methods require collecting high-quality feedback data and incur extra training costs.
To address these problems, the researchers propose the three core strategies of RPG, shown in the figure below:
Given a complex text prompt containing multiple entities and relationships, the MLLM first decomposes it into a base prompt and highly descriptive sub-prompts; next, the multimodal model's chain-of-thought (CoT) planning divides the image space into complementary sub-regions; finally, complementary regional diffusion generates each sub-region independently and aggregates them at every sampling step.
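To make the three stages concrete, here is a minimal Python skeleton under our own naming; none of these functions is the repository's actual API, and the callables are passed in precisely because their real implementations live in the paper's code. The recaption step and the per-step aggregation are sketched in more detail further below.

```python
# Hypothetical skeleton of RPG's three stages; every callable here is a
# placeholder for the MLLM calls and the diffusion backbone, not real API.
def rpg_generate(prompt, recaption, plan_regions, init_latent, denoise, decode,
                 num_steps=50):
    sub_prompts = recaption(prompt)        # 1. recaption: MLLM splits the prompt
    plan = plan_regions(sub_prompts)       # 2. plan: CoT divides the canvas
    latent = init_latent()                 # random initial latent
    for t in range(num_steps):             # 3. generate: complementary regional
        latent = aggregate_step(latent, plan, denoise, t)  # diffusion (sketched below)
    return decode(latent)                  # VAE decode to pixels
```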
The first strategy converts the text prompt into highly descriptive prompts, providing information-enriched prompt comprehension and better semantic alignment in the diffusion model.
The MLLM first identifies the key phrases in the user prompt y and extracts the sub-prompts:
An LLM then decomposes the text prompt into distinct sub-prompts and re-describes each in greater detail:
In this way, denser fine-grained detail is generated for each sub-prompt, which effectively improves the fidelity of the generated image and reduces the semantic gap between prompt and image.
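As a rough illustration of the recaption step, the snippet below asks GPT-4 (one of the MLLMs the paper mentions) through the OpenAI Python client to split a prompt into descriptive sub-prompts. The instruction wording is ours, not the paper's prompt template.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def recaption(user_prompt: str) -> list[str]:
    """Ask the MLLM to split a complex prompt into descriptive sub-prompts."""
    instruction = (
        "List the key entities in the following prompt. For each entity, "
        "write one highly descriptive sub-prompt on its own line.\n\n"
        f"Prompt: {user_prompt}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": instruction}],
    )
    # Treat each non-empty line of the reply as one sub-prompt.
    return [s.strip() for s in resp.choices[0].message.content.splitlines() if s.strip()]
```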
The second strategy divides the image space into complementary sub-regions and assigns a different sub-prompt to each, breaking the generation task into multiple simpler sub-tasks.
Specifically, the image space H × W is divided into several complementary regions, and each enhanced sub-prompt is assigned to a specific region R:
The MLLM's powerful chain-of-thought reasoning is used to divide regions effectively: by analyzing the recaptioned intermediate results, it generates detailed rationales and precise instructions for the subsequent image synthesis.
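One simple way to represent the planner's output is as normalized rectangles that tile the canvas, each bound to a sub-prompt. The data structure below is our illustration; the actual code may encode regions differently.

```python
from dataclasses import dataclass

@dataclass
class Region:
    top: float        # all coordinates are fractions of H and W, in [0, 1]
    left: float
    height: float
    width: float
    sub_prompt: str

# Example plan for the sofa/desk/aquarium prompt above.
plan = [
    Region(0.0, 0.0, 1.0, 0.5, "a messy desk under a big window"),
    Region(0.0, 0.5, 0.4, 0.5, "a lively aquarium"),
    Region(0.4, 0.5, 0.6, 0.5, "a green twintail girl in orange dress on a sofa"),
]
# "Complementary" means the regions cover the canvas without overlapping.
assert abs(sum(r.height * r.width for r in plan) - 1.0) < 1e-6
```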
Within each rectangular sub-region, the content guided by its sub-prompt is generated independently, then resized and concatenated, merging the sub-regions spatially.
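A minimal sketch of the per-step aggregation, assuming a latent-diffusion backbone: each region is denoised under its own sub-prompt, resized to the region's extent, and pasted into the full latent. `denoise` stands in for a single sampler step; this simplifies the paper's actual resize-and-concatenate scheme.

```python
import torch.nn.functional as F

def aggregate_step(latent, plan, denoise, t):
    """One (simplified) sampling step of complementary regional diffusion.

    latent: (1, C, h, w) full-image latent; plan: list of Region above.
    """
    _, _, h, w = latent.shape
    merged = latent.clone()
    for r in plan:
        y0, x0 = round(r.top * h), round(r.left * w)
        rh, rw = round(r.height * h), round(r.width * w)
        region_latent = denoise(latent, r.sub_prompt, t)   # denoise under this sub-prompt
        patch = F.interpolate(region_latent, size=(rh, rw), mode="bilinear")
        merged[:, :, y0:y0 + rh, x0:x0 + rw] = patch       # paste into its sub-region
    return merged
```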
This approach effectively resolves the difficulty large models have with overlapping objects. Furthermore, the paper extends the framework to editing tasks, employing contour-based regional diffusion to operate precisely on the inconsistent regions that need modification.
As shown in the figure above, in the recaption stage RPG uses the MLLM as a captioner to recaption the source image, leveraging its strong reasoning ability to identify fine-grained semantic differences between the image and the target prompt, i.e., to directly analyze how the input image aligns with the target prompt.
The MLLM (GPT-4, Gemini Pro, etc.) checks the differences between input and target with respect to numerical accuracy, attribute binding, and object relationships. The resulting multimodal understanding feedback is then handed back to the MLLM to plan the edits.
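Conceptually, the closed loop can be sketched as below. All three callables are placeholders for the MLLM feedback step and the contour-based regional diffusion described above, not real functions from the repo.

```python
# Hypothetical sketch of the closed editing loop.
def edit_loop(image, target_prompt, caption_and_diff, plan_edits, regional_edit,
              max_rounds=3):
    for _ in range(max_rounds):
        feedback = caption_and_diff(image, target_prompt)  # semantic gaps found by the MLLM
        if not feedback:                       # image already matches the target prompt
            break
        plan = plan_edits(feedback)            # regions that need modification
        image = regional_edit(image, plan)     # contour-based regional diffusion
    return image
```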
Let's take a look at the generated results in the three aspects above. First is attribute binding, comparing against SDXL, DALL·E 3, and LMD:
Across all three tests, only RPG accurately reflects what the prompts describe.
Next is numerical accuracy; the display order is the same as above (SDXL, DALL·E 3, LMD, RPG):
Unexpectedly, counting turns out to be quite difficult for text-to-image models, and RPG defeats its rivals with ease.
The last item is faithfully reproducing the complex relationships described in the prompt:
In addition, complementary regional diffusion can be extended into a hierarchical format, splitting a specific sub-region into smaller sub-regions.
As shown in the figure below, adding hierarchical levels of region splitting brings significant improvements in text-to-image generation. This offers a new perspective on complex generation tasks and makes images of arbitrary composition possible.
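Extending the illustrative `Region` from earlier, hierarchy just means a region may carry its own sub-plan and generation happens at the leaves; the `children` field is our own assumption about how one might encode this.

```python
from dataclasses import dataclass, field

@dataclass
class HierRegion(Region):
    children: list["HierRegion"] = field(default_factory=list)  # nested sub-regions

def leaves(region: "HierRegion"):
    """Depth-first traversal; sub-prompts are generated at the finest level."""
    if not region.children:
        yield region
    else:
        for child in region.children:
            yield from leaves(child)
```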