In common image editing operations, image composition refers to combining the foreground object of one image with another background image to generate a composite image. The visual effect is similar to transferring a foreground object from one image onto another background, as shown in the figure below.
Image composition is widely used in artistic creation, poster design, e-commerce, virtual reality, data augmentation, and many other fields.
A composite image obtained by simply cutting and pasting may suffer from many problems. In previous research, image composition was therefore split into different sub-tasks, each addressing one of these problems. Image blending aims to resolve the unnatural boundary between foreground and background. Image harmonization aims to adjust the illumination of the foreground so that it is harmonious with the background. Perspective adjustment aims to adjust the pose of the foreground so that it matches the background. Object placement aims to predict an appropriate location, size, and perspective angle for the foreground object. Shadow generation aims to generate plausible shadows for the foreground object on the background.
As shown in the figure below, previous work performs these sub-tasks in either a serial or a parallel manner to obtain realistic and natural composite images. In the serial framework, we can selectively execute some of the sub-tasks according to actual needs.
In the parallel framework, the currently popular approach is to use a diffusion model, which takes a background image with a foreground bounding box and a foreground object image as input and directly generates the final composite image. This allows the foreground object to be blended seamlessly into the background, with reasonable lighting and shadow effects and a pose adapted to the background.
The parallel framework is equivalent to executing multiple sub-tasks at the same time and cannot selectively execute only some of them. It is not controllable and may introduce unnecessary or unreasonable changes to the pose or color of the foreground object.
To enhance the controllability of the parallel framework and selectively perform some of the sub-tasks, we propose a controllable image composition model, ControlCom (Controllable Image Composition). As shown in the figure below, we use an indicator vector as conditioning information for the diffusion model to control the attributes of the foreground object in the composite image. The indicator vector is a two-dimensional binary vector, in which the two dimensions respectively control whether to adjust the illumination and the pose of the foreground object, where 1 means adjust and 0 means preserve. Specifically, (0,0) means neither the foreground illumination nor the foreground pose is changed; the object is simply blended seamlessly into the background image, which is equivalent to image blending. (1,0) means only the foreground illumination is changed to make it harmonious with the background while the foreground pose is preserved, which is equivalent to image harmonization. (0,1) means only the foreground pose is changed to match the background while the foreground illumination is preserved, which is equivalent to view synthesis. (1,1) means both the illumination and the pose of the foreground are changed, which is equivalent to the existing uncontrollable parallel image composition.
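To make the indicator-vector semantics concrete, here is a minimal illustrative sketch (hypothetical Python, not the released ControlCom API) that maps each two-dimensional binary vector to the corresponding sub-task:

```python
# Hypothetical illustration of the indicator-vector semantics described above;
# the actual ControlCom code may expose this differently.

# (adjust_illumination, adjust_pose): 1 = adjust the attribute, 0 = preserve it
TASKS = {
    (0, 0): "image blending",          # keep illumination and pose, only blend the boundary
    (1, 0): "image harmonization",     # adjust illumination, keep pose
    (0, 1): "view synthesis",          # keep illumination, adjust pose
    (1, 1): "full image composition",  # adjust both, like uncontrollable parallel methods
}

def describe(indicator):
    """Return the sub-task selected by a 2-D binary indicator vector."""
    adjust_illum, adjust_pose = indicator
    return (f"illumination {'adjusted' if adjust_illum else 'preserved'}, "
            f"pose {'adjusted' if adjust_pose else 'preserved'} -> "
            f"{TASKS[(adjust_illum, adjust_pose)]}")

print(describe((1, 0)))  # illumination adjusted, pose preserved -> image harmonization
```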
We incorporate the four tasks into the same framework and, through the indicator vector, implement a four-in-one "object portal" function that can transport objects to specified locations in a scene. This work is a collaboration between Shanghai Jiao Tong University and Ant Group. The code and model will be open-sourced soon.
Code model link: https://github.com/bcmi/ControlCom-Image-Composition
In the figure below, we demonstrate the functionality of controllable image composition.
In the right-hand column, the illumination of the foreground object is supposed to stay consistent with the background illumination. Previous methods may cause unexpected changes in the color of the foreground object, such as the vehicle and the clothing. Our method (the (0,1) version) is able to preserve the color of the foreground object while adjusting its pose so that it blends naturally into the background image.
Next, we show more results for the four versions of our method: (0,0), (1,0), (0,1), and (1,1). It can be seen that with different indicator vectors, our method can selectively adjust some attributes of the foreground object, effectively controlling the composite result and meeting different user needs.
So what model structure can realize these four functions? Our method adopts the structure below. The inputs to the model include a background image with a foreground bounding box and a foreground object image; the foreground features and the indicator vector are injected into the diffusion model.
We extract both global features and local features of the foreground object, fusing the global features first and then the local features. During local fusion, we use aligned foreground feature maps for feature modulation to better preserve details. Meanwhile, the indicator vector is used in both global fusion and local fusion to control the attributes of the foreground object more thoroughly.
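As a rough sketch of this two-stage fusion, the hypothetical PyTorch-style module below injects an indicator embedding in both stages, fuses a global foreground embedding via cross-attention, and then modulates the feature map with aligned local foreground features. All module names, shapes, and hyperparameters here are assumptions for illustration, not the released implementation (see the paper and repository for the real architecture).

```python
import torch
import torch.nn as nn

class ForegroundFusion(nn.Module):
    """Illustrative two-stage fusion of foreground features into a diffusion U-Net feature map.
    Names and shapes are assumptions, not the released ControlCom implementation."""

    def __init__(self, dim, fg_dim, ind_dim=2):
        super().__init__()
        # the 2-D indicator vector is embedded and injected in both fusion stages
        self.ind_embed = nn.Linear(ind_dim, dim)
        # global stage: cross-attention to a single global foreground embedding
        self.global_attn = nn.MultiheadAttention(dim, num_heads=8,
                                                 kdim=fg_dim, vdim=fg_dim,
                                                 batch_first=True)
        # local stage: predict scale/shift from aligned local foreground features
        self.to_scale_shift = nn.Conv2d(fg_dim, 2 * dim, kernel_size=1)

    def forward(self, unet_feat, fg_global, fg_local, indicator):
        # unet_feat:  (B, C, H, W) diffusion U-Net feature map
        # fg_global:  (B, 1, fg_dim) global foreground embedding
        # fg_local:   (B, fg_dim, H, W) local foreground features aligned to the bounding box
        # indicator:  (B, 2) binary vector controlling illumination / pose adjustment
        b, c, h, w = unet_feat.shape
        cond = self.ind_embed(indicator.float()).unsqueeze(1)           # (B, 1, C)

        # ---- global fusion: indicator-conditioned cross-attention ----
        q = unet_feat.flatten(2).transpose(1, 2) + cond                 # (B, H*W, C)
        fused, _ = self.global_attn(q, fg_global, fg_global)
        x = (q + fused).transpose(1, 2).reshape(b, c, h, w)

        # ---- local fusion: re-inject indicator, then modulate with aligned local maps ----
        x = x + cond.transpose(1, 2).reshape(b, c, 1, 1)
        scale, shift = self.to_scale_shift(fg_local).chunk(2, dim=1)    # each (B, C, H, W)
        x = x * (1 + scale) + shift                                     # detail-preserving modulation
        return x
```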
We build on pretrained Stable Diffusion and train the model on 1.9 million images from OpenImages. To train the four sub-tasks simultaneously, we designed a set of data processing and augmentation procedures; see the paper for details on the data and training.
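The paper contains the full data recipe; the snippet below is only a hedged sketch of the general idea, namely perturbing the ground-truth foreground crop according to a sampled indicator vector so that the model learns which attributes it is allowed to adjust. The specific augmentations (ColorJitter, RandomPerspective) and the sampling scheme are assumptions for illustration.

```python
import random
from PIL import Image
from torchvision import transforms

# Illustrative construction of one training foreground for a chosen indicator vector.
# The real pipeline (1.9M OpenImages photos, cropping/masking details) is in the paper.

color_aug = transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)
geom_aug = transforms.RandomPerspective(distortion_scale=0.3, p=1.0)

def make_training_foreground(fg_crop: Image.Image, indicator):
    """Given the ground-truth foreground crop from the real image, simulate the
    input foreground the model should transform back, according to the indicator."""
    adjust_illum, adjust_pose = indicator
    fg_input = fg_crop
    if adjust_illum:
        # illumination is to be adjusted, so feed a color-perturbed foreground
        fg_input = color_aug(fg_input)
    if adjust_pose:
        # pose is to be adjusted, so feed a geometrically perturbed foreground
        fg_input = geom_aug(fg_input)
    return fg_input

# During training, an indicator vector can be sampled per example so that all four
# sub-tasks ((0,0), (1,0), (0,1), (1,1)) are covered by the same model.
indicator = random.choice([(0, 0), (1, 0), (0, 1), (1, 1)])
```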
We evaluate on the COCOEE dataset and a dataset we built ourselves. Since previous methods can only perform uncontrollable image composition, we compare them against the (1,1) version of our method. The comparison results are shown in the figure below. PCTNet is an image harmonization method that preserves object details but cannot adjust the foreground pose or complete the foreground object. Other methods can generate objects of the same category but are less effective at retaining details, such as the style of the clothes, the texture of the cup, and the color of the bird's feathers.
In comparison, our method better preserves the details of the foreground object, completes incomplete foreground objects, and adjusts the illumination and pose of the foreground object to fit the background.
This work is a first attempt at controllable image composition. The task is very difficult and many shortcomings remain: the performance of the model is not yet stable and robust enough. In addition, beyond illumination and pose, the attributes of foreground objects can be refined further; achieving finer-grained controllable image composition is an even more challenging task.
References
[1] Yang, B., Gu, S., Zhang, B., Zhang, T., Chen, X., Sun, X., Chen, D., Wen, F. (2023). Paint by Example: Exemplar-based Image Editing with Diffusion Models. In CVPR.
[2] Song, Y., Zhang, Z., Lin, Z., Cohen, S., Price, B., Zhang, J., Kim, S. Y., Aliaga, D. (2023). ObjectStitch: Object Compositing with Diffusion Models. In CVPR.