It’s already 2022, but most computer vision tasks still focus only on image perception. For example, image classification only requires the model to identify the object categories in an image. Although tasks such as object detection and image segmentation further require locating objects, they are still not enough to show that the model has achieved a comprehensive, in-depth understanding of the scene.
Figure 1 below is an example. If a computer vision model only detects the people, elephants, fences, trees, and so on in the picture, we usually would not say that the model has understood the picture, nor can the model make higher-level decisions based on that understanding, such as issuing a "no feeding" warning.
Figure 1: Example image
In fact, in many real-world AI scenarios such as smart cities, autonomous driving, and smart manufacturing, beyond locating targets in the scene, we usually also expect the model to reason about and predict the relationships between the subjects in the image. For example, in autonomous driving, a self-driving car needs to analyze whether a pedestrian at the roadside is pushing a cart or riding a bicycle; depending on the situation, the subsequent decisions may differ.
In a smart factory scenario, judging whether an operator is working safely and correctly also requires the monitoring-side model to understand the relationships between subjects. Most existing methods manually set hard-coded rules, which leaves the model with poor generalization and makes it difficult to adapt to other specific situations.
The scene graph generation (SGG) task is designed to solve the above problems. In addition to classifying and locating target objects, the SGG task also requires the model to predict the relationships between objects (see Figure 2).
Figure 2: Scene graph generation
Datasets for the traditional scene graph generation task typically provide bounding-box annotations of objects and annotations of the relationships between those bounding boxes. However, this setup has several inherent flaws:
(1) Bounding boxes cannot accurately locate objects: as shown in Figure 2, a bounding box around a person inevitably contains the objects around that person;
(2) The background cannot be annotated: as shown in Figure 2, the trees behind the elephant are marked with a bounding box that covers almost the entire image, so relationships involving the background cannot be accurately annotated. This also makes it impossible for the scene graph to fully cover the image and achieve comprehensive scene understanding.
Therefore, the authors propose the panoptic scene graph generation (PSG) task, together with a finely annotated, large-scale PSG dataset.
Figure 3: Panoptic scene graph generation
As shown in Figure 3, this task uses panoptic segmentation to comprehensively and accurately locate objects and the background, thereby addressing the inherent shortcomings of the scene graph generation task and pushing the field toward comprehensive, deep scene understanding.
Paper information
Paper link: https://arxiv.org/abs/2207.11247
Project page: https://psgdataset.org/
OpenPSG codebase: https://github.com/Jingkang50/OpenPSG
Competition link: https://www.cvmart.net/race/10349/base
ECCV'22 SenseHuman Workshop link: https://sense-human.github.io/
HuggingFace demo link: https://huggingface.co/spaces/ECCV2022/PSG
The PSG dataset proposed by the authors contains nearly 50,000 COCO images and, building on COCO's existing panoptic segmentation annotations, annotates the relationships between segments. The authors carefully define 56 relationships, including positional relations (over, in front of, etc.), common object-object relations (hanging from, etc.), common biological actions (walking on, standing on, etc.), human behaviors (cooking, etc.), relations in traffic scenes (driving, riding, etc.), relations in sports scenes (kicking, etc.), and relations between background regions (enclosing, etc.). Annotators were required to use precise verbs rather than vaguer expressions, and to annotate the relationships in each image as completely as possible.
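To make this concrete, below is a minimal, hypothetical sketch (in Python) of what a single PSG-style annotation entry could look like: panoptic segments covering both objects and background, plus relation triplets between segments. The field names and values are illustrative assumptions, not the actual OpenPSG data format.

```python
# Hypothetical sketch of one PSG-style annotation entry; field names are
# illustrative only -- consult the OpenPSG repository for the real format.
example_annotation = {
    "file_name": "coco/val2017/000000123456.jpg",
    # panoptic segments: foreground objects and background "stuff" are all annotated
    "segments_info": [
        {"id": 0, "category": "person"},
        {"id": 1, "category": "elephant"},
        {"id": 2, "category": "tree-merged"},   # background region
    ],
    # relations are (subject segment index, object segment index, predicate) triplets
    "relations": [
        (0, 1, "feeding"),
        (1, 2, "in front of"),
    ],
}

# iterate over the scene graph triplets
for s, o, predicate in example_annotation["relations"]:
    subj = example_annotation["segments_info"][s]["category"]
    obj = example_annotation["segments_info"][o]["category"]
    print(f"{subj} --{predicate}--> {obj}")
```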
PSG model results showcase
Task advantages
The authors further illustrate the advantages of the panoptic scene graph generation (PSG) task with the example below:
The left image comes from Visual Genome (VG-150), a traditional dataset for the SGG task. It can be seen that annotations based on detection boxes are usually inaccurate: the pixels covered by a detection box cannot accurately locate objects, especially backgrounds such as chairs and trees. At the same time, relationship annotation based on detection boxes tends to label trivial relations such as "person has head" and "person wears clothes".
In contrast, the PSG task shown in the right image provides a more comprehensive (covering interactions between foreground and background), cleaner (appropriate object granularity), and more accurate (pixel-level) scene graph representation, advancing the field of scene understanding.
Two major types of PSG models
To support the proposed PSG task, the authors built an open-source codebase, OpenPSG, which implements four two-stage methods and two single-stage methods, making it convenient for everyone to develop, use, and analyze.
The two-stage methods use Panoptic-FPN in the first stage to perform panoptic segmentation of the image.
Next, the features of the objects obtained from panoptic segmentation and the fused features of each object pair are extracted and fed into the second stage for relation prediction. The framework integrates and reproduces the classic scene graph generation methods IMP, VCTree, Motifs, and GPSNet.
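For a concrete picture of the second stage, here is a minimal PyTorch sketch of a pairwise relation head. It assumes the first stage has already pooled each segmented object into a feature vector; the layer choices and dimensions are illustrative assumptions, not the OpenPSG implementation.

```python
import torch
import torch.nn as nn

class PairwiseRelationHead(nn.Module):
    """Minimal sketch of a second-stage relation predictor (illustrative only)."""
    def __init__(self, feat_dim: int = 256, num_predicates: int = 56):
        super().__init__()
        # fuse subject and object features by concatenation, then classify the predicate
        self.fuse = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, num_predicates + 1),  # +1 for "no relation"
        )

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        n = obj_feats.size(0)
        # build a fused feature for every ordered (subject, object) pair
        subj = obj_feats.unsqueeze(1).expand(n, n, -1)
        obj = obj_feats.unsqueeze(0).expand(n, n, -1)
        pair_feats = torch.cat([subj, obj], dim=-1)   # (n, n, 2 * feat_dim)
        return self.fuse(pair_feats)                  # (n, n, num_predicates + 1)

# toy usage: 5 segmented objects, each with a 256-d feature vector
logits = PairwiseRelationHead()(torch.randn(5, 256))
print(logits.shape)  # torch.Size([5, 5, 57])
```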
PSGTR is a single-stage method based on DETR. In a), the model first extracts image features through a convolutional neural network backbone and adds positional encodings as the encoder input, while initializing a set of queries that each represent a triplet. Similar to DETR, in b) the model feeds the encoder output as keys and values, together with the triplet queries, into the decoder for cross-attention. Then in c), each decoded query is fed into the prediction heads corresponding to the subject-predicate-object triplet, finally producing the triplet prediction results.
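As a rough illustration of step c), the sketch below maps each decoded triplet query to subject / predicate / object class predictions. The head design and class counts are simplifying assumptions, and mask prediction is omitted, so this is not the exact PSGTR head.

```python
import torch
import torch.nn as nn

class TripletQueryHeads(nn.Module):
    """Illustrative prediction heads for decoded triplet queries (not the real PSGTR heads)."""
    def __init__(self, d_model: int = 256, num_classes: int = 133, num_predicates: int = 56):
        super().__init__()
        self.subject_head = nn.Linear(d_model, num_classes + 1)    # +1 for "no object"
        self.predicate_head = nn.Linear(d_model, num_predicates + 1)
        self.object_head = nn.Linear(d_model, num_classes + 1)

    def forward(self, decoded_queries: torch.Tensor):
        # decoded_queries: (num_queries, d_model), output of the DETR-style decoder
        return (
            self.subject_head(decoded_queries),
            self.predicate_head(decoded_queries),
            self.object_head(decoded_queries),
        )

# toy usage: 100 triplet queries decoded against the image features
s, p, o = TripletQueryHeads()(torch.randn(100, 256))
print(s.shape, p.shape, o.shape)
```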
PSGFormer is a single-stage method based on a dual-decoder DETR. In a), the model extracts image features through a CNN, adds positional encodings as input to the encoder, and initializes two sets of queries representing objects and relations respectively. Then in b), based on the image information from the encoder, the model learns object queries and relation queries via cross-attention in the object decoder and the relation decoder respectively.
After both types of queries are learned, they are matched in c) through a mapping to obtain paired triplet queries. Finally, in d), predictions for the object queries and relation queries are completed by the prediction heads, and the final triplet prediction results are obtained from the matching results in c).
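The matching in step c) can be pictured with the simplified sketch below, which pairs each relation query with a subject and an object query by projected similarity. The real PSGFormer matching is learned end-to-end, so the projections and the argmax selection here are illustrative assumptions.

```python
import torch
import torch.nn as nn

def match_relations_to_objects(rel_queries: torch.Tensor,
                               obj_queries: torch.Tensor,
                               subj_proj: nn.Linear,
                               obj_proj: nn.Linear):
    """Illustrative matching of relation queries to object queries (not the trained
    PSGFormer mechanism): project each relation query into a 'subject space' and an
    'object space', then pick the most similar object query for each role."""
    subj_sim = subj_proj(rel_queries) @ obj_queries.t()   # (num_rel, num_obj)
    obj_sim = obj_proj(rel_queries) @ obj_queries.t()
    subject_idx = subj_sim.argmax(dim=-1)   # which object query plays the subject
    object_idx = obj_sim.argmax(dim=-1)     # which object query plays the object
    return subject_idx, object_idx

# toy usage: 100 relation queries and 100 object queries, 256-d embeddings
rel_q, obj_q = torch.randn(100, 256), torch.randn(100, 256)
subj_idx, obj_idx = match_relations_to_objects(
    rel_q, obj_q, nn.Linear(256, 256), nn.Linear(256, 256))
print(subj_idx.shape, obj_idx.shape)
```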
PSGTR and PSGFormer are both models extended and improved from DETR. The difference is that PSGTR uses one set of queries to directly model triplets, while PSGFormer uses two sets of queries to model objects and relations separately. Both methods have their own pros and cons; for details, please refer to the experimental results in the paper.
Most methods that are effective on the SGG task remain effective on the PSG task. However, some methods that rely on strong statistical priors of the dataset, or on priors about the predicate direction between subject and object, are not as effective. This may be because the bias of the PSG dataset is less severe than that of the traditional VG dataset, and the predicate verbs are more clearly defined and learnable. The authors therefore hope that subsequent methods will focus on extracting visual information and understanding the image itself; statistical priors may help chase benchmark numbers, but they are not essential.
Compared with the two-stage models, the single-stage models can currently achieve better results. This may be because the relation supervision signal in a single-stage model can be propagated directly to the feature map, so the relation signal participates in more of the model's learning, which benefits relation capture. However, since this paper only proposes several baseline models and does not specifically optimize either single-stage or two-stage models, it cannot be concluded that single-stage models are necessarily stronger than two-stage models. The authors hope contestants will continue to explore this.
Compared with the traditional SGG task, the PSG task grounds relations on the panoptic segmentation map and requires confirming the IDs of the subject and object in each relation. Whereas two-stage models directly predict the panoptic segmentation map and thus obtain object IDs, single-stage models need to complete this step through a series of post-processing. If existing single-stage models are to be further improved, how to more effectively confirm object IDs and generate better panoptic segmentation maps within a single-stage model remains a topic worth exploring.
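One naive version of such post-processing is sketched below: per-query binary masks are painted onto a single panoptic ID map in order of confidence, so each pixel receives at most one object ID. This is an illustrative assumption of how the step could work, not the actual OpenPSG post-processing, and it omits merging duplicate detections of the same object across triplets.

```python
import numpy as np

def merge_masks_to_panoptic(masks: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Paint per-query binary masks onto one panoptic ID map, highest score first,
    so every pixel ends up with at most one object ID (illustrative sketch only)."""
    h, w = masks.shape[1:]
    panoptic = np.zeros((h, w), dtype=np.int32)      # 0 = unassigned / background
    for obj_id in np.argsort(-scores):               # highest-scoring prediction first
        free = (panoptic == 0) & masks[obj_id].astype(bool)
        panoptic[free] = obj_id + 1                  # object IDs start from 1
    return panoptic

# toy usage: 3 predicted masks on a 4x4 image
toy_masks = np.random.rand(3, 4, 4) > 0.5
print(merge_masks_to_panoptic(toy_masks, np.array([0.9, 0.5, 0.7])))
```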
Finally, everyone is welcome to try the HuggingFace demo:
Demo: https://huggingface.co/spaces/ECCV2022/PSG
The recently popular text-to-image generative models (such as DALL-E 2) are really impressive, but some research shows that these models may simply glue together the entities mentioned in the text, without truly understanding the spatial relationships the text expresses. As shown below, although the input is "cup on spoon", the generated pictures still show "spoon on cup".
Coincidentally, the PSG dataset annotates mask-based scene graph relations. The authors can use scene graphs and panoptic segmentation masks as training pairs to obtain a text2mask model, and then generate more detailed pictures from the masks. The PSG dataset may therefore also provide a potential solution for relation-focused image generation.
P.S. The "PSG Challenge", which aims to encourage the field to jointly explore comprehensive scene recognition, is in full swing. Millions of prizes are waiting for you! Competition Link: https://www.cvmart.net/race/10349/base