The AIxiv column is where this site publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work you would like to share, please feel free to contribute or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
The authors of this paper are all from Professor Li Xi's team at Zhejiang University. The first author is doctoral student Su Wei, and the corresponding author is Professor Li Xi (IET Fellow, National Distinguished Young Scholar). In recent years, Professor Li Xi's team has published more than 180 CV/AIGC research works in authoritative international journals (such as TPAMI and IJCV) and at top international academic conferences (such as ICCV, CVPR, and ECCV), and cooperates extensively with well-known universities and research institutions at home and abroad.

As a fundamental vision-language task, referring expression comprehension (REC) localizes, within an image, the target referred to by a natural-language description. A REC model usually consists of three parts: a visual encoder, a text encoder, and a cross-modal interaction module, which respectively extract visual features, extract text features, and perform cross-modal feature interaction and enhancement. Most current research focuses on designing efficient cross-modal interaction modules to improve accuracy, and pays little attention to the visual encoder. A common approach is to use a feature extractor pre-trained on classification or detection tasks, such as ResNet, DarkNet, Swin Transformer, or ViT. These models traverse all spatial locations of the image, extracting features via sliding windows or divided patches, so their computational cost grows rapidly with image resolution; this is especially pronounced in Transformer-based models. Because images are spatially redundant, they contain many low-information background regions and regions unrelated to the referring expression. Extracting features from these regions in the same way as from the rest increases computation without contributing to effective feature extraction.
A more efficient approach is to predict in advance the text relevance and content richness of each image region, extracting features fully from text-related foreground regions and only coarsely from background regions. For this region prediction, an intuitive strategy is to use an image pyramid: identify background regions early, in the coarse image at the top of the pyramid, and then progressively add high-resolution, fine-grained foreground regions. Based on this analysis, we propose ScanFormer, a coarse-to-fine iterative perception framework. It scans the image pyramid layer by layer, starting from the low-resolution coarse image, and gradually filters out regions irrelevant to the referring expression as well as background regions, reducing wasted computation and letting the model focus on foreground, task-relevant regions.
- Paper title: ScanFormer: Referring Expression Comprehension by Iteratively Scanning
- Paper link: https://arxiv.org/pdf/2406.18048
1. Coarse-to-fine iterative perception framework. To simplify the structure, we adopt ViLT [1], a model that unifies the text and visual modalities, and split it along the depth dimension into two parts, Encoder1 and Encoder2, used for different tasks.
First, the text features are extracted and stored in a KV cache. Then an image pyramid is constructed and iterated over from the top down. In each iteration, the patches selected at the current scale are fed in, and Encoder1 predicts, for each patch, which fine-grained patches to select at the next scale. In particular, all patches of the top-level image are selected, ensuring that the model obtains coarse-grained full-image information. Encoder2 then further extracts features and predicts the bounding box at the current scale from that scale's [cls] token.
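The loop above can be sketched as a runnable toy in Python. Everything here (the patch scoring, the quantile-based top-fraction rule, the box head) is a stand-in for the real Encoder1/Encoder2; only the control flow — select all top-level patches, refine the selection scale by scale, predict a box at every scale — follows the description:

```python
import numpy as np

def scan(score_pyramid, keep_ratio=0.5):
    """Toy coarse-to-fine scan over a pyramid of per-patch score grids.

    score_pyramid: list of 2D arrays, coarsest first, each 2x finer than
    the previous. Returns a box per scale and the number of patches
    actually processed at each scale.
    """
    mask = np.ones_like(score_pyramid[0], dtype=bool)  # top scale: keep all
    boxes, kept = [], []
    for s, scores in enumerate(score_pyramid):
        kept.append(int(mask.sum()))
        # "Encoder1": among currently selected patches, keep the top
        # fraction and carry that choice to the next, 2x finer scale.
        thresh = np.quantile(scores[mask], 1 - keep_ratio)
        fine_mask = mask & (scores >= thresh)
        if s + 1 < len(score_pyramid):
            mask = np.kron(fine_mask, np.ones((2, 2), dtype=bool))
        # "Encoder2": toy box prediction from the selected patches
        ys, xs = np.nonzero(fine_mask)
        boxes.append((int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())))
    return boxes, kept

rng = np.random.default_rng(0)
pyr = [rng.random((4 * 2**s, 4 * 2**s)) for s in range(3)]  # 4x4, 8x8, 16x16
boxes, kept = scan(pyr)
print(kept)  # [16, 32, 64]: with keep_ratio=0.5, far fewer than 16+64+256
```

With half the patches dropped at each step, the fine scales process a small fraction of their full patch count, which is exactly where a real model spends most of its compute.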
Meanwhile, the intermediate features of Encoder1 and Encoder2 are stored in the KV cache for reuse at subsequent scales. As the scale increases, fine-grained features are introduced and position prediction becomes more accurate, while most irrelevant patches are discarded, saving a large amount of computation.
In addition, the patches within each scale attend to each other bidirectionally, and also attend to all patch and text features of the previous scales. This causal attention across scales further reduces computational cost.
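This attention pattern can be made concrete as a boolean mask: full (bidirectional) attention within a scale's block, causal attention across blocks, so a patch sees all text tokens and all coarser-scale patches but never finer ones. The token ordering and block layout below are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def scale_causal_mask(lengths):
    """Build the cross-scale attention mask.

    lengths[0] is the text length; lengths[1:] are patch counts per scale,
    coarsest first. mask[i, j] == True means token i may attend to token j.
    """
    total = sum(lengths)
    mask = np.zeros((total, total), dtype=bool)
    starts = np.cumsum([0] + list(lengths))
    for a, b in zip(starts[:-1], starts[1:]):
        # Each block attends to itself (bidirectional) and to all earlier blocks.
        mask[a:b, :b] = True
    return mask

m = scale_causal_mask([4, 9, 6])   # 4 text tokens, 9 coarse, 6 fine patches
assert m[5, 6] and m[6, 5]         # within-scale attention is bidirectional
assert m[13, 0] and not m[5, 13]   # fine sees text; coarse never sees fine
```

Because coarse tokens never attend to fine ones, their cached keys/values stay valid across iterations, which is what makes the KV-cache reuse above possible.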
2. Dynamic patch selection. Whether each patch is selected is determined by a selection factor generated at the previous scale. There are two candidate places to apply this factor. The first is in all heads of every MHSA layer in the encoder; however, for an encoder with N layers and H heads, it is difficult to obtain effective gradient information for updating, so the learned selection factors are unsatisfactory. The second is to apply it directly to the encoder input, i.e., the patch embedding; since the factor is used in only this one place, it is easier to learn, and this paper ultimately adopts this scheme.
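A minimal sketch of the adopted scheme: gate each patch embedding by its selection factor, with unselected patches falling back to a constant token (learnable in the real model). The hard 0.5 threshold and the function names here are illustrative assumptions:

```python
import numpy as np

def gate_patch_embeddings(embeddings, factors, const_token):
    """Replace unselected patch embeddings with a constant token.

    Toy version: a hard threshold on the selection factor decides which
    patches keep their own embedding.
    """
    f = (factors > 0.5).astype(embeddings.dtype)[:, None]  # 1 = selected
    return f * embeddings + (1.0 - f) * const_token

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))            # 4 patches, embedding dim 8
const = rng.normal(size=8)               # stands in for the learnable constant token
factors = np.array([0.9, 0.2, 0.7, 0.1]) # per-patch selection factors
out = gate_patch_embeddings(emb, factors, const)
print(np.allclose(out[0], emb[0]), np.allclose(out[1], const))  # True True
```

Applying the factor only at the input keeps a single, direct gradient path to the selection factor, which is the stated reason this placement learns better than gating every attention head.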
Also note that even if a patch embedding is set to 0, because of MHSA and FFN the patch's features in subsequent layers still become non-zero and affect the features of the remaining patches. Fortunately, when a token sequence contains many identical tokens, the MHSA computation can be simplified, yielding actual inference acceleration. Moreover, to increase the model's flexibility, this paper does not set patch embeddings directly to 0 but replaces them with a learnable constant token; the patch-selection problem is thus converted into a patch-replacement problem. Patch selection then decomposes into two steps: constant-token replacement and token merging. All unselected patches are replaced with the same constant token, and since these tokens are identical, under scaled dot-product attention they can be merged into a single token whose contribution is multiplied by their count. The attention result is unchanged, and common acceleration methods remain applicable.
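The merging trick can be checked numerically: attention over a sequence containing several copies of the constant token equals attention over a merged sequence in which the single constant token's logit is offset by the log of the copy count (multiplying its softmax weight by the count). This is a toy verification of the equivalence, not the paper's code:

```python
import numpy as np

def attention(q, k, v, logit_bias=None):
    """Plain scaled dot-product attention with an optional per-key logit bias."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if logit_bias is not None:
        scores = scores + logit_bias
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=(2, d))
keys, vals = rng.normal(size=(5, d)), rng.normal(size=(5, d))
const = rng.normal(size=d)  # the shared constant token

# Full sequence: 5 real tokens + 3 copies of the constant token
k_full = np.vstack([keys, const, const, const])
v_full = np.vstack([vals, const, const, const])
out_full = attention(q, k_full, v_full)

# Merged sequence: 5 real tokens + 1 constant token with a log(3) logit bias
k_merged = np.vstack([keys, const])
v_merged = np.vstack([vals, const])
bias = np.array([0.0] * 5 + [np.log(3.0)])
out_merged = attention(q, k_merged, v_merged, bias)

print(np.allclose(out_full, out_merged))  # True
```

Since identical keys receive identical softmax weights, collapsing them shrinks the effective sequence length, which is where the real inference speedup comes from.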
The method achieves performance comparable to the state of the art on four datasets: RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame. With pre-training on large-scale datasets followed by fine-tuning on the specific datasets, performance improves substantially further, reaching results similar to those of pre-trained models such as MDETR [2] and OFA [3].
In terms of inference speed, the proposed method achieves real-time inference while maintaining high task accuracy.
The experiments also report statistics on the model's patch selection and the distribution of localization accuracy across scales (scale1 and scale2). As shown in the left figure, as the scale increases and fine-grained image features are added, the model's accuracy gradually improves. One could therefore add an early-exit mechanism that stops as soon as localization accuracy is sufficient, avoiding further computation on high-resolution images and adaptively choosing an appropriate resolution per sample. This paper made some preliminary attempts in this direction, including adding prediction branches for IoU, GIoU, and uncertainty as early-exit indicators, but the results were not ideal; how to design suitable and accurate early-exit indicators remains to be explored. The right figure shows patch selection at different scales. At every scale only a relatively small share of patches is selected, and most patches can be eliminated, effectively saving computational resources. For each sample (an image plus a referring expression), the number of patches actually selected is relatively small, around 65% of the total.
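For illustration, an early-exit rule of the kind attempted could look like the following sketch. The confidence head and threshold are hypothetical, and the paper reports that such indicators were not yet reliable, so this shows the mechanism only:

```python
def scan_with_early_exit(scales, predict, tau=0.8):
    """Stop scanning once a self-estimated quality score clears tau.

    `predict(scale)` stands in for running the model at one pyramid scale and
    returning (box, confidence), e.g. from an IoU-regression branch.
    """
    box = None
    for s in scales:
        box, conf = predict(s)
        if conf >= tau:      # good enough: skip remaining, costlier scales
            break
    return box

# Toy predictor whose confidence grows with scale
toy = {0: ((1, 1, 5, 5), 0.7), 1: ((2, 2, 4, 4), 0.85), 2: ((2, 2, 3, 3), 0.95)}
result = scan_with_early_exit([0, 1, 2], lambda s: toy[s])
print(result)  # (2, 2, 4, 4) -- exits at scale 1, never runs scale 2
```

The hard part, as the paper notes, is not this control flow but making the confidence estimate trustworthy enough that exiting early does not cost accuracy.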
Finally, the experiments show some visualization results. As the scale increases (red → green → blue), the model's localization accuracy gradually improves. Moreover, images reconstructed from the selected patches show that the model attends only to coarse-scale information in background regions, while for the relevant foreground regions it attends to fine-grained detail.
[1] Kim W, Son B, Kim I. ViLT: Vision-and-language transformer without convolution or region supervision[C]//International Conference on Machine Learning. PMLR, 2021: 5583-5594.
[2] Kamath A, Singh M, LeCun Y, et al. MDETR: Modulated detection for end-to-end multi-modal understanding[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 1780-1790.
[3] Wang P, Yang A, Men R, et al. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework[C]//International Conference on Machine Learning. PMLR, 2022: 23318-23340.