To achieve high-precision region-level multi-modal understanding, this paper proposes a dynamic-resolution scheme that simulates the human visual cognitive system.
The authors of this article are from the LAMP Laboratory of the University of Chinese Academy of Sciences. First author Zhao Yuzhong is a doctoral student who enrolled in 2023, and co-author Liu Feng is a direct-track doctoral student who enrolled in 2020. Their main research directions are vision-language models and visual object perception.
DynRefer significantly improves region-level multi-modal recognition by simulating the human visual cognitive process. By introducing the dynamic-resolution mechanism of the human eye, DynRefer can simultaneously perform region recognition, region attribute detection, and region-level captioning with a single model, achieving SOTA performance on all of these tasks. In particular, it reaches 115.7 CIDEr on the region-level captioning task of the RefCOCOg dataset, significantly higher than CVPR 2024 methods such as RegionGPT, GlaMM, Osprey, and Alpha-CLIP.
- Paper title: DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution
- Paper link: https://arxiv.org/abs/2405.16071
- Paper code: https://github.com/callsys/DynRefer
The region-level multi-modal task aims to convert specified image regions into language descriptions consistent with human preferences. Humans exhibit resolution-adaptive ability when performing region-level multi-modal tasks: the area of interest is perceived at high resolution, while non-attended areas are perceived at low resolution. Current region-level multi-modal large language models, however, usually adopt a fixed-resolution encoding scheme, i.e., encoding the entire image and then extracting region features via RoI Align. This approach lacks the resolution adaptivity of the human visual cognitive system and offers low encoding efficiency and quality for the area of interest. To achieve high-precision region-level multi-modal understanding, we propose a dynamic-resolution scheme that simulates the human visual cognitive system, as shown in the figure below.
Figure 1: Comparison of the traditional region-level multi-modal method (left) and the DynRefer method (right).
1. Simulated dynamic-resolution image (multi-view construction). Since mainstream pre-trained vision-language models (e.g., CLIP) only accept inputs of uniform resolution, we simulate a dynamic-resolution image by constructing multiple uniform-resolution views: high resolution in the referred region and low resolution in the non-referred region. The process is shown in Figure 2. The original image x is cropped and resized into multiple candidate views. The crop box of each view interpolates between the bounding box of the referred region and the whole image, b_i = b + t_i·(b_img − b), where b is the bounding box of the referred region, b_img is the box covering the entire image, and t_i ∈ [0, 1] is the interpolation coefficient. During training, we randomly sample n views from the candidates to simulate the images generated by gaze and rapid eye movements; these n views correspond to interpolation coefficients t_1, ..., t_n. We always retain the view containing only the referred region (i.e., t = 0). This view has been experimentally shown to help preserve regional details, which is crucial for all region-level multi-modal tasks.
Figure 2: DynRefer training (top) and inference (bottom).
2. Stochastic multi-view embedding. The process is shown in Figure 3. The n sampled views are encoded into spatial features by a frozen CLIP encoder and then processed by an RoI-Align module to obtain region embeddings r_i (left side of Figure 3). These region embeddings are not spatially aligned, due to the spatial errors introduced by cropping, resizing, and RoI Align. Inspired by deformable convolution, we propose an alignment module that reduces this bias by aligning each r_i to r_1, where r_1 is the region embedding of the view containing only the referred region. Each r_i is first concatenated with r_1, and a 2D offset map is computed by a convolutional layer; the spatial features of r_i are then resampled according to these offsets. Finally, the aligned region embeddings are concatenated along the channel dimension and fused by linear layers. The output is further compressed by a visual resampling module (a Q-Former), which extracts the region representation of the referred region of the original image x (Figure 3).
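The crop-box interpolation above can be sketched in a few lines. This is a minimal illustration, not the paper's code; the function name and the convention that t = 0 yields the reference box and t = 1 the full image are assumptions consistent with the description above.

```python
import numpy as np

def candidate_view_boxes(ref_box, image_size, ts):
    """Interpolate crop boxes between the reference box (t=0) and the
    full image (t=1), one box per interpolation coefficient t.

    ref_box: (x0, y0, x1, y1) of the referred region.
    image_size: (width, height) of the whole image.
    """
    x0, y0, x1, y1 = ref_box
    w, h = image_size
    boxes = []
    for t in ts:
        boxes.append((
            x0 * (1 - t),        # left edge moves toward 0
            y0 * (1 - t),        # top edge moves toward 0
            x1 + t * (w - x1),   # right edge moves toward the image border
            y1 + t * (h - y1),   # bottom edge moves toward the image border
        ))
    return boxes

# A reference region in the centre of a 640x480 image, three candidate views.
views = candidate_view_boxes((200, 150, 400, 350), (640, 480), ts=[0.0, 0.5, 1.0])
print(views)  # t=0 -> the reference box itself, t=1 -> the whole image
```

Each resulting box would then be cropped from the image and resized to the encoder's fixed input resolution.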
Figure 3: DynRefer network structure.
3. Vision-language alignment. The region representation computed by the stochastic multi-view embedding module is decoded by three decoders, as shown in Figure 3 (right), each supervised by a multi-modal task: i) Image region tag generation. We employ a lightweight query-based recognition decoder for region tag generation (Figure 3, right). Tagging is performed by computing the confidence of each predefined tag, with the tags as queries and the region features as keys and values. Tags parsed from the ground-truth captions supervise the recognition decoder. ii) Region-text contrastive learning. Similar to the region tag decoder, this decoder is also query-based; it computes similarity scores between captions and region features and is supervised with the SigLIP loss. iii) Language modeling. We use a pre-trained large language model to convert the region representation into a language description.
Figure 4: Performance of the dual-view (n=2) DynRefer model on region-level multi-modal tasks under different interpolation coefficients t. View one is fixed (t = 0); view two is either randomly selected or fixed.
4. During inference, the trained DynRefer model performs multi-modal tasks on images with dynamic resolution. By adjusting the interpolation coefficients of the n sampled views, we obtain region representations with dynamic-resolution characteristics. To evaluate the properties at different dynamic resolutions, we trained a dual-view (n=2) DynRefer model and evaluated it on four multi-modal tasks. As the curves in Figure 4 show, attribute detection performs better with a view containing no contextual information (t = 0), which can be explained by the fact that this task usually requires detailed regional information. Region-level captioning and dense captioning instead require a context-rich view (larger t) to fully understand the referred region. Note that views with too much context (t close to 1) degrade performance on all tasks, as they introduce too much region-irrelevant information. When the task type is known, we can sample appropriate views according to the task's characteristics. When the task type is unknown, we first construct a set of candidate views under different interpolation coefficients t, then sample n views from the candidate set via a greedy search algorithm. The objective of the search is
max Σ_i ( pHASH(x) ⊕ pHASH(v_i) ) / t_i,
where t_i is the interpolation coefficient of the i-th view, v_i is the i-th view, pHASH(·) is the perceptual image hash function, and ⊕ is the XOR operation. To compare the information content of views from a global perspective, pHASH(·) converts each view from the spatial domain to the frequency domain and encodes it into a hash code. The 1/t_i term down-weights context-rich views to avoid introducing too much redundant information.
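The view-selection step can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: a simple average hash replaces the true frequency-domain pHASH, the reference-only view (t = 0) is assumed to be kept fixed and excluded from the search, and because the objective is a sum of independent per-view scores, the greedy search reduces to picking the top-n scores.

```python
import numpy as np

def avg_hash(img, hash_size=8):
    """Tiny stand-in for pHASH: block-average the image down to
    hash_size x hash_size, then threshold against the mean to get bits."""
    h, w = img.shape
    img = img[: h - h % hash_size, : w - w % hash_size]
    small = img.reshape(hash_size, img.shape[0] // hash_size,
                        hash_size, img.shape[1] // hash_size).mean(axis=(1, 3))
    return (small > small.mean()).ravel()

def greedy_select(image, crops, ts, n):
    """Pick n views maximizing the hash distance to the full image,
    down-weighted by 1/t to penalize context-rich (large-t) views."""
    href = avg_hash(image)
    # XOR of hash bits, summed = Hamming distance; divide by t as in the objective.
    scores = [np.logical_xor(href, avg_hash(c)).sum() / t for c, t in zip(crops, ts)]
    order = np.argsort(scores)[::-1]          # best-scoring views first
    return sorted(order[:n].tolist())

rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64))
# Candidate context views (t > 0); the reference-only view (t = 0) is kept fixed.
ts = [0.25, 0.5, 0.75, 1.0]
crops = [image[24:40, 24:40], image[16:48, 16:48], image[8:56, 8:56], image]
picked = greedy_select(image, crops, ts, n=2)
print(picked)
```

In practice each crop would be resized to a common resolution before hashing; the hash simply serves as a cheap global summary of each view's content.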
On the region-level captioning task, DynRefer uses a smaller model (4.2B vs. 7B) yet significantly surpasses many CVPR 2024 methods, such as RegionGPT, GlaMM, Alpha-CLIP, and Osprey, on both the METEOR and CIDEr metrics on the RefCOCOg and VG datasets, demonstrating DynRefer's large performance advantage.
On the dense captioning task, DynRefer improves mAP by 7.1% over the previous SOTA method GRiT on the VG 1.2 dataset.
Open Vocabulary Attribute Detection
On the region attribute detection task, DynRefer also achieves SOTA performance.
Open Vocabulary Region Recognition
On the region recognition task, DynRefer improves mAP by 15% and accuracy by 8.8% over RegionGPT (CVPR 2024), and is 15.7% higher in mAP than ASM (ICLR 2024).
- Rows 1-6: random dynamic multi-view outperforms a fixed view.
- Rows 6-10: selecting views by maximizing information outperforms random selection.
- Rows 10-13: multi-task training learns better region representations.
The following images show DynRefer's inference results. With a single model, DynRefer simultaneously outputs region captions, tags, attributes, and categories.