Object tracking is a fundamental task in computer vision. In recent years, single-modality (RGB) object tracking has made significant progress. However, a single imaging sensor has inherent limitations, so multi-modal images (for example RGB plus infrared) are needed to achieve all-weather target tracking in complex environments. Multi-modal images provide more comprehensive information and improve the accuracy and robustness of target detection and tracking, which makes multi-modal tracking important for higher-level computer vision applications.
However, existing multi-modal tracking methods face two main problems:
First, many multi-modal trackers are pre-trained on RGB sequences and then fully fine-tuned on multi-modal scenes. This is costly in time and compute, and the resulting performance is limited.
Second, as an alternative to full fine-tuning, some recent methods borrow parameter-efficient fine-tuning from natural language processing (NLP) and introduce prompt tuning into multi-modal tracking. These methods freeze the backbone network parameters and add a small set of learnable parameters.
Typically, these methods treat one modality (usually RGB) as the primary modality and the other as the auxiliary modality. This ignores the dynamic correlation between multi-modal data, so the complementary information available in complex scenes is not fully used, which limits tracking performance.
Figure 1: Different dominant modalities in complex scenarios.
To solve the above problems, researchers from Tianjin University proposed the Bi-directional Adapter for Multi-modal Tracking (BAT). Unlike traditional methods, BAT does not rely on a fixed dominant modality and a fixed auxiliary modality. Instead, it dynamically extracts effective information, so it performs better when the dominant and auxiliary roles switch between modalities. Because it can adapt to different data characteristics and task requirements, it improves the representation ability of the base model on downstream tasks. With BAT, the researchers aim to provide a more flexible and efficient multi-modal tracking solution.
BAT consists of two modality-specific branches, each a base model encoder with shared parameters, and a universal bidirectional adapter. During training, BAT does not fully fine-tune the base model but adopts a step-by-step training method. Each modality-specific branch is initialized from the base model with its parameters frozen, and only the newly added bidirectional adapters are trained. Each branch learns prompt information from the other modality and combines it with the features of its own modality to enhance its representation. The two modality-specific branches interact through the universal bidirectional adapter, dynamically fusing dominant and auxiliary information with each other to fit the non-fixed association between modalities. This design lets BAT be adapted without changing the pre-trained parameters of the base model, improving the model's representation ability and adaptability.
The universal bidirectional adapter uses a lightweight hourglass structure and can be embedded into each layer of the base model's transformer encoder without introducing a large number of learnable parameters. By adding only a small number of trainable parameters (0.32M), it has a lower training cost and achieves better tracking performance than fully fine-tuned methods and prompt learning-based methods.
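To see why an hourglass adapter stays small, the following sketch counts the parameters of three linear layers (down-projection, middle layer, up-projection) placed at two stages of every encoder layer. The token width of 768, the bottleneck width of 8, and the depth of 12 layers are assumptions chosen for illustration; none of them is stated in this article.

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out


def adapter_params(d_t: int, d_e: int) -> int:
    """Hourglass adapter: down-projection, middle linear layer, up-projection."""
    return (
        linear_params(d_t, d_e)
        + linear_params(d_e, d_e)
        + linear_params(d_e, d_t)
    )


# Assumed sizes, for illustration only (not taken from the article).
TOKEN_DIM = 768         # d_t: width of each token
BOTTLENECK_DIM = 8      # d_e: width inside the adapter
NUM_LAYERS = 12         # depth of the transformer encoder
ADAPTERS_PER_LAYER = 2  # one at the attention stage, one at the MLP stage

per_adapter = adapter_params(TOKEN_DIM, BOTTLENECK_DIM)
total = per_adapter * ADAPTERS_PER_LAYER * NUM_LAYERS
print(per_adapter, total)  # 13136 315264
```

Under these assumed sizes the total is about 0.32M parameters, the same order as the figure quoted above, while a single 768-by-768 dense layer alone would already need roughly 0.59M.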
The paper "Bi-directional Adapter for Multi-modal Tracking":
Paper link: https://arxiv.org/abs/2312.10611
Code link: https://github.com/SparkTempest/BAT
As shown in Figure 2, we propose a visual prompt framework for multi-modal tracking based on a bidirectional adapter (BAT). The framework has a dual-stream encoder structure with an RGB stream and a thermal infrared (TIR) stream, and both streams use the same base model parameters. The bidirectional adapter is set up in parallel with the dual-stream encoder layers to cross-prompt features between the two modalities.
The method does not fully fine-tune the base model. It transfers the pre-trained RGB tracker to multi-modal scenes efficiently by learning only a lightweight bidirectional adapter, and achieves strong multi-modal complementarity and tracking accuracy.
Figure 2: Overall architecture of BAT.
First, the template frame of each modality (the first frame, which contains the initial target object) and the search frames (the subsequent images to be tracked) are converted into tokens. The tokens are concatenated and passed to the N-layer dual-stream transformer encoder of the corresponding modality.
The bidirectional adapter is set up in parallel with the dual-stream encoder layers to learn feature prompts from one modality for the other. Finally, the output features of the two branches are added and fed into the prediction head H to obtain the final tracking box B.
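The data flow described above can be sketched with NumPy. This is a toy illustration, not the authors' implementation: the encoder layer, the adapter, and the prediction head are hypothetical stand-ins (a real encoder layer is a transformer block), all sizes are arbitrary, and the exact point where the prompt is added inside a layer is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, DIM = 4, 16          # toy sizes, for illustration only
NUM_TEMPLATE, NUM_SEARCH = 2, 4  # template tokens and search tokens

# Frozen weights shared by both streams (stand-in for the base model).
shared_layers = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(NUM_LAYERS)]
# One trainable adapter per layer, used in both directions (stand-in matrix).
adapters = [rng.normal(scale=0.01, size=(DIM, DIM)) for _ in range(NUM_LAYERS)]
head_weight = rng.normal(scale=0.1, size=(DIM, 4))


def encoder_layer(tokens, weight):
    """Placeholder for one frozen transformer encoder layer."""
    return tokens + np.tanh(tokens @ weight)


def dual_stream_forward(rgb_tokens, tir_tokens):
    for weight, adapter in zip(shared_layers, adapters):
        # Each stream produces a prompt for the other stream.
        prompt_for_tir = rgb_tokens @ adapter
        prompt_for_rgb = tir_tokens @ adapter
        rgb_tokens = encoder_layer(rgb_tokens, weight) + prompt_for_rgb
        tir_tokens = encoder_layer(tir_tokens, weight) + prompt_for_tir
    # The outputs of the two branches are added before the prediction head.
    return rgb_tokens + tir_tokens


def prediction_head(features):
    """Hypothetical head H: pool the tokens and squash to a box in [0, 1]."""
    pooled = features.mean(axis=0)
    return 1.0 / (1.0 + np.exp(-(pooled @ head_weight)))


def make_tokens():
    """Concatenate template tokens and search tokens of one modality."""
    template = rng.normal(size=(NUM_TEMPLATE, DIM))
    search = rng.normal(size=(NUM_SEARCH, DIM))
    return np.concatenate([template, search], axis=0)


rgb, tir = make_tokens(), make_tokens()
fused = dual_stream_forward(rgb, tir)
box = prediction_head(fused)  # B: four normalized box values
print(fused.shape, box.shape)  # (6, 16) (4,)
```

Because both streams share the same weights and the same adapter in this sketch, swapping the two inputs leaves the fused features unchanged, which mirrors the idea that neither modality is hard-wired as the dominant one.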
The bidirectional adapter adopts a modular design and is embedded in both the multi-head self-attention stage and the MLP stage, as shown on the right side of Figure 2. Its structure is designed to transfer feature prompts from one modality to the other. It consists of three linear projection layers. With t_n denoting the number of tokens in each modality, the input tokens are first reduced to dimension d_e by a down-projection, passed through a linear projection layer, and then projected back up to the original dimension d_t. The result is fed as a feature prompt to the transformer encoder layer of the other modality.
With this simple structure, the bidirectional adapter can effectively pass feature prompts between modalities to achieve multi-modal tracking.
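A minimal NumPy sketch of such an adapter follows, using the three linear layers and the symbols t_n, d_e, and d_t from the description above. The class name, the toy sizes, the random initialization, and the zero-initialized up-projection are assumptions made for illustration. The sketch also reads "universal" as one set of weights serving both directions, and it uses no activation function because none is mentioned in the text.

```python
import numpy as np


class BidirectionalAdapter:
    """Hourglass adapter from three linear layers: d_t -> d_e -> d_e -> d_t."""

    def __init__(self, d_t: int, d_e: int, rng: np.random.Generator):
        self.w_down = rng.normal(scale=0.02, size=(d_t, d_e))
        self.b_down = np.zeros(d_e)
        self.w_mid = rng.normal(scale=0.02, size=(d_e, d_e))
        self.b_mid = np.zeros(d_e)
        # Assumption: zero-initialized up-projection, so prompts start at zero.
        self.w_up = np.zeros((d_e, d_t))
        self.b_up = np.zeros(d_t)

    def __call__(self, tokens: np.ndarray) -> np.ndarray:
        hidden = tokens @ self.w_down + self.b_down  # (t_n, d_t) -> (t_n, d_e)
        hidden = hidden @ self.w_mid + self.b_mid    # (t_n, d_e) -> (t_n, d_e)
        return hidden @ self.w_up + self.b_up        # (t_n, d_e) -> (t_n, d_t)


rng = np.random.default_rng(0)
t_n, d_t, d_e = 5, 32, 4  # toy sizes, for illustration only
adapter = BidirectionalAdapter(d_t, d_e, rng)

rgb_tokens = rng.normal(size=(t_n, d_t))
tir_tokens = rng.normal(size=(t_n, d_t))

# The same adapter turns each modality's tokens into a prompt for the other.
prompt_for_tir = adapter(rgb_tokens)
prompt_for_rgb = adapter(tir_tokens)
tir_with_prompt = tir_tokens + prompt_for_tir
rgb_with_prompt = rgb_tokens + prompt_for_rgb
print(prompt_for_tir.shape)  # (5, 32)
```

With the zero-initialized up-projection assumed here, the prompts are exactly zero before training, so each stream initially behaves like the unmodified base model.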
Since the transformer encoder and the prediction head are frozen, only the parameters of the newly added adapter need to be optimized. Notably, unlike most traditional adapters, our bidirectional adapter acts as a cross-modal feature prompt for dynamically changing dominant modalities, ensuring good tracking performance in the open world.
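The effect of freezing can be illustrated with a toy gradient step that updates only the adapter parameters. The parameter names, shapes, and the plain SGD rule below are hypothetical and chosen for illustration; the article does not describe the optimizer.

```python
import numpy as np

# Toy parameter table; names and shapes are hypothetical.
params = {
    "encoder.weight": np.ones((4, 4)),  # frozen base model
    "head.weight": np.ones((4, 2)),     # frozen prediction head
    "adapter.down": np.ones((4, 2)),    # trainable
    "adapter.up": np.ones((2, 4)),      # trainable
}


def is_trainable(name: str) -> bool:
    """Only the newly added adapter parameters are optimized."""
    return name.startswith("adapter.")


def sgd_step(params, grads, lr=0.1):
    """Apply one SGD update, skipping every frozen parameter."""
    for name, grad in grads.items():
        if is_trainable(name):
            params[name] = params[name] - lr * grad


# Pretend every parameter received a gradient of ones.
grads = {name: np.ones_like(value) for name, value in params.items()}
sgd_step(params, grads)

trainable_count = sum(v.size for n, v in params.items() if is_trainable(n))
print(trainable_count)  # 16
```

In a deep learning framework the same effect is usually obtained by disabling gradients for the frozen modules and passing only the adapter parameters to the optimizer.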
As shown in Table 1, the comparison on the RGBT234 and LasHeR datasets shows that our method outperforms state-of-the-art methods in both precision and success rate. As shown in Figure 3, the comparison with state-of-the-art methods under different scene attributes of the LasHeR dataset also demonstrates the superiority of the proposed method.
These experiments show that our dual-stream tracking framework and bidirectional adapter successfully track targets in most complex environments, adaptively extract effective information from dynamically changing dominant-auxiliary modalities, and achieve state-of-the-art performance.
Table 1: Overall performance on RGBT234 and LasHeR datasets.
Figure 3: Comparison of BAT and competing methods under different attributes in the LasHeR dataset.
Experiments demonstrate that our method dynamically prompts effective information from changing dominant-auxiliary patterns in complex scenarios. As shown in Figure 4, compared with related methods that fix the dominant modality, our method tracks the target effectively even when RGB is completely unavailable, and it tracks much better when both RGB and TIR provide effective information in subsequent scenes. Our bidirectional adapter dynamically extracts effective features of the target from both the RGB and TIR modalities, captures more accurate target response locations, and eliminates interference from the RGB modality.
Figure 4: Visualization of tracking results.
We also evaluate our method on an RGB-E (RGB-event) tracking dataset. As shown in Figure 5, compared with other methods on the VisEvent test set, our method produces the most accurate tracking results in different complex scenarios, demonstrating the effectiveness and generalization ability of our BAT model.
Figure 5: Tracking results on the VisEvent dataset.
Figure 6: Attention weight visualization.
We visualize the attention weights on the tracked target at different layers in Figure 6. Compared with the baseline-dual method (a dual-stream framework initialized with the base model parameters), BAT effectively drives the auxiliary modality to learn more complementary information from the dominant modality, while maintaining the effectiveness of the dominant modality as the network depth increases, thereby improving overall tracking performance.
Experiments show that BAT successfully captures multi-modal complementary information and achieves sample-adaptive dynamic tracking.
BAT, presented at AAAI 2024, is described as the first universal bidirectional adapter for multi-modal target tracking.