Highly recommended: Visual-RFT, an open-source visual reinforcement fine-tuning project that empowers vision-language models!
The Visual-RFT (Visual Reinforcement Fine-Tuning) project successfully applies the rule-reward-based reinforcement learning and reinforcement fine-tuning (RFT) paradigm to large vision-language models (LVLMs), breaking through the limitation of previous methods to text, mathematics, and similar domains. By designing task-specific rule rewards for visual fine-grained classification, object detection, and other tasks, Visual-RFT offers a new approach to LVLM training!
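To make the idea of a "rule reward" concrete: for classification it can be as simple as an exact-match check against the ground-truth label, so no learned reward model is needed. The function below is a minimal sketch of this idea; the function name and string normalization are our own assumptions, not the project's exact implementation.

```python
def cls_reward(predicted: str, ground_truth: str) -> float:
    """Rule-based classification reward: 1.0 if the predicted category
    matches the ground-truth label, else 0.0.

    A minimal sketch; the actual Visual-RFT reward may normalize or
    parse the model output differently."""
    return 1.0 if predicted.strip().lower() == ground_truth.strip().lower() else 0.0
```

Because the reward is computed by a fixed rule rather than a learned model, it is directly verifiable and cannot be gamed by reward-model exploitation.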
Figure 1 shows the powerful generalization ability of Visual-RFT: the model requires only a small amount of data to accurately identify a specific Pokémon and locate its coordinates.
Figure 1. Visual-RFT extends reinforcement fine-tuning to the multimodal domain, significantly improving model performance with only 10 to 1,000 training samples.
From RFT to Visual-RFT: A Reinforcement Learning Breakthrough in the Multimodal Domain
OpenAI's reinforcement fine-tuning technology enables capability transfer with only a small number of samples. DeepSeek-R1 revealed that its powerful reasoning ability stems from a reinforcement learning strategy based on verifiable rewards. Previously, however, this strategy was applied mainly to domains such as text and mathematics. Visual-RFT extends it to the visual domain: by constructing verifiable rule rewards, it overcomes the limitations of traditional methods on visual tasks and achieves efficient, highly generalizable visual understanding and reasoning.
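DeepSeek-R1's verifiable-reward strategy is commonly implemented with GRPO, which samples a group of responses per prompt, scores each with the rule reward, and computes group-relative advantages for the policy update. The snippet below sketches that advantage computation under this assumption; it is an illustration, not the project's training code.

```python
import numpy as np

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO-style advantages: normalize each sampled response's
    verifiable reward against the group mean and standard deviation.

    Sketch only; real implementations add ratio clipping, a KL
    penalty against a reference model, etc."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 4 responses sampled for one image-question pair,
# each scored by a rule reward (e.g., IoU or classification match).
print(group_relative_advantages([1.0, 0.0, 0.5, 0.0]))
```

Responses that score above the group average receive positive advantages and are reinforced; below-average ones are suppressed, which is what lets a small number of verifiable samples drive learning.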
Traditional visual instruction fine-tuning (SFT) requires large amounts of data; Visual-RFT's few-shot learning ability gives it a clear advantage in data-scarce scenarios.
To verify Visual-RFT's generalization ability, the research team evaluated it on multiple visual tasks, including object detection, classification, and grounding. The results show that Visual-RFT achieves significant performance gains under open-vocabulary, few-shot, and other settings, outperforming the SFT baseline. In reasoning grounding tasks in particular, Visual-RFT demonstrates excellent visual reasoning ability. (See the paper for details.)
Figure 2. Visual-RFT significantly surpasses SFT on multiple visual tasks.
Figure 3. Visual-RFT framework diagram: model parameters are updated using IoU and cls rewards with a reinforcement learning strategy.
For detection and grounding tasks, the research team used IoU-based verifiable rewards; for classification tasks, they used cls rewards based on classification correctness (as shown in Figure 3).
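The IoU reward measures how well a predicted box overlaps the ground truth, so it too can be computed directly from coordinates. A minimal sketch follows; the (x1, y1, x2, y2) box format is our assumption, and the actual reward in the paper may combine IoU with additional terms such as a format check.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_reward(pred_box, gt_box) -> float:
    """Verifiable detection/grounding reward: the IoU between the
    predicted and ground-truth box. Sketch only."""
    return iou(pred_box, gt_box)
```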
Figure 4. Reasoning grounding results show that Visual-RFT surpasses SFT, locating objects more accurately.
Figure 5. Reasoning fine-grained classification results show that Visual-RFT surpasses SFT, classifying objects more accurately.
Figures 4 and 5 show the model's outputs. With its reinforcement learning strategy, Visual-RFT performs in-depth reasoning before answering and achieves better performance than SFT.
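Because the policy reasons before answering, DeepSeek-R1-style training typically also scores response structure. Assuming a `<think>...</think><answer>...</answer>` template (our assumption; the project's exact tags may differ), a hedged sketch of a format reward and answer extractor:

```python
import re

# Assumed response template: "<think>...</think><answer>...</answer>".
_ANSWER_RE = re.compile(r"<think>.*</think>\s*<answer>(.*)</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response follows the reasoning-then-answer
    template, else 0.0. The exact tags are our assumption."""
    return 1.0 if _ANSWER_RE.search(response) else 0.0

def extract_answer(response: str) -> str | None:
    """Pull out the final answer so a task reward
    (IoU or classification match) can be computed on it."""
    m = _ANSWER_RE.search(response)
    return m.group(1).strip() if m else None
```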
Visual-RFT experimental results
Based on the Qwen2-VL 2B/7B models, Visual-RFT comprehensively surpasses SFT on open-vocabulary object detection, few-shot detection, fine-grained classification, and reasoning grounding tasks. The experiments cover common scenarios such as COCO and LVIS as well as open scenarios such as internet cartoon characters. With only a small amount of data, Visual-RFT achieves capability transfer, showing excellent performance and robustness.
Figure 6. Selected experimental results show that Visual-RFT significantly surpasses SFT.
Visual-RFT is open source!
The Visual-RFT project is open source, with training and evaluation code and data included. Contributions are welcome!
Project address: https://www.php.cn/link/ec56522bc9c2e15be17d11962eeec453