
Visual enhancement fine-tuning! DeepSeek R1 technology has been successfully migrated to multimodal field and is fully open to source

Linda Hamilton
Release: 2025-03-12 13:12:02

Highlight: Visual-RFT, an open-source visual reinforcement fine-tuning project that empowers vision-language models!


The AIxiv column continues to cover top AI research worldwide and has published more than 2,000 academic and technical articles. Contributions sharing your outstanding work are welcome! Submission emails: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

The Visual-RFT (Visual Reinforcement Fine-Tuning) project successfully applies the rule-reward-based reinforcement learning and reinforcement fine-tuning (RFT) paradigm to large vision-language models (LVLMs), breaking through the previous limitation of these methods to text, mathematics, and similar domains. By designing task-specific rule rewards for tasks such as fine-grained visual classification and object detection, Visual-RFT offers a new approach to LVLM training!

Figure 1 shows the strong generalization ability of Visual-RFT: with only a small amount of data, the model can accurately identify a specific Pokémon in an image and locate its coordinates.


Figure 1. Visual-RFT extends reinforcement fine-tuning to the multimodal domain; only 10 to 1,000 training samples are needed to significantly improve model performance.

From RFT to Visual-RFT: A Breakthrough for Reinforcement Learning in the Multimodal Domain

OpenAI's reinforcement fine-tuning technique enables capability transfer with only a small number of samples. DeepSeek-R1 revealed that its powerful reasoning ability stems from a reinforcement learning strategy based on verifiable rewards. However, this strategy had previously been applied mainly to domains such as text and mathematics. Visual-RFT successfully extends it to the visual domain: by constructing verifiable rule rewards, it overcomes the limitations of traditional methods on visual tasks and achieves efficient, highly generalizable visual understanding and reasoning.
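Reinforcement learning with verifiable rewards in the DeepSeek-R1 line of work typically optimizes the policy with GRPO, where the reward of each sampled response is normalized within its sampling group. As a rough illustration only (not the authors' code; the function name is hypothetical), the group-relative advantage computation can be sketched as:

```python
import math

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: each sampled response's
    verifiable reward is standardized against the group's mean and std.
    Responses scoring above the group average get positive advantages."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0:  # all responses scored the same: no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one visual query, scored by a binary rule reward:
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

These advantages then weight the policy-gradient update, so no learned reward model is needed: the rule check alone supplies the training signal.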

Traditional visual instruction fine-tuning (SFT) requires large amounts of data; Visual-RFT's few-shot learning ability gives it a clear advantage in data-scarce scenarios.

To verify the generalization ability of Visual-RFT, the research team ran tests on multiple visual tasks, including object detection, classification, and grounding. The results show that Visual-RFT achieves significant performance gains under open-vocabulary, few-shot, and other settings, outperforming the SFT baseline. In reasoning-based grounding tasks in particular, Visual-RFT demonstrates excellent visual reasoning capabilities. (See the paper for details.)


Figure 2. Visual-RFT significantly surpasses SFT on multiple visual tasks.


Figure 3. The Visual-RFT framework: model parameters are updated with IoU and classification (cls) rewards via a reinforcement learning strategy.

The research team used verifiable IoU-based rewards for detection and grounding tasks, and classification-correctness (cls) rewards for classification tasks (as shown in Figure 3).


Figure 4. Reasoning-grounding results: Visual-RFT surpasses SFT and locates objects more accurately.


Figure 5. Reasoning-based fine-grained classification results: Visual-RFT surpasses SFT and classifies objects more accurately.

Figures 4 and 5 show the models' outputs. Through its reinforcement learning strategy, Visual-RFT performs in-depth reasoning during analysis and achieves better performance than SFT.

Visual-RFT experimental results

Built on the Qwen2-VL 2B/7B models, Visual-RFT comprehensively surpasses SFT on open-vocabulary object detection, few-shot detection, fine-grained classification, and reasoning-grounding tasks. The experiments cover common benchmarks such as COCO and LVIS as well as open-domain scenarios such as Internet cartoon characters. With only a small amount of data, Visual-RFT achieves capability transfer, showing excellent performance and robustness.


Figure 6. Selected experimental results showing that Visual-RFT significantly surpasses SFT.

Visual-RFT is open source!

The Visual-RFT project is open source and includes training code, evaluation code, and data. Everyone is welcome to participate!

Project address: https://www.php.cn/link/ec56522bc9c2e15be17d11962eeec453

