A 7B open-source model whose mathematical ability surpasses the hundred-billion-parameter-scale GPT-4!
Its performance arguably breaks through the ceiling of open-source models; even researchers at Alibaba's Tongyi team wondered aloud whether the scaling law has failed.
Without any external tools, it achieves an accuracy of 51.7% on the competition-level MATH dataset.
Among open-source models, it is the first to exceed 50% accuracy on this benchmark, even surpassing early and API versions of GPT-4.
The result stunned the open-source community, with Stability AI founder Emad Mostaque praising the R&D team as "impressive" and its potential as "underestimated".
This is DeepSeekMath, the latest open-source 7B math model from the DeepSeek team.
To evaluate DeepSeekMath's mathematical ability, the research team tested it on bilingual benchmarks: Chinese (MGSM-zh, CMATH) and English (GSM8K, MATH).
Without any auxiliary tools, relying only on chain-of-thought (CoT) prompting, DeepSeekMath outperformed other open-source models, including the 70B math model MetaMath.
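For readers who want to try the CoT setting themselves, here is a minimal sketch in Python. It assumes the publicly released Hugging Face checkpoint deepseek-ai/deepseek-math-7b-instruct; the model ID and the exact prompt wording are assumptions, not quoted from the paper.

```python
# Minimal chain-of-thought (CoT) generation sketch.
# Assumes the Hugging Face checkpoint "deepseek-ai/deepseek-math-7b-instruct";
# the prompt wording is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-math-7b-instruct"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "What is the sum of the first 50 positive even integers?"
# CoT prompting: ask the model to reason step by step, with no external tools.
prompt = f"{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```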
Compared with the 67B general-purpose model previously released by the same company, DeepSeekMath's results are also markedly better.
Counting closed-source models, DeepSeekMath also beats Gemini Pro and GPT-3.5 on several datasets, surpasses GPT-4 on the Chinese CMATH benchmark, and comes close to it on MATH.
Keep in mind, though, that GPT-4 is, according to leaked specifications, a behemoth with hundreds of billions of parameters, while DeepSeekMath has only 7B.
If a tool (a Python interpreter) is allowed to assist, DeepSeekMath's score on the competition-difficulty MATH dataset rises by another 7 percentage points.
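The tool-assisted setting can be approximated with program-aided prompting: the model writes a short Python program, and the final answer comes from executing it. A rough sketch follows; the prompt format and the extraction logic are illustrative, not the paper's exact harness.

```python
# Program-aided ("tool use") sketch: have the model emit Python, then run it.
# The extraction regex and prompt are illustrative; the paper's harness may differ.
import re
import subprocess
import sys

def solve_with_python(generate_fn, question: str) -> str:
    prompt = (
        f"{question}\n"
        "Write a Python program that prints the final answer."
    )
    completion = generate_fn(prompt)  # e.g. a wrapper around model.generate() from the CoT sketch
    # Pull out the code block if the model wrapped it in markdown fences.
    match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
    code = match.group(1) if match else completion
    # Execute the generated program in a subprocess and capture its stdout.
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, timeout=10
    )
    return result.stdout.strip()
```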
So what technologies lie behind DeepSeekMath's excellent performance?
To obtain better mathematical ability than a general-purpose base model would provide, the research team initialized it from the code model DeepSeek-Coder-v1.5.
That choice came from a finding: whether in a two-stage or a one-stage training setting, code training improves the model's mathematical ability compared with training on general data.
Building on Coder, the research team continued training for another 500 billion tokens (the paper gives the full breakdown of the data mix).
For training data, DeepSeekMath uses 120B tokens of high-quality mathematical web-page data extracted from Common Crawl, forming the DeepSeekMath Corpus, roughly 9 times the size of the open-source dataset OpenWebMath.
Data collection was carried out iteratively: after four rounds, the team had gathered more than 35 million mathematical web pages totaling 120 billion tokens.
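The paper describes recalling math pages from Common Crawl with a fastText classifier that is retrained each round on the pages recalled so far. Below is a much-simplified sketch of that loop; the threshold, labels, and negative-sampling scheme are placeholders, not the paper's exact settings.

```python
# Simplified sketch of iterative math-page recall from Common Crawl.
# The labels, threshold, and negative sampling are placeholders.
import random
import fasttext

def build_math_corpus(seed_texts, cc_texts, n_rounds=4, threshold=0.8):
    """Iteratively recall math pages from Common Crawl (simplified sketch)."""
    positives = list(seed_texts)  # e.g. OpenWebMath pages as the initial seeds
    for _ in range(n_rounds):
        # 1. Build a fastText training file: current positives vs. random negatives.
        negatives = random.sample(cc_texts, min(len(positives), len(cc_texts)))
        with open("train.txt", "w") as f:
            for t in positives:
                f.write("__label__math " + t.replace("\n", " ") + "\n")
            for t in negatives:
                f.write("__label__other " + t.replace("\n", " ") + "\n")
        clf = fasttext.train_supervised(input="train.txt")

        # 2. Score every Common Crawl page and keep the confidently mathematical ones.
        recalled = []
        for t in cc_texts:
            labels, probs = clf.predict(t.replace("\n", " "))
            if labels[0] == "__label__math" and probs[0] > threshold:
                recalled.append(t)

        # 3. Newly recalled pages join the positives for the next round.
        positives = list(dict.fromkeys(positives + recalled))
    return positives
```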
To ensure the training data contains no test-set content (questions from GSM8K and MATH appear widely on the web), the team also applied dedicated filtering.
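A common way to do such decontamination is an exact n-gram overlap check against the benchmark questions and answers. The sketch below uses a 10-token window, which is an assumed value rather than the paper's exact criterion.

```python
# N-gram decontamination sketch: drop any training page that shares a long
# exact n-gram with a benchmark question or answer. The window size (10) is
# an assumed value; the paper defines its own exact filtering criteria.
def ngrams(text: str, n: int = 10):
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_pages, benchmark_texts, n: int = 10):
    banned = set()
    for t in benchmark_texts:  # e.g. GSM8K / MATH / CMATH questions and answers
        banned |= ngrams(t, n)
    return [page for page in train_pages if not (ngrams(page, n) & banned)]
```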
To verify the quality of the DeepSeekMath Corpus, the team trained models on 150 billion tokens from it and from several other datasets such as MathPile; the Corpus-trained model came out clearly ahead on multiple mathematical benchmarks.
In the alignment stage, the team first built a 776K-sample Chinese-and-English mathematical instruction dataset for supervised fine-tuning (SFT), covering three formats: chain-of-thought (CoT), program-of-thought (PoT), and tool-integrated reasoning.
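To make the three formats concrete, here is roughly what one fine-tuning record could look like in each. These samples are invented for illustration and are not drawn from the actual 776K dataset.

```python
# Illustrative examples of the three SFT answer formats (invented samples,
# not taken from the actual 776K dataset).
sft_samples = [
    {   # Chain-of-thought (CoT): natural-language step-by-step reasoning.
        "question": "What is 15% of 80?",
        "answer": "15% of 80 is 0.15 * 80 = 12. The answer is 12.",
    },
    {   # Program-of-thought (PoT): the reasoning is a program whose output is the answer.
        "question": "What is 15% of 80?",
        "answer": "```python\nprint(0.15 * 80)\n```",
    },
    {   # Tool-integrated reasoning: interleaved text and executable code blocks.
        "question": "What is 15% of 80?",
        "answer": "We compute 15% of 80 with Python:\n```python\nprint(0.15 * 80)\n```\nThe result is 12.0, so the answer is 12.",
    },
]
```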
In the reinforcement learning (RL) stage, the team used an efficient algorithm called Group Relative Policy Optimization (GRPO).
GRPO is a variant of proximal policy optimization (PPO) in which the traditional value function is replaced by a group-based relative reward estimate, reducing the computational and memory requirements of training.
At the same time, GRPO is trained iteratively: the reward model is continually updated from the policy model's outputs, which keeps the policy improving.
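The core idea is compact: for each question, sample a group of outputs from the current policy, score them with the reward model, and use each output's reward normalized within its group as the advantage, so no separate value (critic) network is needed. Below is a minimal PyTorch-style sketch of that group-relative advantage plus a PPO-style clipped loss; the KL penalty against a reference model and the iterative reward-model updates described above are omitted, so this is a simplified illustration rather than the paper's full objective.

```python
import torch

def grpo_loss(logprobs_new, logprobs_old, rewards, clip_eps=0.2):
    """
    Simplified GRPO loss for a single question.
    logprobs_new / logprobs_old: (G,) summed token log-probs of G sampled outputs
        under the current policy and the sampling (old) policy.
    rewards: (G,) scalar rewards from the reward model for the same G outputs.
    The paper's full objective also adds a KL penalty against a reference model,
    which is omitted here for brevity.
    """
    # Group-relative advantage: normalize rewards within the group,
    # replacing the value function used in standard PPO.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate objective using the group-based advantages.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

In the full pipeline, the rewards in this sketch come from a reward model that is itself periodically retrained on fresh policy samples, which is the iterative loop mentioned above.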
DeepSeek, the team behind DeepSeekMath, is a leading player among domestic open-source model makers.
The team previously released DeepSeek MoE, the first domestic open-source MoE model, whose 7B version beat the dense model Llama 2 of the same scale while using 40% of the compute.
As a general-purpose model, DeepSeek MoE already performs impressively on coding and mathematical tasks, and its resource consumption is very low.
On the code side, the programming ability of DeepSeek-Coder, also released by the team, exceeds that of CodeLlama, the open-source benchmark of the same scale.
It also beat GPT-3.5-Turbo, making it the open-source code model closest to GPT-4-Turbo.
As mentioned above, the newly released DeepSeekMath is likewise built on top of Coder.
On X, some people are already looking forward to the MoE versions of Coder and Math.
Paper address: https://arxiv.org/abs/2402.03300