One trick to spot large models that cheat on leaderboards: a PhD student open-sources a 'demon-revealing mirror' for AI math


WBOY

Nov 17, 2023 pm 12:38 PM


Nowadays, many large models claim to be good at mathematics. Which ones have real ability, and which ones "cheated" by memorizing the test questions?

Someone ran a comprehensive test using this year's Hungarian national mathematics final exam, whose questions had only just been published, and many models suddenly showed their true colors.


Look at the green group first: these large models score similarly on the classic mathematics test set GSM8k and on the new exam, and together they form the reference standard.

Now look at the red group: their GSM8k results are significantly higher than those of other large models with the same parameter scale, but on the new exam their scores drop sharply, to about the same level as other models of the same size. The researchers classified them as "suspected or known to have been trained on GSM8k".

After seeing this test, some people said that evaluation should move to questions the models have never seen before.

Others think that this kind of test, together with people's hands-on experience of using large models, is currently the only reliable evaluation method.
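The green-versus-red comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure Paster used: the least-squares fit, the `tolerance` threshold, the `Result` type, and every score below are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    gsm8k: float      # GSM8k accuracy, percent
    new_exam: float   # score on the unseen exam, percent
    reference: bool   # True for models treated as the trusted baseline


def fit_line(points):
    """Ordinary least-squares fit y = a * x + b over (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def flag_suspects(results, tolerance=10.0):
    """Return names of models whose new-exam score falls more than
    `tolerance` points below what their GSM8k score predicts."""
    baseline = [(r.gsm8k, r.new_exam) for r in results if r.reference]
    a, b = fit_line(baseline)
    return [r.name for r in results
            if not r.reference and (a * r.gsm8k + b) - r.new_exam > tolerance]


# Made-up numbers for illustration only; NOT the scores from the test.
results = [
    Result("ref-small", 20.0, 15.0, True),
    Result("ref-medium", 50.0, 35.0, True),
    Result("ref-large", 90.0, 65.0, True),
    Result("model-x", 80.0, 25.0, False),
    Result("model-y", 60.0, 40.0, False),
]
print(flag_suspects(results))
```

A model lands on the suspect list only when its GSM8k score is high relative to its score on the unseen exam, which is the pattern the red group shows. The threshold is a judgment call, and a large gap is a reason for suspicion rather than proof of contamination.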

Musk's Grok is second only to GPT-4, and the open-source Llemma scores well

The tester, Keiran Paster, is a PhD student at the University of Toronto, a Google student researcher, and one of the authors of Llemma, one of the large models in the test.

The trick of having large models sit the Hungarian national high school mathematics final exam comes from Musk's xAI. To rule out the possibility that its Grok model had accidentally seen test questions in web data, xAI ran this exam in addition to several common test sets.

This year's exam was only completed at the end of May, so current large models have essentially had no chance to see this set of questions.

In its release, xAI also published the results of GPT-3.5, GPT-4, and Claude 2 for comparison.

Building on this data, Paster ran further tests on several open-source models with strong mathematical capabilities.

The test questions, the test scripts, and each model's answers are open-sourced on Hugging Face so that anyone can check them and test other models.

The results show that GPT-4 and Claude 2 form the top tier, with very high scores on both GSM8k and the new exam.

This does not mean that the training data of GPT-4 and Claude 2 contains no leaked GSM8k questions, but at least these models generalize well and solve new questions correctly, so any leakage matters little.

Next, Grok-0 (33B) and Grok-1 (parameter count not published) from Musk's xAI performed well.

Grok-1 has the highest score in the "non-cheating group", and its score on the new exam is even higher than that of Claude 2.

Grok-0's performance on GSM8k is close to that of GPT-3.5 Turbo, and it is slightly worse on the new exam.

Apart from the closed models above, all other models in the test are open source.

The Code Llama series was fine-tuned by Meta itself on top of Llama 2 and focuses on generating code from natural language. Its mathematical ability now appears to be slightly worse than that of other models of the same scale.

Building on Code Llama, several universities and research institutions jointly launched the Llemma series, which was open-sourced by EleutherAI. The team assembled the Proof-Pile-2 dataset from scientific papers, web data containing mathematics, and mathematical code. After training, Llemma can use tools and do formal theorem proving without any further fine-tuning.

On the new exam, Llemma 34B performs close to the GPT-3.5 Turbo level.

The Mistral series is trained by the French AI unicorn Mistral AI. Its Apache 2.0 open-source license is more permissive than Llama's, and it has become the most popular base model in the open-source community after the Llama family.

OpenChat 3.5 and MetaMath Mistral are both fine-tuned within the Mistral ecosystem.

MetaMath and MAmmoTH Code are based on the Code Llama ecosystem.

Anyone adopting open-source large models in real business should be careful to avoid the group suspected of training on GSM8k: such models may score well only to climb the leaderboards, while their actual capabilities may be weaker than those of other models of the same scale.

Many netizens thanked Paster for this experiment, saying that it is exactly what is needed to understand what the models can actually do.

Some people have expressed concerns:

From this day on, everyone who trains large models will add Hungarian math exam questions from previous years.

The same commenter believes that the solution may be a specialized large-model evaluation company with proprietary tests.

Another proposal is to establish a test benchmark that is updated year by year to alleviate the overfitting problem.
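One way to read the yearly-update proposal is that each model should only be scored on exam editions published after its training data was collected. The sketch below illustrates that selection step; the function name `unseen_editions`, the publication dates, and the training cutoff are all hypothetical values chosen for the example.

```python
from datetime import date


def unseen_editions(editions, training_cutoff):
    """Keep only exam editions published strictly after the training cutoff."""
    return sorted(year for year, published in editions.items()
                  if published > training_cutoff)


# Hypothetical publication dates and cutoff, for illustration only.
editions = {
    2021: date(2021, 5, 31),
    2022: date(2022, 5, 31),
    2023: date(2023, 5, 31),
}
print(unseen_editions(editions, training_cutoff=date(2022, 9, 1)))
```

With these assumed dates, only the 2023 edition counts as unseen. In practice a model's training cutoff is often not published, which limits how strictly such a filter can be applied.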
