


ACL 2024 | In the mathematical evaluation of 25 open and closed source models, GPT-3.5-Turbo barely passed

The AIxiv column is a column where this site publishes academic and technical content. In the past few years, the AIxiv column of this site has received more than 2,000 reports, covering top laboratories from major universities and companies around the world, effectively promoting academic exchanges and dissemination. If you have excellent work that you want to share, please feel free to contribute or contact us for reporting. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
Paper title: GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers
Paper address: https://arxiv.org/pdf/2402.19255
Paper homepage: https://qtli.github.io/GSM-Plus/
- Numerical substitution: replace a numeric value with another of the same digit count and type, e.g. replace "16" with "20" in the question.
- Digit expansion: increase the number of digits in a value, e.g. replace "16" with "1600".
- Integer-decimal-fraction conversion: replace an integer with a decimal or fraction, e.g. convert "2" to "2.5".
- Operation expansion: add a constraint to the original problem, e.g. the new condition "She also uses two eggs to make homemade hair masks every day."
- Operation reversal: turn a known condition of the original problem into the variable to be solved in the GSM-Plus variant. For example, the statement "2 US dollars per duck egg" in the original question of Figure 2 becomes the new question's interrogative "What is the price of each duck egg?", while the original question's interrogative "How many dollars does she earn at the farmers' market every day?" becomes the new problem's known condition "She earns $18 a day at the farmers' market."
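As a rough illustration, the numeric perturbations above can be scripted. This is a minimal sketch under my own assumptions (the regex-based substitution and the example sentence are not from the paper, which uses careful human quality control rather than fully automatic rewriting):

```python
import re
from fractions import Fraction

def numerical_substitution(question: str, old: str, new: str) -> str:
    """Replace one numeric value with another of the same type, e.g. '16' -> '20'."""
    return re.sub(rf"\b{re.escape(old)}\b", new, question, count=1)

def digit_expansion(question: str, value: str, factor: int = 100) -> str:
    """Increase the number of digits in a value, e.g. '16' -> '1600'."""
    return numerical_substitution(question, value, str(int(value) * factor))

def to_fraction(question: str, value: str) -> str:
    """Integer-to-fraction conversion, e.g. '2' -> '5/2'."""
    return numerical_substitution(question, value, str(Fraction(int(value) * 2 + 1, 2)))

q = "Janet's ducks lay 16 eggs per day."
print(numerical_substitution(q, "16", "20"))  # Janet's ducks lay 20 eggs per day.
print(digit_expansion(q, "16"))               # Janet's ducks lay 1600 eggs per day.
```

Surface rewrites like these keep the required reasoning identical (or nearly so), which is exactly why a robustness gap on them is informative.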
- GSM-Plus features fine-grained evaluation: compared with GSM8K, the problem variants of GSM-Plus are more challenging, and the performance of every LLM in the evaluation drops significantly. The analysis below examines the problem-solving robustness of LLMs under each type of perturbation.
Table 1: Different colors represent different perturbation types: numerical substitution, digit expansion, integer-decimal-fraction conversion, operation expansion, operation reversal, problem understanding, distractor insertion, critical thinking.
As the table above shows, previous studies tested the robustness of mathematical reasoning with various perturbations, but each evaluation setting covers only some perturbation types, and most introduce perturbations through automatic construction, so quality is hard to guarantee. In contrast, GSM-Plus perturbs every problem with eight distinct mathematical reasoning skills, giving more comprehensive coverage under strict quality control.

Experimental Analysis

Evaluation metrics:

- Performance drop rate (PDR): the degree to which an LLM's performance falls on the perturbed problem relative to the original problem.
- Percentage of simultaneously solved pairs (ASP): the proportion of problem pairs in which the LLM answers both the original problem and its variant correctly.
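Under one plausible reading of these two definitions (the paper's exact formulas are not reproduced here), the metrics can be computed as:

```python
def pdr(acc_orig: float, acc_pert: float) -> float:
    """Performance drop rate: relative accuracy drop on perturbed problems."""
    return (acc_orig - acc_pert) / acc_orig

def asp(pairs: list[tuple[bool, bool]]) -> float:
    """Fraction of (original, variant) pairs where both are answered correctly."""
    return sum(1 for orig_ok, var_ok in pairs if orig_ok and var_ok) / len(pairs)

# A model dropping from 92% accuracy on GSM8K to 84% on GSM-Plus:
print(f"{pdr(0.92, 0.84):.2%}")  # 8.70%
print(asp([(True, True), (True, False), (False, False), (True, True)]))  # 0.5
```

Note the two metrics answer different questions: PDR measures the aggregate drop, while ASP measures per-problem consistency, which aggregate accuracy can hide.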
Overall performance

As shown in the table below, the performance of most LLMs on GSM-Plus drops significantly compared with GSM8K. GPT-4 shows the highest robustness, with the smallest PDR of only 8.23%. CodeLlama shows the largest PDRs: 40.56%, 39.71%, and 34.27% for the 7B, 13B, and 34B models respectively, exceeding its base model LLaMA-2-7B (39.49%) as well as mathematical SFT models fine-tuned on it, such as SEGO-7B (34.91%). This suggests that reasoning expressed only in a programming language is vulnerable to perturbations.

Facing mathematical perturbations, larger models tend to perform more stably. Although supervised fine-tuning improves accuracy on downstream tasks, it does not significantly enhance robustness to perturbations (i.e., it does not lower PDR). The data used for supervised fine-tuning matters for robustness: models all fine-tuned from LLaMA-2 but on different data differ greatly in both accuracy and robustness.

Table 2: Overall performance

Performance of LLMs under perturbation

This paper further evaluates the stability of LLMs under the eight types of problem variants. Compared with the human baseline, LLM performance drops significantly under the critical thinking (purple), operation expansion and operation reversal (blue), distractor insertion (pink), and integer-decimal-fraction conversion (orange) perturbations. Under "numerical substitution" and "problem understanding" perturbations, LLM performance is stable or even slightly improves.

The preceding analysis is based on the entire dataset. Next, the article splits the two datasets according to whether each math problem is answered correctly, and analyzes whether an LLM's successfully solving a GSM8K problem implies a higher probability of correctly answering its GSM-Plus variants (i.e., a high ASP value), and vice versa.
If this holds, the LLM can be considered to perform stably on that particular subset of math problems, even if it does not on the entire dataset. In the experimental setup, each GSM8K problem and its variants in GSM-Plus are grouped into 8 problem pairs, with results shown in Figure 4.

Figure 4: Reasoning transferability of LLMs between GSM8K and GSM-Plus problem pairs. Purple (both correct) and blue (both incorrect) bars indicate consistent model behavior, while red (GSM8K correct & GSM-Plus incorrect) and yellow (GSM8K incorrect & GSM-Plus correct) bars indicate inconsistent behavior. The sum of the heights of the purple and red bars equals the number of GSM8K problems a model solves correctly.
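The four bar categories in Figure 4 amount to a simple cross-tabulation of per-pair correctness. A hypothetical tally (the pair data below is invented for illustration, not taken from the paper's results):

```python
from collections import Counter

def transfer_breakdown(pairs):
    """Count the four Figure-4 categories from (gsm8k_correct, plus_correct) pairs."""
    labels = {
        (True, True): "both correct",      # purple: consistent
        (False, False): "both incorrect",  # blue: consistent
        (True, False): "GSM8K only",       # red: limited transferability
        (False, True): "GSM-Plus only",    # yellow
    }
    return Counter(labels[p] for p in pairs)

pairs = [(True, True), (True, False), (True, False), (False, False), (False, True)]
counts = transfer_breakdown(pairs)
print(counts["GSM8K only"])                             # 2
print(counts["both correct"] + counts["GSM8K only"])    # 3 (= GSM8K problems solved)
```

In this framing, the "GSM8K only" count is exactly the red bar the next paragraph discusses.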
The presence of red bars (models that answer the original question correctly but fail the variant) indicates that most models have limited performance transferability. Although LLMs differ in performance on GSM8K problems (the combined height of the purple and red bars), their transferability is similar (the height of the red bars). This means existing benchmarks cannot accurately assess a model's true capability in mathematical reasoning: high accuracy does not equal strong reasoning robustness.

Do prompts help the performance robustness of LLMs?

Previous work has shown that good prompt instructions are important for eliciting the mathematical ability of language models. This article selects 4 representative models and tests their problem-solving performance under different prompt instructions. As shown in the figure below, when facing perturbations, LLMs perform most stably with complex examples as in-context demonstrations (complexity-based CoT); in contrast, when only a programming language is used to represent the intermediate reasoning (Program-of-Thought), LLMs are more susceptible to perturbations. Overall, these prompting techniques are not enough for LLMs to maintain their GSM8K-level performance on GSM-Plus.

Figure 5: The impact of prompts on the performance robustness of LLMs

Is a combined prompt effective?
How can the robustness of LLMs be enhanced on top of existing prompting methods? This article finds that LLMs often ignore important conditions or make calculation errors while solving problems. To address this, the paper explores Comp, a combined prompting method. The method first prompts the LLM to extract the numerically relevant necessary conditions in the problem (Prompt1). Next, based on the problem and the critical conditions, the LLM is instructed to iteratively generate reasoning goals (Prompt2) and calculation goals (Prompt3), and to give feedback on the problem-solving steps generated so far to determine whether the final answer has been reached (Prompt4). The specific implementation is shown in Figure 6.
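A minimal sketch of how the four prompts might be chained; the prompt wordings and the `llm` callable are placeholders of my own, not the paper's exact implementation:

```python
def comp_solve(llm, problem: str, max_steps: int = 10) -> str:
    """Iterative generate-and-verify loop in the spirit of Comp.

    `llm` is any callable mapping a prompt string to a model reply string.
    """
    # Prompt1: extract the numerically relevant necessary conditions.
    conditions = llm(f"Extract the necessary numerical conditions:\n{problem}")
    steps: list[str] = []
    for _ in range(max_steps):
        # Prompt2: propose the next reasoning goal given the history so far.
        goal = llm(f"Problem: {problem}\nConditions: {conditions}\n"
                   f"Steps so far: {steps}\nNext reasoning goal:")
        # Prompt3: carry out the calculation for that goal.
        result = llm(f"Compute this goal, showing the arithmetic: {goal}")
        steps.append(f"{goal} -> {result}")
        # Prompt4: self-verify whether the final answer has been reached.
        verdict = llm(f"Problem: {problem}\nSteps: {steps}\n"
                      f"Is the final answer reached? Answer YES or NO.")
        if verdict.strip().upper().startswith("YES"):
            return result
    return steps[-1] if steps else ""
```

The verification step (Prompt4) is what gives the loop a chance to catch ignored conditions or arithmetic slips before committing to an answer.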
It can be seen that Comp improves the performance of LLMs across the various problem-variant types through iterative generation and self-verification, but it still cannot close the performance gap between the standard and adversarial test sets. This research looks forward to future methods that further improve model robustness and advance LLMs in the field of mathematical reasoning.

Table 3: Performance of the iterative Comp prompting method

For a question rewritten by GSM-Plus, GPT-3.5-Turbo's performance under different prompting techniques: although every prompt leads Turbo to answer the GSM8K question accurately, only Comp helps Turbo generate the correct answer on the GSM-Plus variant question.

This article introduces GSM-Plus, an adversarial evaluation set of elementary-school math word problems designed to systematically test the mathematical problem-solving robustness of LLMs. Experimental analysis finds that, when facing perturbations, the performance of most LLMs drops significantly compared with their performance on standard benchmarks, far below the human level. The researchers hope this work will promote more future research, including but not limited to: (1) systematic evaluation of the mathematical skills of LLMs; (2) building models that can perform mathematical reasoning flexibly.

Reference links:
[1] Cobbe, Karl, et al. com/sota/arithmetic-reasoning-on-gsm8k
[2] George Polya. 2004. How to Solve It: A New Aspect of Mathematical Method, volume 85. Princeton University Press.

