


ACL 2024 | In the mathematical evaluation of 25 open and closed source models, GPT-3.5-Turbo barely passed

The AIxiv column is a column where this site publishes academic and technical content. In the past few years, the AIxiv column of this site has received more than 2,000 reports, covering top laboratories from major universities and companies around the world, effectively promoting academic exchanges and dissemination. If you have excellent work that you want to share, please feel free to contribute or contact us for reporting. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
Paper title: GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers
Paper address: https://arxiv.org/pdf/2402.19255
Paper homepage: https://qtli.github.io/GSM-Plus/
- Numerical substitution: replace a numeric value with another of the same digit count and type, e.g. replace "16" with "20" in the question.
- Digit expansion: increase the number of digits in a value, e.g. replace "16" with "1600".
- Integer-decimal-fraction conversion: replace an integer with a decimal or fraction, e.g. convert "2" to "2.5".
- Operation expansion: add a constraint to the original problem, e.g. the new condition "She also uses two eggs to make homemade hair masks every day."
- Operation reversal: turn a known condition of the original problem into the variable to be solved in the GSM-Plus variant. For example, the statement "2 US dollars per duck egg" in the original question of Figure 2 becomes the new question's interrogative "What is the price of each duck egg?", while the original question's interrogative "How many dollars does she earn at the farmers' market every day?" becomes the new problem's known condition "She earns $18 a day at the farmers' market."
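As a rough illustration, the numeric perturbations above can be scripted. This is a minimal sketch under my own assumptions (the regex-based substitution and the example sentence are not from the paper, which uses careful human quality control rather than fully automatic rewriting):

```python
import re
from fractions import Fraction

def numerical_substitution(question: str, old: str, new: str) -> str:
    """Replace one numeric value with another of the same type, e.g. '16' -> '20'."""
    return re.sub(rf"\b{re.escape(old)}\b", new, question, count=1)

def digit_expansion(question: str, value: str, factor: int = 100) -> str:
    """Increase the number of digits in a value, e.g. '16' -> '1600'."""
    return numerical_substitution(question, value, str(int(value) * factor))

def to_fraction(question: str, value: str) -> str:
    """Integer-to-fraction conversion, e.g. '2' -> '5/2'."""
    return numerical_substitution(question, value, str(Fraction(int(value) * 2 + 1, 2)))

q = "Janet's ducks lay 16 eggs per day."
print(numerical_substitution(q, "16", "20"))  # Janet's ducks lay 20 eggs per day.
print(digit_expansion(q, "16"))               # Janet's ducks lay 1600 eggs per day.
```

Surface rewrites like these keep the required reasoning identical (or nearly so), which is exactly why a robustness gap on them is informative.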
- GSM-Plus features fine-grained evaluation: compared with GSM8K, the problem variants of GSM-Plus are more challenging, and the performance of every LLM in the evaluation drops significantly. The analysis below examines the problem-solving robustness of LLMs under each type of perturbation.
Table 1: Different colors represent different perturbation types: numerical substitution, digit expansion, integer-decimal-fraction conversion, operation expansion, operation reversal, problem understanding, distractor insertion, critical thinking.
As the table above shows, previous studies tested the robustness of mathematical reasoning with various perturbations, but each evaluation setting covers only some perturbation types, and most introduce perturbations through automatic construction, so quality is hard to guarantee. In contrast, GSM-Plus perturbs every problem with eight distinct mathematical reasoning skills, giving more comprehensive coverage under strict quality control.

Experimental Analysis

Evaluation metrics:

- Performance drop rate (PDR): the degree to which an LLM's performance falls on the perturbed problem relative to the original problem.
- Percentage of simultaneously solved pairs (ASP): the proportion of problem pairs in which the LLM answers both the original problem and its variant correctly.
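Under one plausible reading of these two definitions (the paper's exact formulas are not reproduced here), the metrics can be computed as:

```python
def pdr(acc_orig: float, acc_pert: float) -> float:
    """Performance drop rate: relative accuracy drop on perturbed problems."""
    return (acc_orig - acc_pert) / acc_orig

def asp(pairs: list[tuple[bool, bool]]) -> float:
    """Fraction of (original, variant) pairs where both are answered correctly."""
    return sum(1 for orig_ok, var_ok in pairs if orig_ok and var_ok) / len(pairs)

# A model dropping from 92% accuracy on GSM8K to 84% on GSM-Plus:
print(f"{pdr(0.92, 0.84):.2%}")  # 8.70%
print(asp([(True, True), (True, False), (False, False), (True, True)]))  # 0.5
```

Note the two metrics answer different questions: PDR measures the aggregate drop, while ASP measures per-problem consistency, which aggregate accuracy can hide.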
Overall performance

As shown in the table below, the performance of most LLMs on GSM-Plus drops significantly compared with GSM8K. GPT-4 shows the highest robustness, with the smallest PDR of only 8.23%. CodeLlama shows the largest PDRs: 40.56%, 39.71%, and 34.27% for the 7B, 13B, and 34B models respectively, exceeding its base model LLaMA-2-7B (39.49%) as well as mathematical SFT models fine-tuned on it, such as SEGO-7B (34.91%). This suggests that reasoning expressed only in a programming language is vulnerable to perturbations.

Facing mathematical perturbations, larger models tend to perform more stably. Although supervised fine-tuning improves accuracy on downstream tasks, it does not significantly enhance robustness to perturbations (i.e., it does not lower PDR). The data used for supervised fine-tuning matters for robustness: models all fine-tuned from LLaMA-2 but on different data differ greatly in both accuracy and robustness.

Table 2: Overall performance

Performance of LLMs under perturbation

This paper further evaluates the stability of LLMs under the eight types of problem variants. Compared with the human baseline, LLM performance drops significantly under the critical thinking (purple), operation expansion and operation reversal (blue), distractor insertion (pink), and integer-decimal-fraction conversion (orange) perturbations. Under "numerical substitution" and "problem understanding" perturbations, LLM performance is stable or even slightly improves.

The preceding analysis is based on the entire dataset. Next, the article splits the two datasets according to whether each math problem is answered correctly, and analyzes whether an LLM's successfully solving a GSM8K problem implies a higher probability of correctly answering its GSM-Plus variants (i.e., a high ASP value), and vice versa.
If this holds, the LLM can be considered to perform stably on that particular subset of math problems, even if it does not on the entire dataset. In the experimental setup, each GSM8K problem and its variants in GSM-Plus are grouped into 8 problem pairs, with results shown in Figure 4.

Figure 4: Reasoning transferability of LLMs between GSM8K and GSM-Plus problem pairs. Purple (both correct) and blue (both incorrect) bars indicate consistent model behavior, while red (GSM8K correct & GSM-Plus incorrect) and yellow (GSM8K incorrect & GSM-Plus correct) bars indicate inconsistent behavior. The sum of the heights of the purple and red bars equals the number of GSM8K problems a model solves correctly.
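The four bar categories in Figure 4 amount to a simple cross-tabulation of per-pair correctness. A hypothetical tally (the pair data below is invented for illustration, not taken from the paper's results):

```python
from collections import Counter

def transfer_breakdown(pairs):
    """Count the four Figure-4 categories from (gsm8k_correct, plus_correct) pairs."""
    labels = {
        (True, True): "both correct",      # purple: consistent
        (False, False): "both incorrect",  # blue: consistent
        (True, False): "GSM8K only",       # red: limited transferability
        (False, True): "GSM-Plus only",    # yellow
    }
    return Counter(labels[p] for p in pairs)

pairs = [(True, True), (True, False), (True, False), (False, False), (False, True)]
counts = transfer_breakdown(pairs)
print(counts["GSM8K only"])                             # 2
print(counts["both correct"] + counts["GSM8K only"])    # 3 (= GSM8K problems solved)
```

In this framing, the "GSM8K only" count is exactly the red bar the next paragraph discusses.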
The presence of red bars (models that answer the original question correctly but fail the variant) indicates that most models have limited performance transferability. Although LLMs differ in performance on GSM8K problems (the combined height of the purple and red bars), their transferability is similar (the height of the red bars). This means existing benchmarks cannot accurately assess a model's true capability in mathematical reasoning: high accuracy does not equal strong reasoning robustness.

Do prompts help the performance robustness of LLMs?

Previous work has shown that good prompt instructions are important for eliciting the mathematical ability of language models. This article selects 4 representative models and tests their problem-solving performance under different prompt instructions. As shown in the figure below, when facing perturbations, LLMs perform most stably with complex examples as in-context demonstrations (complexity-based CoT); in contrast, when only a programming language is used to represent the intermediate reasoning (Program-of-Thought), LLMs are more susceptible to perturbations. Overall, these prompting techniques are not enough for LLMs to maintain their GSM8K-level performance on GSM-Plus.

Figure 5: The impact of prompts on the performance robustness of LLMs

Is a combined prompt effective?
How can the robustness of LLMs be enhanced on top of existing prompting methods? This article finds that LLMs often ignore important conditions or make calculation errors while solving problems. To address this, the paper explores Comp, a combined prompting method. The method first prompts the LLM to extract the numerically relevant necessary conditions in the problem (Prompt1). Next, based on the problem and the critical conditions, the LLM is instructed to iteratively generate reasoning goals (Prompt2) and calculation goals (Prompt3), and to give feedback on the problem-solving steps generated so far to determine whether the final answer has been reached (Prompt4). The specific implementation is shown in Figure 6.
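A minimal sketch of how the four prompts might be chained; the prompt wordings and the `llm` callable are placeholders of my own, not the paper's exact implementation:

```python
def comp_solve(llm, problem: str, max_steps: int = 10) -> str:
    """Iterative generate-and-verify loop in the spirit of Comp.

    `llm` is any callable mapping a prompt string to a model reply string.
    """
    # Prompt1: extract the numerically relevant necessary conditions.
    conditions = llm(f"Extract the necessary numerical conditions:\n{problem}")
    steps: list[str] = []
    for _ in range(max_steps):
        # Prompt2: propose the next reasoning goal given the history so far.
        goal = llm(f"Problem: {problem}\nConditions: {conditions}\n"
                   f"Steps so far: {steps}\nNext reasoning goal:")
        # Prompt3: carry out the calculation for that goal.
        result = llm(f"Compute this goal, showing the arithmetic: {goal}")
        steps.append(f"{goal} -> {result}")
        # Prompt4: self-verify whether the final answer has been reached.
        verdict = llm(f"Problem: {problem}\nSteps: {steps}\n"
                      f"Is the final answer reached? Answer YES or NO.")
        if verdict.strip().upper().startswith("YES"):
            return result
    return steps[-1] if steps else ""
```

The verification step (Prompt4) is what gives the loop a chance to catch ignored conditions or arithmetic slips before committing to an answer.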
It can be seen that Comp improves the performance of LLMs across the various problem-variant types through iterative generation and self-verification, but it still cannot close the performance gap between the standard and adversarial test sets. This research looks forward to future methods that further improve model robustness and advance LLMs in the field of mathematical reasoning.

Table 3: Performance of the iterative Comp prompting method

For a question rewritten by GSM-Plus, GPT-3.5-Turbo's performance under different prompting techniques: although every prompt leads Turbo to answer the GSM8K question accurately, only Comp helps Turbo generate the correct answer on the GSM-Plus variant question.

This article introduces GSM-Plus, an adversarial evaluation set of elementary-school math word problems designed to systematically test the mathematical problem-solving robustness of LLMs. Experimental analysis finds that, when facing perturbations, the performance of most LLMs drops significantly compared with their performance on standard benchmarks, far below the human level. The researchers hope this work will promote more future research, including but not limited to: (1) systematic evaluation of the mathematical skills of LLMs; (2) building models that can perform mathematical reasoning flexibly.

Reference links:
[1] Cobbe, Karl, et al. com/sota/arithmetic-reasoning-on-gsm8k
[2] George Polya. 2004. How to Solve It: A New Aspect of Mathematical Method, volume 85. Princeton University Press.

