ICLR 2024 Spotlight | No need to worry about intermediate steps, MUSTARD can generate high-quality mathematical inference data-AI-php.cn

ICLR 2024 Spotlight | 无惧中间步骤，MUSTARD可生成高质量数学推理数据

The AIxiv column is a column where this site publishes academic and technical content. In the past few years, the AIxiv column of this site has received more than 2,000 reports, covering top laboratories from major universities and companies around the world, effectively promoting academic exchanges and dissemination. If you have excellent work that you want to share, please feel free to contribute or contact us for reporting. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com.

In recent years, large language models (LLM) have made great progress in tasks such as mathematical application problems and mathematical theorem proving. Mathematical reasoning requires a rigorous, formalized multi-step reasoning process and is therefore a key milestone in the advancement of LLMs' reasoning capabilities, but it still faces important challenges.

Previous research work, such as Chain of Thoughts (CoT), revealed the effectiveness of intermediate stepsguidance. However, manually annotating such intermediate steps requires a lot of manpower and time costs, and automatically synthesized data is also prone to problems in correctness and human readability.

In this article, researchers from City University of Hong Kong, Sun Yat-sen University, Huawei Noah's Ark Laboratory and other institutions propose a unified mathematical reasoning data synthesis frameworkMUSTARD, which can generate a large number of correct and human Readable and understandable high-quality mathematical reasoning data.

Paper title: MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data
Paper link: https://openreview.net/forum?id=8xliOUg9EW
Code link: https:/// /github.com/Eleanor-H/MUSTARD
Dataset link: https://drive.google.com/file/d/1yIVAVqpkC2Op7LhisG6BJJ_-MavAMr1B/view
Author homepage: https://eleanor- h.github.io/

A high-quality data synthesis framework utilizing formal provers

MUSTARD framework consists of three phases:

Phase 1, Concept acquisition : Firstly, a mathematical concept library is defined and established, covering concepts in the four educational stages of primary school, junior high school, high school and higher education. Each educational stage has 5 to 9 mathematical areas, covering different types of mathematical problems such as algebra and geometry. . Each area contains broken down mathematical concepts, such as polynomial operations or factoring. Then one or more mathematical concepts are extracted from the mathematical concept database as seeds to specify the generated question categories.

The second stage, data generation: Prompt large language models to generate mathematical problems and multi-step solution processes based on mathematical concepts. Specifically, MUSTARD leverages the ability of large language models to generate natural language and code, prompting large language models to complete three tasks: (T1) Generate mathematical problems related to a given concept; (T2) Give solutions to the problems in natural language ; (T3) Automatic formalization, converting natural language solution into Lean 3 formal solution.

The third stage, Formal verification: Use the verification of the interactive formal theorem prover to screen out the accurate solution process. After MUSTARD delivers the formal solution of Lean 3 to the Lean formal verifier, if the theorem prover does not return error information, the corresponding data will be collected into the valid set. Otherwise, MUSTARD collects error messages from the theorem prover and prompts the language model to modify the formal solution. MUSTARD performs multiple rounds of verification and self-correction until a valid formal solution is obtained. The MUSTARD framework consists of three stages: concept collection, data generation and formal verification.

Human evaluation of data quality

In order to explore the quality of the data generated by MUSTARD, the research team asked professionals with mastery of mathematics and Lean 3 language to perform a quality check on the data. They randomly selected 200 items from the generated data, of which 100 items passed the verification of the Lean theorem prover (valid group) and 100 items did not pass the verification (invalid group). The quality check covers four parts of each piece of data (i.e., natural language problem description, natural language solution, formal problem description, and formal solution), including checks for correctness and consistency. Specifically, high-quality data should have correct natural language problem description (D1) and correct problem solving (D4). Formal problem description and solution should be consistent with natural language problem description and solution (D5, D6). In addition, the data should conform to the specified mathematical concepts (D2) and problem types (D3). Table 3 shows these six inspection dimensions and requirements. If the data meets the requirements, it gets a score of 1 in the dimension, otherwise it gets a score of 0.

Table 3 shows the accuracy and corresponding p-value of the effective group and the invalid group in each dimension. The significant difference between (D1) and (D4) illustrates the correctness of the questions and answers generated by MUSTARD. The significant difference in (D6) indicates the high consistency between the natural language description and the formal description of the generated data.

Effectiveness of data on the mathematical reasoning ability of the model

To evaluate the impact of MUSTARDSAUCE on improving mathematical reasoning ability, the research team used these data to fine-tune a smaller-scale language model and performed It is assessed on Mathematical Word Problems (MWP) and Automatic Theorem Proving (ATP). This article compares the effectiveness of the following combined data of the MUSTARDSAUCE data set:

MUSTARDSAUCE-valid: 5866 pieces of data verified by the Lean formal prover;
MUSTARDSAUCE-invalid: failed to pass Lean 5866 pieces of data verified by the formal prover;
MUSTARDSAUCE-random: 5866 pieces of random data;
MUSTARDSAUCE-tt: all 28316 pieces of data generated by MUSTARD.

The research team adopted LoRA [1] to fine-tune the open source GPT2-large [2], Llama 2-7B and Llama 2-70B [3] on each combined data. For mathematical word problem tasks, they used the GSM8K [4] and MATH [5][6] datasets for evaluation. When evaluating automated theorem proving, the research team used the Mathlib [8] and miniF2F [7] benchmarks. Furthermore, they were also evaluated on the MUSTARDSAUCE-test.

In general, fine-tuning the model on MUSTARDSAUCE improves the mathematical reasoning capabilities of the model. In automatic theorem proving (Table 5 below) and mathematical application problem solving (Table 4 below), using MUSTARDSAUCE-valid for fine-tuning compared with using MUSTARDSAUCE-random for fine-tuning, the average relative performance increased by 18.15% (Table 5 below) and 11.01% % (Table 4 below).

For automatic theorem proving, the average performance improvement of fine-tuned Llama 2-7B is 15.41%, and the average performance improvement of fine-tuned GPT 2-large is 20.89%.

For solving mathematical application problems, the average performance of fine-tuned Llama 2-7B is improved by 8.18%, and the average performance of fine-tuned GPT 2-large is improved by 15.41%. In addition, although the model fine-tuned with MUSTARDSAUCE-tt has an absolute advantage in the amount of fine-tuned data, its performance is not as good as the model fine-tuned with MUSTARDSAUCE-valid. More results for

Llama 2-70B. The MUSTARDSAUCE data remains valid when fine-tuning larger language models.

MUSTARDSAUCE data set

This article open sourced the MUSTARDSAUCE data set. Each of the data contains problem description and multi-step solution in natural language, as well as problem description and multi-step solution in the dual formal language Lean 3. The data of MUSTARDSAUCE includes mathematical application questions and theorem proving questions, covering difficulty levels from primary school to higher education. The number of reasoning steps for a question increases with the difficulty of the question. The most difficult questions require about 30 steps to solve, and about 20 Lean 3 tactics.

데이터 세트 다운로드: https://drive.google.com/file/d/1yIVAVqpkC2Op7LhisG6BJJ_-MavAMr1B/view

자동 형식화/비공식화 챌린지

연구 팀은 또한 MUSTARDSAUCE 데이터 세트의 자연어와 Lean 형식 언어의 이중 데이터를 기반으로 자동 형식화(autoformalization) 및 자동 비공식화(auto-informalization) 챌린지를 진행합니다. 또한 연구팀은 자동화된 정리 생성 및 증명, 코드를 이용한 자동 최적화 문제 해결이라는 두 가지 챌린지 트랙을 동시에 개설했습니다. 대회 기간은 2024년 4월 3일부터 5월 27일까지입니다. 우승팀은 7월 26일 오스트리아 비엔나에서 열리는 수학을 위한 ICML 2024 AI 워크숍에 참가할 기회를 갖게 됩니다.

트랙 1-1(자동 비공식화): https://www.codabench.org/competitions/2436/
트랙 1-2(자동 비공식화): https: //www.codabench .org/competitions/2484/
트랙 2(자동 정리 생성 및 증명): https://www.codabench.org/competitions/2437/
트랙 3(운영 연구 최적화의 코드 지원 자동 솔루션 문제): https://www.codabench.org/competitions/2438/

^{참고 자료:}

^{[1] Edward J Hu, Yelong Shen , Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang 및 Weizhu Chen. arXiv 사전 인쇄 arXiv:2106.09685, 2021.}

^{[2] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever 등 언어 모델은 비지도 다중 작업 학습자입니다. OpenAI 블로그, 1 (8):9, 2019.}

^{[3] Hugo Touvron, Louis Martin, Kevin Stone , Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko-lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux , Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aure ́lien Rodriguez, Robert Stojnic , Sergey Edunov 및 Thomas Scialom 2: 개방형 기반 및 미세 조정된 채팅 모델, abs/2307.09288, 2023. doi: 10.48550/arXiv.2307.09288 URL https://doi. .}

^{[4] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse 및 John Schulman 교육 검증자. CoRR, abs/2110.14168, 2021.}

^{[5] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song 및 Jacob Steinhardt. MATH 데이터 세트를 사용하여 해결. In Joaquin Vanschoren 및 Sai-Kit Yeung(eds.), 데이터 세트 및 벤치마크에 대한 신경 정보 처리 시스템 트랙 1, NeurIPS 데이터 세트 및 벤치마크 2021, 2021년 12월, 가상, 2021.}

^{[6] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever 및 Karl Cobbe arXiv 사전 인쇄 arXiv:2305.20050, 2023. .}

^{[7] Kunhao Zheng, Jesse Michael Han 및 Stanislas Polu: 공식 올림피아드 수준 수학에 대한 교차 시스템 벤치마크, ICLR 2022, 가상 이벤트, 2022년 4월 25-29일. OpenReview.net, 2022.}

[8] https://github.com/leanprover-community/mathlib

The above is the detailed content of ICLR 2024 Spotlight | No need to worry about intermediate steps, MUSTARD can generate high-quality mathematical inference data. For more information, please follow other related articles on the PHP Chinese website!