As language models become more and more capable, existing evaluation benchmarks are starting to look too easy, yet on some tasks model performance still lags far behind humans.
An important feature of artificial general intelligence (AGI) is the ability to generalize to human-level tasks, yet traditional benchmarks built on artificial datasets do not accurately reflect human-level ability.
Recently, Microsoft researchers released AGIEval, a new benchmark designed specifically to evaluate foundation models' performance on "human-centric" standardized tests, such as college entrance exams, civil service exams, law school admission tests, mathematics competitions, and the bar exam.
Paper link: https://arxiv.org/pdf/2304.06364.pdf
Data link: https://github.com/microsoft/AGIEval
Using the AGIEval benchmark, the researchers evaluated three state-of-the-art foundation models: GPT-4, ChatGPT, and Text-Davinci-003. The results show that GPT-4 exceeds average human performance on the SAT, LSAT, and math competitions, reaching 95% accuracy on the SAT math test and 92.5% on the English test of the Chinese College Entrance Examination, demonstrating the extraordinary capabilities of current foundation models.
GPT-4 is, however, less adept at tasks that require complex reasoning or domain-specific knowledge; the researchers' comprehensive analysis of model capabilities across understanding, knowledge, reasoning, and computation reveals both the strengths and the limitations of these models.
In recent years, large foundation models such as GPT-4 have shown powerful capabilities across many fields. They can assist humans with everyday tasks and even provide decision-making advice in professional domains such as law, medicine, and finance.
In other words, artificial intelligence systems are gradually approaching artificial general intelligence (AGI).
But as AI becomes integrated into daily life, it is critical to evaluate models' human-centered generalization ability, identify potential flaws, ensure that they can effectively handle complex human-centered tasks, and assess their reasoning skills so that they remain reliable and trustworthy across different contexts.
The researchers constructed the AGIEval dataset following two main design principles:
1. Emphasis on human-level cognitive tasks
The main goal of the "human-centered" design is to focus on tasks closely related to human cognition and problem solving, so that the generalization ability of the underlying model can be assessed in a more meaningful and comprehensive way.
To achieve this goal, the researchers selected a variety of official, public, high-standard admissions and qualifying exams taken every year by millions of people seeking to enter higher education or start a new career path, including college entrance exams, law school admission tests, math exams, bar exams, and national civil service exams.
By adhering to these officially recognized standards for assessing human-level capabilities, AGIEval ensures that assessments of model performance are directly related to human decision-making and cognitive abilities.
2. Relevance to real-world scenarios
By drawing tasks from high-standard entrance and qualifying examinations, the benchmark ensures that assessment results reflect the complexity and practicality of the challenges individuals encounter across different fields and contexts.
This approach not only measures model performance against human cognitive abilities, but also gives a better picture of applicability and effectiveness in real life, helping to develop artificial intelligence systems that are more reliable, more practical, and better suited to solving a wide range of real-world problems.
Based on these design principles, the researchers selected a variety of standardized, high-quality exams that emphasize human-level reasoning and real-world relevance. Specifically, these include:
1. General College Entrance Examination
College entrance examinations cover a range of subjects that require critical thinking, problem-solving, and analytical skills, making them ideal for assessing the performance of large language models relative to human cognition.
These include the Graduate Record Examination (GRE), the Scholastic Assessment Test (SAT), and the Chinese College Entrance Examination (Gaokao), which assess the general abilities and subject-specific knowledge of students seeking admission to higher education institutions.
The dataset collects exams for eight subjects of the Chinese College Entrance Examination (history, mathematics, English, Chinese, geography, biology, chemistry, and physics), math questions from the GRE, and the English and mathematics sections of the SAT.
2. Law School Admission Test
Law school admission tests such as the LSAT are designed to measure the reasoning and analytical abilities of prospective law students. The exam includes sections on logical reasoning, reading comprehension, and analytical reasoning, and requires test takers to analyze complex information and draw accurate conclusions. These tasks can assess the legal reasoning and analytical skills of language models.
3. The Bar Examination
The bar examination evaluates the legal knowledge, analytical skills, and ethical understanding of individuals pursuing a legal career. It covers a wide range of topics, including constitutional law, contract law, criminal law, and property law, and requires candidates to demonstrate their ability to apply legal principles and reasoning effectively. This test can evaluate language model performance in contexts that demand professional legal knowledge and ethical judgment.
4. Graduate Management Admission Test (GMAT)
The GMAT is a standardized exam that assesses the analytical, quantitative, verbal, and integrated reasoning abilities of prospective business school students. It consists of an analytical writing assessment, integrated reasoning, quantitative reasoning, and verbal reasoning, and evaluates the test taker's critical thinking, data analysis, and communication skills.
5. High School Mathematics Competitions
These competitions cover a wide range of mathematical topics, including number theory, algebra, geometry, and combinatorics, and often present non-routine problems that require creative solutions.
These include the American Mathematics Competitions (AMC) and the American Invitational Mathematics Examination (AIME), which test students' mathematical ability, creativity, and problem-solving skills; they can further evaluate a language model's ability to handle complex, creative mathematical problems and to generate novel solutions.
6. The Chinese Civil Service Examination
The civil service examination assesses the competencies and skills of individuals seeking to enter the civil service. It tests general knowledge, reasoning ability, language skills, and expertise in subjects related to the roles and responsibilities of various civil service positions in China. It can measure how language models perform in public administration contexts and their potential in policy development, decision-making, and public service delivery.
The selected models include:
ChatGPT: a conversational AI model developed by OpenAI that can engage in dynamic interactions with users. It is trained on a massive instruction dataset and further tuned with reinforcement learning from human feedback (RLHF), enabling it to produce contextual, coherent replies that align with human expectations.
GPT-4: the fourth-generation GPT model, with a broader knowledge base and human-level performance in many application scenarios. GPT-4 was iteratively refined using adversarial testing and lessons learned from ChatGPT, yielding significant improvements in factuality, steerability, and adherence to constraints.
Text-Davinci-003: an intermediate version between GPT-3 and GPT-4 that, after instruction fine-tuning, performs better than GPT-3.
In addition, the experiments report the average score and the top score of human test takers as human-level reference points for each task, although these do not fully represent the range of skills and knowledge humans may possess.
Zero-shot/Few-shot evaluation
In the zero-shot setting, the model is evaluated on the question directly; in the few-shot setting, a small number of examples from the same task (for example, five) are placed in the prompt before the test question.
To further test the models' reasoning ability, the experiments also use chain-of-thought (CoT) prompting: the model is first prompted with "Let's think step by step" to generate an explanation for the given question, and then prompted with "Explanation is" to produce the final answer based on that explanation.
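To make the two-stage CoT prompting concrete, here is a minimal sketch in Python. The helper `query_model` and the exact prompt wording and formatting are illustrative assumptions; AGIEval's actual prompt templates may differ.

```python
def query_model(prompt: str) -> str:
    """Placeholder for a call to the language model (e.g. an API request)."""
    raise NotImplementedError

def answer_with_cot(question: str, choices: list[str]) -> str:
    # Format the answer options as (A), (B), (C), ...
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))

    # Stage 1: elicit a step-by-step explanation for the question.
    explain_prompt = f"{question}\n{options}\nLet's think step by step."
    explanation = query_model(explain_prompt)

    # Stage 2: condition on the explanation to extract the final answer.
    answer_prompt = (
        f"{question}\n{options}\n"
        f"Explanation is: {explanation}\n"
        "Therefore, the answer is"
    )
    return query_model(answer_prompt)
```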
The "multiple-choice questions" in the benchmark use standard classification accuracy; the "fill-in-the-blank questions" use exact matching (EM ) and F1 indicator.
The experimental results show the following:
1. GPT-4 significantly outperforms its counterparts in all task settings, including 93.8% accuracy on Gaokao-English and 95% accuracy on SAT-MATH, indicating excellent general capabilities on human-centered tasks.
2. ChatGPT significantly outperforms Text-Davinci-003 in tasks that require external knowledge, such as those involving geography, biology, chemistry, physics, and mathematics, indicating that ChatGPT has a stronger knowledge base and is better able to handle tasks that require a deep understanding of a specific domain.
On the other hand, in tasks that require pure understanding and do not rely heavily on external knowledge, such as the English and LSAT tasks, ChatGPT performs only slightly better than or on par with Text-Davinci-003 across assessment settings. This suggests that both models can handle tasks centered on language understanding and logical reasoning without requiring specialized domain knowledge.
3. Although the overall performance of these models is good, all language models perform poorly in complex reasoning tasks, such as MATH, LSAT-AR, GK-physics, and GK-Math, highlighting the limitations of these models in handling tasks that require advanced reasoning and problem-solving skills.
The observed difficulties with complex reasoning problems point to opportunities for future research and development aimed at improving the models' general reasoning capabilities.
4. Compared with zero-shot learning, few-shot learning usually brings only limited performance improvements, indicating that the zero-shot capabilities of current large language models are approaching their few-shot capabilities. This marks a big improvement over the original GPT-3, for which few-shot performance was much better than zero-shot.
A reasonable explanation for this development is the improved instruction tuning and human-feedback alignment in current language models. These improvements allow the models to better grasp the meaning and context of a task up front, so they perform well even in zero-shot settings, demonstrating the effectiveness of instruction tuning.