As Chinese large language models have demonstrated strong performance in natural language understanding and generation, existing Chinese evaluation benchmarks built for specific natural language processing tasks are no longer sufficient to evaluate them effectively. Traditional Chinese evaluation benchmarks mainly focus on a model's grasp of simple common sense (such as needing to bring an umbrella when going out on a rainy day) and superficial semantics (such as whether a basketball game report is sports or technology news), while ignoring the mining and use of complex human knowledge. At present, there is a lack of data sets for evaluating the complex knowledge of Chinese large models, especially professional knowledge at different levels and in different fields of China's education system.
To bridge this gap, the Tianjin University Natural Language Processing Laboratory and Huawei Noah's Ark Laboratory jointly released the M3KE (A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models) benchmark data set, which tests how well Chinese large models master multi-level, multi-disciplinary knowledge in zero-shot and few-shot settings.
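The difference between the zero-shot and few-shot settings can be illustrated with a small prompt builder. This is only a minimal sketch: the `Question` schema, the `Answer:` prompt layout, and the demonstration item are assumptions made for illustration and are not taken from the M3KE release. The target question is the primary school math example quoted later in this article.

```python
from dataclasses import dataclass
from typing import List, Sequence

LABELS = "ABCD"


@dataclass
class Question:
    """One four-option multiple-choice item (hypothetical schema)."""
    stem: str
    options: List[str]  # candidate answers in A-D order
    answer: str = ""    # gold label, left empty for the question being asked


def format_question(q: Question, with_answer: bool) -> str:
    """Render a question as text, optionally revealing its gold label."""
    if len(q.options) != len(LABELS):
        raise ValueError("expected exactly four candidate answers")
    lines = [q.stem]
    lines += [f"{label}. {text}" for label, text in zip(LABELS, q.options)]
    lines.append(f"Answer: {q.answer}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target: Question, demos: Sequence[Question] = ()) -> str:
    """Zero-shot when `demos` is empty, few-shot otherwise."""
    parts = [format_question(d, with_answer=True) for d in demos]
    parts.append(format_question(target, with_answer=False))
    return "\n\n".join(parts)


# Invented demonstration item, used only to show the few-shot layout.
demo = Question("What is 2 + 3?", ["4", "5", "6", "7"], answer="B")

target = Question(
    "The price of a product is first increased by 20%, and then reduced "
    "by 20%. How does the current price compare with the original price?",
    ["Higher", "Lower", "Unchanged", "Cannot be determined"],
)

zero_shot_prompt = build_prompt(target)
few_shot_prompt = build_prompt(target, [demo])
```

In the zero-shot case the model sees only the question itself. In the few-shot case the same question is preceded by solved examples, which is why the few-shot prompt always ends with the zero-shot prompt.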
Dataset Introduction
M3KE collects 20,477 real standardized test questions, each with 4 candidate answers, covering 71 tasks. The questions span elementary school, junior high school, high school, university, and graduate entrance examinations, and involve humanities, history, politics, law, education, psychology, science, engineering technology, art and other disciplines. The distribution is shown in Fig. 1.
Researchers constructed the M3KE data set based on two criteria:
1. In line with the Chinese education system, covering multiple education stages
The researchers followed the educational path of Chinese students, that is, primary school, junior high school, high school, university and the other major education stages, aiming to evaluate the performance of Chinese large models at different education stages. Since the knowledge points to be mastered differ from stage to stage (for example, in the Chinese subject, there are obvious differences in the knowledge or test points between primary school and junior high school), M3KE includes the same subject at different educational stages. To improve the coverage of subject knowledge points in the data set, the researchers selected questions from China's unified examinations, including the primary-to-junior-high entrance examination, high school entrance examination, college entrance examination, graduate entrance examination and Chinese civil service examination.
2. Covering multi-disciplinary fields
To improve the subject coverage of the data set, the researchers built it around three major categories, humanities and arts, social sciences, and natural sciences, which span literature, science, history, politics, law, education, psychology, engineering technology, art and other disciplines. To further enrich the data set, the researchers added tasks such as traditional Chinese medicine, religion, and computer grade examinations.
Dataset Statistics
Table 3 shows the overall statistics of M3KE. The four subject categories (humanities and arts, social sciences, natural sciences, and others) contain 12, 21, 31 and 7 tasks respectively, and 3,612, 6,222, 8,162 and 2,126 questions respectively. The largest task contains 425 questions and the smallest contains 100. Questions in social sciences and natural sciences are generally longer than questions in humanities and arts and other subjects, while their answer options are shorter.
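The per-category figures above can be sanity-checked with a few lines of arithmetic. The sketch below assumes that the numbers are listed in the order humanities and arts, social sciences, natural sciences, others (the order in which the article introduces the categories); the variable names are illustrative only.

```python
# Per-category counts quoted in the statistics paragraph.
# Assumption: the order is humanities and arts, social sciences,
# natural sciences, others.
tasks = {
    "humanities and arts": 12,
    "social sciences": 21,
    "natural sciences": 31,
    "others": 7,
}
questions = {
    "humanities and arts": 3612,
    "social sciences": 6222,
    "natural sciences": 8162,
    "others": 2126,
}

# The task counts should add up to the 71 tasks reported for M3KE.
total_tasks = sum(tasks.values())

# Average task size per category; each average must lie between the
# reported minimum (100) and maximum (425) number of questions per task.
MIN_PER_TASK, MAX_PER_TASK = 100, 425
avg_questions = {name: questions[name] / tasks[name] for name in tasks}

for name, avg in avg_questions.items():
    print(f"{name}: {tasks[name]} tasks, {avg:.1f} questions per task on average")
print(f"total tasks: {total_tasks}")
```

The task counts sum to 71, matching the number of tasks stated for the benchmark, and every category averages roughly 260 to 305 questions per task, inside the reported 100 to 425 range.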
Introduction and examples of M3KE from a multidisciplinary perspective
Humanities and Arts
The humanities and arts disciplines include subjects in multiple fields such as Chinese, art, and history. These subjects focus on the analysis and interpretation of literary and cultural artifacts. Taking primary school Chinese as an example, the test questions are designed to assess the language use and literary appreciation abilities of students aged 7 to 13, such as the ability to use synonyms and antonyms. The history subject covers Chinese and world history from ancient times to modern times. In addition to humanities, M3KE also includes art subjects, such as dance, art, music, film, etc. Art is an important part of human culture, and it is equally important to evaluate the performance of Chinese large models in the art field.
Art task example:
Which of the following statements about the Lascaux cave paintings is incorrect?
A. This mural was discovered in France
B. There are more than 100 animal images found
C. The time of discovery was 1940
D. The color of the mural is mainly black
Modern world history task example:
It took more than two centuries to get from the Dutch Revolution to the French Revolution, yet within only half a century after that, capitalism had begun to form a world system. This is mainly because:
A. The influence of the French Revolution was widely spread
B. The Vienna System intensified social conflicts in various countries
C. The Industrial Revolution rapidly increased the power of capitalism
D. Colonial rule spread across all continents of the world
Social Science
Social science focuses on the application of the humanities, such as law, politics, education, and psychology. Politics courses run through multiple education stages including junior high school, high school, university, and postgraduate education, while the other subjects are mainly university-level courses. Social sciences also include economics and management tasks. The test questions for these tasks are selected from the Economics Joint Examination and the Management Joint Examination in the Chinese Graduate Entrance Examination, and the knowledge involved covers microeconomics, macroeconomics, management, logic, etc.
Criminal Law Task Example:
A wants to kill B, so he puts poison into B's food. After B eats it, A regrets it, quickly explains the situation, and sends B to the hospital. On examination, the hospital finds that the "poison" administered by A was not toxic at all, and B is safe and sound. A's behavior constitutes:
A. Does not constitute a crime
B. Attempted crime
C. Crime discontinued
D. Completed crime
Principles of education task example:
What is the most basic and most commonly used research method in educational research?
A. Educational observational research
B. Educational survey research
C. Educational measurement research
D. Educational experimental research
Natural Science
Natural sciences include engineering, science, medicine and basic subjects such as mathematics, physics, chemistry and biology. These subjects often require complex computational, analytical and logical reasoning skills. In China's education system, the same subject involves different types of knowledge at different stages. For example, primary school mathematics focuses on basic arithmetic operations, while high school mathematics covers more advanced concepts such as sequences, derivatives, and geometry.
Animal Physiology Task Example:
Using procaine to anesthetize nerve fibers affects which characteristic of nerve fiber conduction excitation?
A. Physiological integrity
B. Insulation
C. Bidirectional conductivity
D. Relatively fatigue-free
Operating system task example:
The directory structure has a great impact on file retrieval efficiency. Which of the following is currently the most advanced directory structure?
A. Single-level directory
B. Two-level directory
C. Three-level directory
D. Tree directory
Others
Other types of tasks include religion, the Chinese civil service exam, the computer grade exam, etc. These tasks require knowledge that is not limited to a single level or discipline described above. For example, the Chinese civil service examination involves general knowledge, humanities, logic and more, so the researchers regard these tasks as an assessment of the comprehensive knowledge of Chinese large models.
Chinese Civil Service Examination Task Example:
Several previous studies have shown that eating chocolate increases the likelihood of heart disease in those who eat it. A new, more reliable study concludes that chocolate consumption is not associated with heart disease rates. It is estimated that after the results of this research are released, the consumption of chocolate will increase significantly. The above inference is based on which of the following assumptions?
A. Some people eat chocolate even though they know it increases the likelihood of heart disease
B. People have never believed that eating chocolate makes them more likely to suffer from heart disease
C. Now many people eat chocolate because they have not heard that chocolate can cause heart disease
D. Nowadays, many people do not eat chocolate simply because they believe that chocolate can induce heart disease
Traditional Chinese Medicine Task Example:
Ginseng greatly replenishes vital qi, but which medicine is often used as a substitute for it in chronic debilitating diseases?
A. Salvia
B. Codonopsis pilosula
C. Astragalus
D. Taizishen (Radix Pseudostellariae)
Introduction and examples of M3KE from the perspective of multiple education stages
The researchers divided the data set into stages according to the Chinese education system, including primary school, junior high school, high school, university and graduate entrance examinations. In addition, the researchers also selected some examination subjects outside the education system, such as computer grade examinations and Chinese civil service examinations.
Primary school
Primary school Chinese task example:
Which of the following groups of words is written entirely correctly?
A. The sound of nature, the flowing clouds and flowing water, the pen and the dragon and the snake, rummaging through boxes and cabinets
B. The mountains and flowing water, singing and dancing, the finishing touch, unique ideas
C. The sound lingers, the skills are clever, the pen is full of flowers, restless
D. Huang Zhongda Lu is vivid, lifelike, elite troops and reduced government
Primary school math task example:
The price of a product is first increased by 20%, and then reduced by 20%. How does the current price compare with the original price?
A. Higher
B. Lower
C. Unchanged
D. Cannot be determined
Junior high school
Junior high school Chinese task example:
Which of the following statements is correct?
A. "The Most Painful and the Most Happy" is selected from "Selected Works of Liang Qichao". The author, Liang Qichao, was a thinker and scholar of the Ming Dynasty
B. "Zou Ji satirizes the King of Qi and accepts advice" is selected from "Warring States Policy". "Warring States Policy" is a compilation of the strategies and opinions of lobbyists during the Warring States Period. It was compiled into thirty-three chapters by Liu Xiang of the Eastern Han Dynasty
C. Ci poetry is also called "long and short sentences", and its lines vary in length. It flourished in the Song Dynasty. Su Shi and Xin Qiji were representatives of the bold school, while Li Qingzhao was a representative of the graceful school
D. [...], which embodies the author's idea of sharing joy with the people
Junior high school politics task example:
The class is producing a blackboard newspaper on the theme of "advocating the spirit of the rule of law". Xiaolan is responsible for writing the "Practice Equality" section. Which of the following materials she collected is suitable for it?
A. Buses have priority seats for "the old, weak, sick and pregnant"
B. Middle school students take part in study activities at a revolutionary tradition education base
C. People's Liberation Army soldiers brave severe cold and heat to guard the borders of the motherland
D. Students use their holidays to remove small advertisements from the streets
High school
High school Chinese task example:
Shen Kuo said in "Mengxi Bi Tan": "The changes of heaven and earth, cold and heat, wind and rain, floods, droughts and locusts all follow laws." What is the philosophical meaning of this sentence?
A. Laws are the root cause of changes in objective things
B. Laws are objective and universal
C. Learn to look at problems from the perspective of connection
D. Learn to look at problems from the perspective of development
High school biology task example:
Environmental capacity depends on the environmental conditions in which a population lives. Which of the following statements is correct?
A. The environmental capacity of the gray magpie populations in two places must be the same
B. The environmental capacity of East Asian migratory locusts living in a certain grassland may be the same in different years
C. When a population approaches its environmental capacity, the death rate increases and the birth rate remains unchanged
D. The environmental capacity of crucian carp and snakehead fish living in Weishan Lake is the same
University
University stomatology task example:
Which oral cancer ranks first in China?
A. Alveolar mucosal cancer
B. Buccal mucosal cancer
C. Lip cancer
D. Tongue cancer
University economics comprehensive task example:
Which of the following items should be included in GDP?
A. Government transfer payments
B. Purchase of a used car
C. Interest on loans and bonds paid by businesses
D. 10,000 yuan won from buying lottery tickets
Others
Computer grade examination (computer fundamentals) task example:
A worksheet contains so much data that the title in the first row is not always visible when scrolling. What is the fastest way to keep the title row always visible?
A. Set "Print Titles"
B. Freeze panes
C. Freeze the first row
D. Freeze the first column
Religion task example:
What is the political basis on which religion can adapt to a socialist society?
A. The establishment of the state power of the people's democratic dictatorship
B. The majority of believers support the socialist system, and their fundamental interests are consistent with those of the people of the whole country
C. The establishment of the leadership and ruling status of the Communist Party of China
D. Independence and self-governance in running churches
Evaluation model
GLM-335M/10B/130B: a pre-trained large language model developed by Tsinghua University that supports both Chinese and English. The researchers chose three sizes of the Chinese version of GLM, with 335M, 10B and 130B parameters respectively.
Zero-shot/Few-shot evaluation
Evaluation results under different subject categories
Analysis of results
1. In zero-shot evaluation (Tables 4 and 6), the accuracy of all pre-trained language models (without fine-tuning) with fewer than 10B parameters is lower than the random result (25%). The few-shot setting (Tables 5 and 7) helps improve model performance. However, GLM-130B performs better in zero-shot evaluation than in few-shot evaluation. The reason may be that GLM-130B used some instruction data in the pre-training stage, which already gives it better zero-shot learning capabilities.
2. Most of the fine-tuned Chinese large models only reach the level of the random result (25%), even on the primary school level test (Tables 6 and 7). This shows that knowledge at lower education levels is still a weakness of current Chinese large models.
3. In zero-shot evaluation, BELLE-7B-2M achieved the best results among the Chinese large models, but still had a 14.8% gap with GPT-3.5-turbo. The number of supervised fine-tuning instructions is also an important factor: BELLE-7B-2M, fine-tuned with two million instructions, outperforms BELLE-7B-0.2M, fine-tuned with two hundred thousand instructions (Table 4).
4. The few-shot setting does not improve performance in most cases (Tables 5 and 7 vs. Tables 4 and 6), especially for language models trained with instruction fine-tuning or reinforcement learning from human feedback. This suggests that instruction fine-tuning of a pre-trained language model can significantly improve its zero-shot learning ability, so that it does not need additional examples to understand the intent of an instruction or question.
Conclusion
The researchers proposed a new benchmark, M3KE, to evaluate the knowledge mastery of Chinese large models across multiple disciplines and educational stages. M3KE contains 71 tasks and 20,477 questions. The researchers found that all evaluated open-source Chinese large models lag significantly behind GPT-3.5. They hope that M3KE will help uncover knowledge gaps in Chinese large models and promote their further development.
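The 25% random result cited in the analysis follows directly from each question having four candidate answers. A minimal scoring sketch is shown below. The function names and the sample predictions are hypothetical and are not part of any M3KE tooling. They only illustrate how an accuracy figure would be compared against chance.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions whose predicted label matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set of questions")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def random_baseline(num_options: int = 4) -> float:
    """Expected accuracy of uniform random guessing among the options."""
    return 1.0 / num_options


# Made-up labels, used only to show the comparison.
gold_labels = ["A", "B", "D", "D"]
model_predictions = ["A", "B", "C", "D"]

score = accuracy(model_predictions, gold_labels)
beats_chance = score > random_baseline()
```

A model whose accuracy stays near `random_baseline()` is, on average, doing no better than guessing, which is how the analysis characterizes most of the fine-tuned Chinese large models.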