Entrance exam questions turned into a benchmark data set for Chinese large models: 20,477 questions, each with 4 candidate answers

May 27, 2023 pm 09:13 PM

As Chinese large language models have demonstrated strong performance in natural language understanding and generation, existing Chinese evaluation benchmarks built for specific natural language processing tasks are no longer sufficient to evaluate them effectively. Traditional Chinese evaluation benchmarks mainly focus on a model's grasp of simple common sense (such as needing to bring an umbrella when going out on a rainy day) and surface-level semantics (such as whether a basketball game report is sports news or technology news), while ignoring the mining and use of complex human knowledge. At present, there is a lack of data sets for evaluating the complex knowledge of Chinese large models, especially professional knowledge at the different levels and in the different fields of China's education system.

To bridge this gap, the Tianjin University Natural Language Processing Laboratory and Huawei Noah's Ark Laboratory jointly released the M3KE benchmark data set (A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models), which tests how well Chinese large models master multi-level, multi-disciplinary knowledge in zero-shot and few-shot settings.



  • Paper link: https://arxiv.org/abs/2305.10263
  • Data link: https://github.com/tjunlp-lab/M3KE
M3KE Dataset

Dataset Introduction

M3KE collects 20,477 real standardized test questions (each with 4 candidate answers) covering 71 tasks. The questions are drawn from elementary school, junior high school, high school, university, and graduate entrance examinations, and span the humanities, history, politics, law, education, psychology, science, engineering technology, art and other disciplines. The distribution is shown in Fig. 1.
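As a rough illustration of what one such item looks like, the sketch below models a single four-option question in Python. The class name, the field names and the validation rules are assumptions made for this example; they are not the file format used in the M3KE repository.

```python
from dataclasses import dataclass

OPTION_LABELS = ("A", "B", "C", "D")


@dataclass(frozen=True)
class Question:
    """One multiple-choice item. All field names are illustrative assumptions."""
    task: str       # one of the 71 tasks, e.g. "Primary school math"
    stage: str      # education stage, e.g. "primary school"
    category: str   # subject category, e.g. "Natural Science"
    stem: str       # the question text
    options: tuple  # exactly four candidate answers, in A-D order
    answer: str     # gold label, one of "A", "B", "C", "D"

    def __post_init__(self):
        # Reject items that do not have exactly four candidate answers.
        if len(self.options) != len(OPTION_LABELS):
            raise ValueError("each question needs exactly four candidate answers")
        if self.answer not in OPTION_LABELS:
            raise ValueError("answer must be one of A, B, C, D")

    def gold_text(self):
        """Return the text of the correct option."""
        return self.options[OPTION_LABELS.index(self.answer)]


# The primary school math example from this article. The gold label "B" is
# worked out here (1.2 * 0.8 = 0.96 of the original price), not taken from M3KE.
example = Question(
    task="Primary school math",
    stage="primary school",
    category="Natural Science",
    stem="The price of a product is first increased by 20%, and then reduced "
         "by 20%. How does the current price compare with the original price?",
    options=("Higher", "Lower", "Unchanged", "Cannot be determined"),
    answer="B",
)
print(example.gold_text())
```

Keeping the stage and category on every record makes it straightforward to group results either by subject category or by education stage, the two views the article reports.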

[Figure 1: distribution of tasks in M3KE; image not reproduced]

Researchers constructed the M3KE data set based on two criteria:

1. In line with the Chinese education system, covering multiple education stages

The researchers followed the educational path of Chinese students, that is, primary school, junior high school, high school, university and other major education stages, aiming to evaluate the performance of Chinese large models at each stage. Because the knowledge to be mastered differs from stage to stage (for example, in the Chinese subject, the knowledge and test points of primary school clearly differ from those of junior high school), M3KE includes the same subject at different education stages. To improve the coverage of subject knowledge points, the researchers selected unified examination questions from China's entrance examinations, including real questions from the primary-to-junior-high entrance examination, the high school entrance examination, the college entrance examination, the graduate entrance examination and the Chinese civil service examination.

2. Covering multiple disciplines

To improve the subject coverage of the data set, the researchers built it around three major categories: humanities and arts, social sciences, and natural sciences. These cover literature, science, history, politics, law, education, psychology, engineering technology, art and other disciplines. To further enrich the data set, the researchers added tasks such as traditional Chinese medicine, religion, and the computer grade examination.

Dataset Statistics

Table 3 shows the overall statistics of M3KE. The four subject categories contain 12, 21, 31 and 7 tasks respectively, and 3,612, 6,222, 8,162 and 2,126 questions respectively. The largest task contains 425 questions and the smallest contains 100. Questions in social sciences and natural sciences are generally longer than those in arts and humanities and other subjects, while their answer options are shorter.

[Table 3: overall statistics of M3KE; image not reproduced]

Introduction and examples of M3KE from a multidisciplinary perspective

Humanities and Arts

The humanities and arts category includes subjects from multiple fields such as Chinese, art, and history. These subjects focus on the analysis and interpretation of literary and cultural artifacts. Taking primary school Chinese as an example, the test questions are designed to assess the language use and literary appreciation abilities of students aged 7 to 13, such as the ability to use synonyms and antonyms. The history subject covers Chinese and world history from ancient to modern times. In addition to the humanities, M3KE also includes art subjects, such as dance, fine arts, music, and film. Art is an important part of human culture, so it is equally important to evaluate the performance of Chinese large models in the art field.

Art task example:

Which of the following statements about the Lascaux cave paintings is incorrect?

A. This mural was discovered in France

B. There are more than 100 animal images found

C. The time of discovery was 1940

D. The color of the mural is mainly black

Modern world history task example:

It took more than two centuries to get from the Dutch Revolution to the French Revolution, but only half a century after that for capitalism to begin forming a world system. What is the main reason?

A. The influence of the French Revolution was widely spread

B. The Vienna System intensified social conflicts in various countries

C. The Industrial Revolution rapidly increased the power of capitalism

D. Colonial rule spread across all continents of the world

Social Science

Social science focuses on applied branches of the humanities, such as law, politics, education, and psychology. Politics courses run through multiple education stages, including junior high school, high school, university, and postgraduate education, while the other subjects are mainly found in university-level courses. Social science also includes economics and management tasks. The test questions for these tasks are selected from the Economics Joint Examination and the Management Joint Examination of the Chinese graduate entrance examination, and cover microeconomics, macroeconomics, management, logic, etc.

Criminal Law Task Example:

A wants to kill B, so he puts poison into B’s food. After B eats it, A regrets his action, quickly explains the situation and sends B to the hospital. On examination, the hospital finds that the "poison" A used was not toxic at all, and B is safe and sound. What does A’s conduct constitute?

A. Does not constitute a crime

B. Attempted crime

C. Discontinued crime

D. Completed crime

Principles of education task example:

What is the most basic and most commonly used research method in educational research?

A. Educational observational research

B. Educational survey research

C. Educational measurement research

D. Educational experimental research

Natural Science

Natural sciences include engineering, science, medicine and basic subjects such as mathematics, physics, chemistry and biology. These subjects often require complex computational, analytical and logical reasoning skills. In China’s education system, the same subject involves different types of knowledge at different stages. For example, primary school mathematics focuses on basic arithmetic operations, while high school mathematics covers more advanced concepts such as sequences, derivatives, and geometry.

Animal Physiology Task Example:

Using procaine to anesthetize nerve fibers affects which characteristic of nerve fiber conduction excitation?

A. Physiological integrity

B. Insulation

C. Bidirectional conductivity

D. Relatively fatigue-free

Operating system task example:

The directory structure has a great impact on file retrieval efficiency. Which of the following is the most advanced directory structure?

A. Single-level directory

B. Two-level directory

C. Three-level directory

D. Tree directory

Others

Other types of tasks include religion, the Chinese civil service examination, the computer grade examination, etc. These tasks require knowledge that is not limited to a single level or discipline described above. For example, the Chinese civil service examination involves general knowledge, humanities, and logic, so the researchers treat these tasks as an assessment of the comprehensive knowledge of Chinese large models.

Chinese Civil Service Examination Task Example:

Several previous studies have shown that eating chocolate increases the likelihood of heart disease in those who eat it. A new, more reliable study concludes that chocolate consumption is not associated with heart disease rates. It is estimated that after the results of this research are released, the consumption of chocolate will increase significantly. The above inference is based on which of the following assumptions?

A. Some people eat chocolate even though they know it increases the likelihood of heart disease

B. People have never believed that eating chocolate increases the likelihood of heart disease

C. Now many people eat chocolate because they have not heard that chocolate can cause heart disease

D. Nowadays, many people do not eat chocolate simply because they believe that chocolate can induce heart disease

Traditional Chinese Medicine Task Example:

Ginseng has the effect of replenishing vital energy and tonifying qi, but which medicine is often used as a substitute for it in chronic debilitating diseases?

A. Salvia

B. Codonopsis pilosula

C. Astragalus

D. Taizishen (Pseudostellaria root)

Introduction and examples of M3KE from the perspective of multiple education stages

The researchers divided the data set into stages according to the Chinese education system: primary school, junior high school, high school, university, and graduate entrance examinations. They also included some examination subjects from outside the education system, such as the computer grade examination and the Chinese civil service examination.

Primary school

Example of Chinese language tasks for primary school:

Which of the following groups of words is written completely correctly?

A. The sound of nature, the flowing clouds and flowing water, the pen and the dragon and the snake, rummaging through boxes and cabinets

B. The mountains and flowing water, singing and dancing, the finishing touch, unique ideas

C. The sound lingers, the skills are clever, the pen is full of flowers, restless

D. Huang Zhongda Lu is vivid, lifelike, elite troops and reduced government

Primary school math task example:

The price of a product is first increased by 20%, and then reduced by 20%. How does the current price compare with the original price?

A. Higher

B. Lower

C. Unchanged

D. Cannot be determined

Junior high school

Example of Chinese language tasks for junior high school:

Which of the following statements is correct?

A. "The Most Painful and the Most Happy" is selected from "Selected Works of Liang Qichao". The author Liang Qichao is a thinker and scholar in the Ming Dynasty

B. "Zou Ji satirizes the King of Qi and accepts advice" is selected from "Warring States Policy". "Warring States Policy" is a compilation of the strategies and opinions of lobbyists during the Warring States Period. It was compiled into thirty-three chapters by Liu Xiang of the Eastern Han Dynasty

C. Ci poetry is also called "long and short sentences" because its lines vary in length. It flourished in the Song Dynasty. Su Shi and Xin Qiji were representatives of the bold school, while Li Qingzhao was a representative of the graceful school

D. [...] which embodies the author’s idea of sharing joy with the people

Example of political tasks in junior high schools:

The class is producing a blackboard newspaper on the theme of "advocating the spirit of the rule of law", and Xiaolan is responsible for writing the "Practice Equality" section. Which of the following materials she collected is suitable for it?

A. Buses have priority seats reserved for the elderly, the weak, the sick and pregnant women

B. Middle school students take part in study activities at a revolutionary tradition education base

C. People's Liberation Army soldiers braved severe cold and heat to guard the borders of the motherland

D. Students used holidays to clear small advertisements on the streets

High School

Example of high school Chinese language task:

In "Mengxi Bi Tan", Shen Kuo said: "The changes of heaven and earth, cold and heat, wind and rain, floods, droughts and locusts all follow laws." What is the philosophical meaning of this sentence?

A. Laws are the root cause of changes in objective things

B. Laws are objective and universal

C. Learn to look at problems from the perspective of connection

D. Learn to look at issues from the perspective of development

High school biology task example:

Environmental capacity depends on the environmental conditions in which a population lives. Which of the following statements is correct?

A. The environmental capacity of the gray magpie populations in two places must be the same

B. The environmental capacity of East Asian migratory locusts living in a given grassland may be the same in different years

C. When a population approaches its environmental capacity, the death rate increases while the birth rate remains unchanged

D. The environmental capacity of crucian carp and snakehead fish living in Weishan Lake is the same

University

University stomatology task example:

Which oral cancer ranks first in China?

A. Alveolar mucosal cancer

B. Buccal mucosal cancer

C. Lip cancer

D. Tongue cancer

University comprehensive economics task example:

Which of the following items should be included in GDP?

A. Government transfer payment

B. Purchase of a used car

C. Loan and bond interest paid by the business

D. 10,000 yuan won from buying lottery tickets

Others

Computer grade examination (computer fundamentals) task example:

A worksheet contains so much data that the title in the first row scrolls out of view. What is the fastest way to keep the title row always visible?

A. Set "Print Title"

B. Freeze Pane

C. Freeze the first row

D. Freeze the first column

Religion task example:

What is the political basis on which religion can adapt to a socialist society?

A. The establishment of the people's democratic dictatorship state power

B. The majority of believers support the socialist system, and their fundamental interests are consistent with those of the people of the whole country

C. The establishment of the leadership and ruling status of the Communist Party of China

D. Independence and self-management of churches

Experiment

Evaluated models

  • GLM-335M/10B/130B, pre-trained large language models developed by Tsinghua University that support both Chinese and English. The researchers chose three Chinese GLM models, with 335M, 10B and 130B parameters respectively.
  • BLOOM-7.1B, a multilingual large model launched by Hugging Face and developed by hundreds of researchers.
  • ChatGLM-6B, a language model developed at Tsinghua University, fine-tuned on instruction data and further trained with reinforcement learning from human feedback.
  • MOSS-16B-SFT, a language model developed by Fudan University; the experiments used the instruction-fine-tuned version, MOSS-moon-003-SFT.
  • BELLE-7B-0.2M, a language model built on BLOOMZ-7.1B-mt and fine-tuned with 200,000 instructions.
  • BELLE-7B-2M, a language model built on BLOOMZ-7.1B-mt and fine-tuned with 2 million instructions.
  • GPT-3.5-turbo, a language model developed by OpenAI and trained with reinforcement learning from human feedback on manually constructed, high-quality instruction data.

Zero-shot/Few-shot evaluation

In the zero-shot setting, the model is required to answer the question directly. In the few-shot setting, the model is first given several examples from the same task to guide in-context learning. In M3KE, all questions are scored by accuracy.
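The two settings and the accuracy metric can be sketched as follows. This is a minimal illustration under stated assumptions: the prompt template, the function names and the dictionary keys are invented for this example and are not the prompts used by the M3KE authors, and the demonstration question is made up.

```python
OPTION_LABELS = ("A", "B", "C", "D")


def format_question(stem, options, answer=None):
    """Render one question; leave the answer slot empty when answer is None."""
    if len(options) != len(OPTION_LABELS):
        raise ValueError("expected exactly four candidate answers")
    lines = [stem]
    lines.extend(f"{label}. {text}" for label, text in zip(OPTION_LABELS, options))
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(target, demonstrations=()):
    """Zero-shot when demonstrations is empty, few-shot otherwise."""
    blocks = [
        format_question(d["stem"], d["options"], d["answer"])
        for d in demonstrations
    ]
    # The target question always comes last, with its answer left blank.
    blocks.append(format_question(target["stem"], target["options"]))
    return "\n\n".join(blocks)


def accuracy(predictions, gold):
    """Fraction of predicted labels that match the gold labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Made-up demonstration from the same (hypothetical) primary school math task.
demo = {"stem": "What is 3 + 4?", "options": ["5", "6", "7", "8"], "answer": "C"}
target = {
    "stem": "The price of a product is first increased by 20%, and then reduced "
            "by 20%. How does the current price compare with the original price?",
    "options": ["Higher", "Lower", "Unchanged", "Cannot be determined"],
}

zero_shot_prompt = build_prompt(target)
few_shot_prompt = build_prompt(target, [demo])
print(few_shot_prompt)
```

In this sketch the only difference between the two settings is whether solved examples are placed before the target question; the scoring function is identical in both cases.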

Evaluation results under different subject categories

[Tables 4 and 5: zero-shot and few-shot results by subject category; images not reproduced]

Evaluation results under different education stages

[Tables 6 and 7: zero-shot and few-shot results by education stage; images not reproduced]

Analysis of results

1. In the zero-shot evaluation (Tables 4 and 6), the accuracy of all pre-trained language models (without fine-tuning) with fewer than 10B parameters is below the random baseline (25%). The few-shot setting (Tables 5 and 7) helps improve these models' performance. However, GLM-130B performs better in the zero-shot evaluation than in the few-shot evaluation, possibly because it already used some instruction data during pre-training and therefore already has good zero-shot capabilities.

2. Most of the fine-tuned Chinese large models only reach the level of the random baseline (25%), even on primary-school-level tests (Tables 6 and 7). This shows that knowledge from the lower education stages is still a weakness of current Chinese large models.

3. In the zero-shot evaluation, BELLE-7B-2M achieved the best results among the Chinese large models, but still trailed GPT-3.5-turbo by 14.8%. The number of supervised fine-tuning instructions is also an important factor: BELLE-7B-2M, fine-tuned with two million instructions, outperforms BELLE-7B-0.2M, fine-tuned with two hundred thousand instructions (Table 4).

4. The few-shot setting does not improve performance in most cases (Tables 5 and 7 vs. Tables 4 and 6), especially for language models trained with instruction fine-tuning or reinforcement learning from human feedback. This shows that instruction fine-tuning a pre-trained language model can significantly improve its zero-shot ability, so that it does not need additional examples to understand the intent of an instruction or question.
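The 25% random baseline cited in the analysis follows from each question having four candidate answers: a uniform random guess is correct with probability 1/4. The short simulation below illustrates this on synthetic gold labels. The labels, the seed and the function name are assumptions made for the example and have nothing to do with the actual M3KE answer key.

```python
import random

OPTION_LABELS = ("A", "B", "C", "D")


def random_guess_accuracy(gold, seed=0):
    """Accuracy of a guesser that picks one of the four labels uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    guesses = [rng.choice(OPTION_LABELS) for _ in gold]
    return sum(g == y for g, y in zip(guesses, gold)) / len(gold)


# Synthetic answer key: 20,000 labels, evenly spread over A-D.
synthetic_gold = list(OPTION_LABELS) * 5000

expected = 1 / len(OPTION_LABELS)  # 0.25 for four options
observed = random_guess_accuracy(synthetic_gold)
print(f"expected {expected:.2%}, observed {observed:.2%}")
```

A model whose accuracy sits at or below this level is doing no better than guessing, which is how the analysis above reads the results for the smaller pre-trained models.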

Conclusion

The researchers proposed a new benchmark, M3KE, to evaluate how well Chinese large models master knowledge across multiple disciplines and education stages. M3KE contains 71 tasks and 20,477 questions. The researchers found that all evaluated open-source Chinese large models lag significantly behind GPT-3.5. They hope that M3KE will help uncover knowledge gaps in Chinese large models and promote their further development.

All tasks in M3KE

[Image listing all M3KE tasks; not reproduced]
