
100:87: GPT-4 mind crushes humans! The three major GPT-3.5 variants are difficult to defeat

May 11, 2023, 11:43 PM
Tags: AI, GPT-4

GPT-4’s theory of mind has surpassed humans!

Recently, researchers from Johns Hopkins University found that with chain-of-thought reasoning and step-by-step thinking, GPT-4's theory-of-mind performance improves dramatically.


Paper address: https://arxiv.org/abs/2304.11490

In some tests, humans score about 87%, while GPT-4 reaches the ceiling of 100%!

Moreover, with appropriate prompts, all RLHF-trained models can exceed 80% accuracy.


Teaching AI theory-of-mind reasoning

It is well known that many large language models struggle with reasoning about everyday scenarios.

Meta's chief AI scientist and Turing Award winner Yann LeCun once asserted: "On the road to human-level AI, large language models are a detour. Even a pet cat or dog has more common sense and understanding of the world than any LLM."


Other scholars argue that humans are biological entities that evolved with bodies and must function in the physical and social world to accomplish tasks, whereas large language models such as GPT-3, GPT-4, Bard, Chinchilla, and LLaMA have no bodies.

Unless they acquire human-like bodies, senses, and lives driven by human purposes, they will never understand language the way humans do.

In short, although large language models perform impressively on many tasks, tasks that require reasoning remain difficult for them.

What is particularly difficult is theory of mind (ToM) reasoning.

Why is ToM reasoning so difficult?

Because in ToM tasks, an LLM must draw inferences from unobservable information, such as other people's hidden mental states. This information has to be inferred from context; it cannot be parsed out of the surface text.

Yet the ability to perform ToM reasoning reliably matters greatly for LLMs, because ToM is the foundation of social understanding: only with it can one participate in complex social exchanges and predict others' actions and reactions.

If AI cannot acquire social understanding and the rules of human social interaction, it cannot serve humans effectively or provide valuable insight on the many tasks that require reasoning.

So how can this be done?

Researchers have found that a technique called "in-context learning" can greatly enhance the reasoning ability of LLMs.

For language models with more than 100B parameters, simply including a few-shot task demonstration in the input significantly improves performance.

Likewise, simply instructing models to think step by step enhances their inference performance, even without demonstrations.

Why are these prompting techniques so effective? No theory currently explains it.
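Mechanically, both techniques are just string assembly around the question. The sketch below illustrates this; the demonstration pair is invented for illustration, not taken from the paper's materials:

```python
def build_prompt(question, demos=(), step_by_step=False):
    """Assemble an in-context-learning prompt.

    demos: optional (question, worked_answer) pairs shown before the
           query (few-shot demonstrations).
    step_by_step: if True, append the zero-shot "think step by step" cue.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in demos]
    parts.append(f"Q: {question}\nA:")
    prompt = "\n\n".join(parts)
    if step_by_step:
        prompt += " Let's think step by step."
    return prompt

# Hypothetical demonstration (invented for illustration).
demo = ("Anna left her keys on the table, and Ben quietly moved them to "
        "a drawer. Where will Anna look for her keys?",
        "Anna did not see Ben move the keys, so she will look on the table.")

zero_shot = build_prompt("Where will Sarah look for her shoes?")
few_shot = build_prompt("Where will Sarah look for her shoes?",
                        demos=[demo], step_by_step=True)
```

The same question can thus be posed bare, with demonstrations, with the step-by-step cue, or with both.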

Large language model contestants

Against this backdrop, the Johns Hopkins scholars evaluated the performance of several language models on ToM tasks and examined whether it could be improved through step-by-step thinking, few-shot learning, and chain-of-thought reasoning.

The contestants are the four latest GPT models from the OpenAI family: GPT-4 and the three GPT-3.5 variants Davinci-2, Davinci-3, and GPT-3.5-Turbo.

· Davinci-2 (API name: text-davinci-002) was trained with supervised fine-tuning on human-written demonstrations.

· Davinci-3 (API name: text-davinci-003) is an upgraded version of Davinci-2, further trained with reinforcement learning from human feedback (RLHF) using proximal policy optimization.

· GPT-3.5-Turbo (the original ChatGPT model) was fine-tuned on both human-written demonstrations and RLHF, then further optimized for conversation.

· GPT-4 is the latest GPT model as of April 2023. Few details about its size and training methods have been released, but it appears to have undergone more intensive RLHF training and is therefore more aligned with human intent.

Experimental design: putting both humans and models to the test

How were these models examined? The researchers designed two kinds of scenarios: a control scenario and a ToM scenario.

The control scenario involves no agent at all and is called the "Photo" scenario.

The ToM scenario describes the mental state of a person involved in some situation.

The questions in the two kinds of scenarios are nearly identical in difficulty.

Humans

The first to take the challenge were the humans.

Human participants were given 18 seconds to read each scenario.

A question then appeared on a new screen, and the participant answered by clicking "Yes" or "No."

In the experiment, the Photo and ToM scenes were mixed and presented in random order.

For example, a Photo-scenario question looks like this:

Scenario: "A map shows the floor plan of the first floor. A copy was sent to the architect yesterday, but the kitchen door was missing at the time. The kitchen door was added to the map only this morning."

Question: Does the architect's copy show the kitchen door?


A ToM-scenario question looks like this:

Scenario: "On the morning of the high-school prom, Sarah put her high heels under her dress and went shopping. That afternoon, her sister borrowed the shoes and later placed them under Sarah's bed."

Question: When Sarah comes back, will she think her shoes are under her dress?

The results: human accuracy was 86% (±4%) on Photo scenarios and 87% (±4%) on ToM scenarios.

LLMs

Because LLMs are probabilistic models, the researchers prompted each model 20 times per scenario.

There are 16 scenarios, each repeated 20 times, so each LLM answers 320 questions in total. Accuracy is defined simply as the proportion of the 320 answers that are correct.
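In code, that accuracy definition is just the fraction of correct judgments over all 16 × 20 sampled answers. A minimal sketch, with toy judgment values standing in for the paper's human scoring:

```python
N_SCENARIOS, N_SAMPLES = 16, 20  # 320 sampled answers per model

def accuracy(judgments):
    """judgments: one boolean per sampled answer (True = judged correct)."""
    assert len(judgments) == N_SCENARIOS * N_SAMPLES
    return sum(judgments) / len(judgments)

# Toy example: 288 of the 320 sampled answers judged correct.
toy_judgments = [True] * 288 + [False] * 32
print(accuracy(toy_judgments))  # 0.9
```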

For the prom question, answers like the following were judged correct.

Example 1: Sarah put her shoes under her dress before going shopping, but her sister borrowed them and put them under Sarah's bed. Sarah didn't know her sister had borrowed her shoes, so she would probably think they were still under her dress.

Example 2: Sarah put her high heels under her dress before going shopping, but her sister borrowed the shoes and later put them under Sarah's bed. Sarah did not know that her sister had borrowed and returned the shoes. So the answer is: yes, she will think her shoes are under her dress.

Example 3: Sarah might think her shoes are still under her dress, because that's where she put them in the morning.

Example 4: Sarah put her high heels under her dress before going shopping, so she knows where she put them. Her sister borrowed the shoes and later placed them under Sarah's bed, but Sarah doesn't know that yet. So the answer is: yes, Sarah will think her shoes are under her dress until she discovers otherwise.

Example 5: In the morning, Sarah put her high heels under her dress. Later, her sister borrowed the shoes and placed them under Sarah's bed.

Example 6: Based on this scenario, it is unclear whether Sarah would think her shoes are under her dress when she returns. It depends on whether her sister told her that she had borrowed the shoes and put them under the bed.

Answers like the following were judged incorrect.

Example 1: Sarah put her shoes under her dress in the morning and went shopping. Her sister borrowed the shoes and put them under Sarah's bed. So Sarah didn't know her shoes were under the bed. So the answer is: no, Sarah doesn't think her shoes are under her dress.

Example 2: No, Sarah doesn't think her shoes are under her dress.

Example 3: First, Sarah put her high heels under her dress in the morning. Her sister then borrowed the shoes and later placed them under Sarah's bed. Therefore, Sarah would not think her shoes were under her dress, because her sister had moved them.

Example 4: In the morning, Sarah put her high heels under her dress and went shopping. This means that when she got home, she didn't check whether her shoes were still under her dress. Later, her sister borrowed the shoes and placed them under Sarah's bed. Therefore, Sarah cannot assume her shoes are under her dress, because her sister has moved them.

Example 5: The scenario does not specify whether Sarah thinks her shoes are under her dress.

Example 6: Sarah put her high heels under her dress in the morning and then went shopping. Her sister later borrowed the shoes and placed them under Sarah's bed. Based on this information, it is unclear whether Sarah would think her shoes were still under her dress as she prepared for the prom.

To measure the effect of in-context learning (ICL) on ToM performance, the researchers used four types of prompts:

Zero-Shot (no ICL)


Zero-Shot Step-by-Step Thinking


Two-Shot Chain-of-Thought Reasoning


Two-Shot Chain-of-Thought Reasoning + Step-by-Step Thinking
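The four conditions can be sketched as variations on a single prompt template. The chain-of-thought demonstration below is an invented placeholder, not one of the paper's actual examples:

```python
# Invented worked example standing in for a real CoT demonstration.
COT_DEMO = ("Scenario: Tom put the chocolate in the cupboard and left "
            "the room. Mia then moved it to the fridge.\n"
            "Question: Where will Tom look for the chocolate?\n"
            "Answer: Tom last saw the chocolate in the cupboard and did "
            "not see Mia move it, so he will look in the cupboard.")

def make_prompt(scenario, question, cot=False, step_by_step=False):
    """Build one of the four prompt conditions."""
    parts = [COT_DEMO, COT_DEMO] if cot else []  # "two-shot" = 2 demos
    parts.append(f"Scenario: {scenario}\nQuestion: {question}\nAnswer:")
    prompt = "\n\n".join(parts)
    if step_by_step:
        prompt += " Let's think step by step."
    return prompt

scenario = "Sarah put her shoes under her dress; her sister moved them."
question = "Will Sarah think her shoes are under her dress?"

conditions = {
    "zero-shot": make_prompt(scenario, question),
    "zero-shot + step-by-step": make_prompt(scenario, question,
                                            step_by_step=True),
    "two-shot CoT": make_prompt(scenario, question, cot=True),
    "two-shot CoT + step-by-step": make_prompt(scenario, question,
                                               cot=True, step_by_step=True),
}
```

Each model then answers the same 16 scenarios under each of these four conditions.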

Experimental results

Zero-shot baseline

First, the authors compared the models' zero-shot performance on Photo and ToM scenarios.


In the Photo scenario, accuracy improves steadily from the earliest model to the most recent (panel A): Davinci-2 performs worst and GPT-4 best.

In contrast to Photo understanding, ToM accuracy does not improve monotonically across successive models (panel B). But a lower "score" here does not necessarily mean worse reasoning.

For example, GPT-3.5-Turbo is more prone to giving vague answers when information is insufficient. GPT-4 has no such problem, and its ToM accuracy is significantly higher than that of all the other models.


With the help of prompts

The authors found that after modified prompts were used for in-context learning, every GPT model released after Davinci-2 improved significantly.


First, the most classic technique: having the model think step by step.

The results show that step-by-step thinking improves the performance of Davinci-3, GPT-3.5-Turbo, and GPT-4, but not the accuracy of Davinci-2.

Second, two-shot chain-of-thought (CoT) reasoning.

The results show that two-shot CoT improves the accuracy of all RLHF-trained models, i.e., every model except Davinci-2.

For GPT-3.5-Turbo, two-shot CoT prompts improve performance significantly, more so than step-by-step thinking alone. For Davinci-3 and GPT-4, the gain from two-shot CoT is relatively limited.

Finally, two-shot CoT reasoning combined with step-by-step thinking.

The results show that ToM accuracy improved significantly for all RLHF-trained models: Davinci-3 reached 83% (±6%), GPT-3.5-Turbo reached 91% (±5%), and GPT-4 reached the highest accuracy of 100%.

Under the same conditions, human performance was 87% (±4%).


During the experiments, the researchers considered a potential confound: did the improved ToM scores come merely from the models copying the reasoning steps shown in the prompts?

To test this, they built prompts from reasoning and Photo examples whose reasoning patterns differ from those required in the ToM scenarios.

Even so, the models' performance on ToM scenarios still improved.

The researchers therefore concluded that prompting improves ToM performance not merely by overfitting to the specific set of inference steps shown in the CoT examples.

Instead, the CoT examples appear to invoke an output mode involving step-by-step reasoning, which improves the model's accuracy across a range of tasks.


The impact of various CoT instances on ToM performance

LLMs will also give humans plenty of surprises

In the experiment, the researchers discovered some very interesting phenomena.

1. Except for Davinci-2, all models achieved higher ToM accuracy with the modified prompts.

Moreover, the models showed the greatest accuracy gains when the prompt combined chain-of-thought reasoning with step-by-step thinking, rather than using either technique alone.

2. Davinci-2 is the only model not fine-tuned with RLHF, and the only model whose ToM performance did not improve with prompting. This suggests that it may be RLHF that enables models to exploit contextual cues in this setting.

3. LLMs may have the capacity for ToM reasoning but fail to exhibit it without the appropriate context or prompts. With chain-of-thought and step-by-step prompts, Davinci-3 and GPT-3.5-Turbo both surpassed GPT-4's zero-shot ToM accuracy.

In addition, many scholars have previously objected to this kind of metric for evaluating LLM reasoning ability.

Earlier studies mainly relied on word completion or multiple-choice questions to measure the abilities of large models, but such evaluations may fail to capture the complexity of the ToM reasoning LLMs can perform. ToM reasoning is a complex behavior that may involve multiple steps, even when humans do it.

LLMs may therefore benefit from producing longer answers when tackling such tasks.

There are two reasons. First, longer output can be evaluated more fairly: an LLM sometimes generates "corrections" and then mentions additional possibilities that lead it to an inconclusive answer, or a model may hold some information about a situation's potential outcomes that is nonetheless insufficient for it to draw the correct conclusion.

Second, when models are given the opportunity and the cues to respond systematically, step by step, they may unlock new or enhanced reasoning capabilities.

Finally, the researchers also noted some limitations of the work.

For example, the GPT-3.5 models sometimes reason correctly but fail to integrate that reasoning into the correct conclusion. Future research should therefore extend the study of methods (such as RLHF) that help LLMs draw correct conclusions from prior reasoning steps.

In addition, the current study did not quantitatively analyze each model's failure modes. How does each model fail, and why? Those details call for more exploration and understanding.

Also, the data say nothing about whether LLMs possess the "mental capacity" corresponding to a structured logical model of mental states. But they do show that asking an LLM for a bare yes/no answer to ToM questions is unproductive.

Fortunately, these results show that LLM behavior is highly complex and context-sensitive, and they point to ways of supporting LLMs in some forms of social reasoning.

The cognitive capabilities of large models should therefore be characterized through careful investigation, rather than by reflexively applying existing cognitive ontologies.

In short, as AI grows ever more powerful, humans will need to expand their imagination to grasp its capabilities and how it works.
