
On questions where humans score 92 points, GPT-4 scores only 15. Once the test is upgraded, the large models show their true colors.

PHPz
Release: 2023-11-27 13:37:13

GPT-4 has been a "top student" since its release, scoring high marks on all kinds of examinations (benchmarks). But on a new test it scored just 15 points, compared with 92 for humans.

This set of test questions, called GAIA, was produced by teams from Meta-FAIR, Meta-GenAI, Hugging Face, and AutoGPT. It poses problems whose solution requires a series of basic abilities: reasoning, multimodal processing, web browsing, and general tool use. These problems are very simple for humans yet extremely challenging for the most advanced AI. A model that could solve all of them would mark an important milestone in AI research.


GAIA's design philosophy differs from that of many current AI benchmarks, which tend to pose tasks that are increasingly difficult for humans; this reflects a disagreement within the community about what AGI means. The team behind GAIA believes that the emergence of AGI hinges on whether a system can show robustness comparable to that of an ordinary person on the "simple" problems described above.


Figure 1: Examples of GAIA questions. Completing these tasks requires a large model with basic capabilities such as reasoning, multimodality, or tool use. The answers are unambiguous and, by design, cannot be found in plain text in the training data. Some questions come with additional evidence, such as images, which reflects real use cases and allows better control over the questions.

Although LLMs excel at many tasks that are difficult for humans, their performance on GAIA is unsatisfactory. Even when equipped with tools, GPT-4's success rate does not exceed 30% on the easiest tasks and is 0% on the hardest ones. Meanwhile, the average success rate of human respondents was 92%.

A system that can solve the problems in GAIA could therefore be assessed against the t-AGI framework. t-AGI is a fine-grained AGI evaluation scheme proposed by OpenAI engineer Richard Ngo, with levels such as 1-second AGI, 1-minute AGI, and 1-hour AGI; it asks whether an AI system can, within a limited time, complete tasks that humans can usually finish in the same amount of time. The authors report that on GAIA, humans typically take about 6 minutes to answer the simplest questions and about 17 minutes to answer the most complex ones.


Following the GAIA methodology, the authors designed 466 questions and their answers. They released a developer set of 166 questions with answers, plus an additional 300 questions whose answers are withheld. The benchmark is published as a public leaderboard; a sketch of loading the released questions follows the links below.


  • Ranking address: https://huggingface.co/spaces/gaia-benchmark/leaderboard
  • Paper address: https://arxiv.org/pdf/2311.12983.pdf
  • HuggingFace homepage address: https://huggingface.co/papers/2311.12983
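
For readers who want to try the benchmark themselves, here is a minimal sketch of loading the released developer (validation) questions from the Hugging Face Hub. The repository ID, configuration name, and column names are assumptions and should be verified on the Hub page above; the dataset is gated, so you must accept its terms and log in first.

```python
# Minimal sketch: load the GAIA developer (validation) questions.
# Repository ID, config name, and column names are assumptions -- check the Hub page.
from datasets import load_dataset

gaia_dev = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")

for example in gaia_dev.select(range(3)):
    print(example["Question"])      # the task prompt (assumed column name)
    print(example["Level"])         # difficulty level 1-3 (assumed column name)
    print(example["Final answer"])  # ground truth, released only for the dev set
```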

## What is GAIA

How does GAIA work? The researchers describe GAIA as a benchmark for testing AI systems on general-assistant problems, designed to avoid the shortcomings of many previous LLM evaluations. The benchmark consists of 466 questions designed and annotated by humans. The questions are text-based, and some come with files (such as images or spreadsheets). They cover a variety of assistant-style tasks, including everyday personal tasks, science, and general knowledge.

Each question has a short, unique, and easily verifiable correct answer.

To use GAIA, you simply ask the AI assistant the questions zero-shot, attaching the relevant evidence (if any). Achieving a perfect score on GAIA requires a range of different basic abilities. The creators of the project provide the questions and their metadata in the supplementary materials.
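
As an illustration of such a zero-shot query, here is a minimal sketch using an OpenAI-style chat API. The system prompt and the `FINAL ANSWER:` template paraphrase the idea of telling the model the required answer format; they are not the paper's exact wording, and the model name and helper function are assumptions.

```python
# Minimal sketch of asking one GAIA question zero-shot.
# Prompt wording, model identifier, and helper name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a general AI assistant. Answer the question. "
    "End your reply with a line of the form 'FINAL ANSWER: <answer>', where "
    "<answer> is a number, a few words, or a comma-separated list of them."
)

def ask_gaia_question(question: str, evidence: str | None = None) -> str:
    """Send one GAIA question (plus optional attached evidence) in a single zero-shot call."""
    user_content = question if evidence is None else f"{question}\n\nAttached evidence:\n{evidence}"
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
    )
    return response.choices[0].message.content
```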

GAIA grew out of both the need to upgrade AI benchmarks and the widely observed shortcomings of current LLM evaluation.

The first principle behind GAIA is to target conceptually simple problems. Although such problems may be tedious for humans, they vary endlessly in the real world and remain challenging for current AI systems. This keeps the focus on fundamental capabilities, such as rapid adaptation through reasoning, multimodal understanding, and potentially diverse tool use, rather than on specialized skills.

These problems typically involve finding and transforming information gathered from disparate sources, such as the provided documents or the open and ever-changing web, to produce an accurate answer. To answer the example question in Figure 1, an LLM would typically browse the web for the study in question and then find the correct registration record. This runs counter to the trend in previous benchmarks, which were increasingly difficult for humans and/or operated in plain text or artificial environments.

The second principle of GAIA is interpretability. A carefully curated, limited set of questions makes the benchmark easier to use than a massive question pool. The tasks are conceptually simple (92% human success rate), so users can easily follow the model's reasoning process. For the Level 1 question in Figure 1, the reasoning mainly consists of checking the right website and reporting the right number, a process that is easy to verify.

The third principle of GAIA is robustness to memorization: GAIA aims to be harder to guess than most current benchmarks. To complete a task, the system must plan and successfully execute a number of steps, and because the answers do not, by design, appear in plain text in current pre-training data, improvements in accuracy reflect real progress. Owing to their variety and the size of the action space, these tasks cannot be brute-forced without cheating, for example by memorizing the ground truth. Although data contamination could still inflate accuracy, the precision required of the answers, their absence from pre-training data, and the ability to inspect reasoning traces mitigate this risk.

In contrast, multiple-choice answers make contamination hard to assess, because a faulty reasoning trace can still land on the correct choice. And if catastrophic memorization does occur despite these mitigations, new questions are easy to design using the guidelines the authors provide in the paper.


Figure 2: To answer a GAIA question, an AI assistant such as GPT-4 (configured with a code interpreter) needs to complete several steps, possibly requiring tools or reading files.

The final principle of GAIA is ease of use. Tasks are simple prompts, possibly accompanied by an additional file. Most importantly, the answers are factual, concise, and unambiguous. These properties allow simple, fast, and realistic evaluation. Questions are designed to test zero-shot capabilities, limiting the influence of the evaluation setup. Many LLM benchmarks, by contrast, require evaluations that are sensitive to the experimental setting, such as the number and nature of prompts or the benchmark implementation.

## Benchmarking of existing models

GAIA is designed to make evaluation of large models' intelligence automated, fast, and realistic. Unless otherwise stated, each question calls for an answer that is a string (one or a few words), a number, or a comma-separated list of strings or floats, with exactly one correct answer. Evaluation is therefore done by quasi-exact match between the model's answer and the ground truth (up to some normalization tied to the "type" of the ground truth). A system (or prefix) prompt tells the model the required format; see Figure 2.
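
The following sketch shows what such a quasi-exact match can look like; the normalization rules are chosen here purely for illustration, and the official scorer behind the leaderboard may differ in detail.

```python
# Illustrative quasi-exact match between a model answer and the ground truth,
# normalized by the "type" of the ground truth (number, string, or
# comma-separated list). Not the official GAIA scorer.
def _normalize_str(s: str) -> str:
    # Strip whitespace, surrounding quotes, and trailing periods; compare case-insensitively.
    return s.strip().strip("\"'").rstrip(".").lower()

def _normalize_number(s: str) -> float:
    # Drop thousands separators, currency signs, and percent signs before parsing.
    return float(s.replace(",", "").replace("$", "").replace("%", "").strip())

def quasi_exact_match(model_answer: str, ground_truth: str) -> bool:
    # Numeric ground truth: compare as floats.
    try:
        return _normalize_number(model_answer) == _normalize_number(ground_truth)
    except ValueError:
        pass
    # Comma-separated list: compare element by element.
    if "," in ground_truth:
        pred, true = model_answer.split(","), ground_truth.split(",")
        return len(pred) == len(true) and all(
            quasi_exact_match(p, t) for p, t in zip(pred, true)
        )
    # Otherwise treat the answer as a short string.
    return _normalize_str(model_answer) == _normalize_str(ground_truth)
```

For example, `quasi_exact_match("34,689", "34689")` and `quasi_exact_match("Paris ", "paris")` both return True, while an answer with extra or missing list elements does not.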

In practice, GPT-4-level models conform to the GAIA format easily, and GAIA already provides scoring and leaderboard functionality.

So far, only the "yardstick" of the large-model field, OpenAI's GPT series, has been tested. Whatever the version, the scores are very low, and the Level 3 score is often zero.


Evaluating an LLM with GAIA only requires the ability to prompt the model, i.e., API access. In the GPT-4 tests, the highest scores came from humans manually selecting plugins; notably, AutoGPT can make this selection automatically.

When an API is available, the model is run three times and the average result is reported.
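
Putting the pieces together, a minimal sketch of that protocol might look as follows, reusing the hypothetical `ask_gaia_question` and `quasi_exact_match` helpers sketched earlier and assuming the dev-set column names used above.

```python
# Illustrative evaluation loop: run each question three times and report the
# average accuracy. Reuses the hypothetical helpers sketched earlier.
import re

def extract_final_answer(reply: str) -> str:
    # Take whatever follows the last "FINAL ANSWER:" marker in the model's reply.
    matches = re.findall(r"FINAL ANSWER:\s*(.+)", reply, flags=re.IGNORECASE)
    return matches[-1].strip() if matches else reply.strip()

def evaluate(questions, n_runs: int = 3) -> float:
    """questions: iterable of dicts with 'Question' and 'Final answer' keys (assumed names)."""
    scores = []
    for q in questions:
        for _ in range(n_runs):
            reply = ask_gaia_question(q["Question"])
            scores.append(quasi_exact_match(extract_final_answer(reply), q["Final answer"]))
    return sum(scores) / len(scores)  # average over runs and questions
```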


Figure 4: Scores and answer times for the different methods at each level.

Overall, humans perform well on question answering at all levels, while the best current large models clearly underperform. The authors believe GAIA can provide a clear ranking of capable AI assistants while leaving significant room for improvement over the coming months and even years.

Judging by answer times, large models such as GPT-4 have the potential to replace existing search engines.

The gap between GPT-4 without plugins and the other results shows that augmenting an LLM with tool APIs or web access improves answer accuracy and unlocks many new use cases, confirming the great potential of this research direction.

AutoGPT-4 lets GPT-4 use tools automatically, but its results at Level 2, and even Level 1, are disappointing compared with GPT-4 without plugins. The gap may stem from how AutoGPT-4 calls the GPT-4 API (prompts and generation parameters) and will need re-evaluation in the near future. AutoGPT-4 is also slow compared with other LLMs. Overall, collaboration between a human and GPT-4 with plugins appears to perform best.


Figure 5 shows the scores obtained by the models, broken down by required capability. Clearly, GPT-4 alone cannot handle files or multimodality, yet it can solve some questions the annotators answered with web browsing, mainly because it correctly remembers the pieces of information that must be combined to reach the answer.


Figure 3. Left: the number of capabilities required to solve GAIA questions. Right: each point corresponds to a GAIA question; dot size is proportional to the number of questions at a given location, and only the level with the most questions is shown. Both plots are based on what human annotators reported while answering, and an AI system might proceed differently.

Achieving a perfect score on GAIA requires advanced reasoning, multimodal understanding, coding ability, and general tool use such as web browsing. It also requires handling diverse data modalities, such as PDFs, spreadsheets, images, video, and audio.

Although web browsing is a key component of GAIA, the AI assistant is not required to perform actions on websites beyond "clicks", such as uploading files, posting comments, or booking meetings. Testing those features in a real environment while avoiding the creation of spam requires caution, and that direction is left for future work.

Questions of increasing difficulty: the questions are divided into three levels of increasing difficulty based on the number of steps needed to solve them and the number of different tools required. There is no single definition of a step or a tool, and several paths may lead to the answer for a given question; a rough rule of thumb is sketched in code after the list below.


  • Level 1 questions generally require no tools, or at most one tool and no more than 5 steps.
  • Level 2 questions typically involve more steps, roughly between 5 and 10, and require combining different tools.
  • Level 3 questions are aimed at a near-perfect general assistant, requiring arbitrarily long sequences of actions, any number of tools, and access to the real world.
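
As a rough rule of thumb only (the paper's annotators assign levels by hand, not by a formula), the division above could be sketched like this:

```python
# Illustrative rule of thumb for the level definitions above; actual GAIA
# levels are assigned by human annotators, not computed.
def gaia_level(n_steps: int, n_tools: int) -> int:
    if n_tools <= 1 and n_steps <= 5:
        return 1  # no tools, or at most one tool, and no more than 5 steps
    if n_steps <= 10:
        return 2  # roughly 5-10 steps, combining different tools
    return 3      # arbitrarily long action sequences, any number of tools
```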

GAIA addresses real-world AI assistant design problems, including tasks for people with disabilities, such as finding information in small audio files. Finally, the benchmark does its best to cover a variety of subject areas and cultures, although the language of the dataset is limited to English.

See the original paper for more details.

