"Anyone who thinks that auto-regressive LLM is already approaching human-level AI, or that it simply needs to scale up to reach human-level AI, must read this. AR-LLM has very limited reasoning and planning capabilities , to solve this problem, it cannot be solved by making them larger and training with more data."
Turing Award winner Yann LeCun has long been a skeptic of LLMs and of the autoregressive model, the learning paradigm that the GPT series of LLMs relies on. He has publicly criticized autoregression and LLMs more than once, producing plenty of memorable lines, such as:
"In five years from now, no one in their right mind will Will use autoregressive models."
"Auto-Regressive Generative Models suck!"
"LLM has a very superficial understanding of the world."
What prompted LeCun to speak out again recently are two newly released papers:
"Can LLM really self-criticize (and iteratively improve) its solutions as the literature suggests? Two new papers from our group reason (https://arxiv. org/abs/2310.12397) and planning (https://arxiv.org/abs/2310.08118) missions to investigate (and challenge) these claims."
Clearly, the theme of these two papers, investigating the verification and self-critique capabilities of GPT-4, has resonated with many people.
The authors of the papers stated that they, too, believe LLMs are great "idea generators" (whether in language or code form), but that LLMs cannot guarantee their own planning/reasoning capabilities. They are therefore best used in an LLM-Modulo setup (with either a sound reasoner or a human expert in the loop). Self-critique requires verification, and verification is a form of reasoning (hence the authors' surprise at all the claims about LLMs' ability to self-critique).
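As a rough illustration of that LLM-Modulo pattern, the sketch below shows the loop under our own assumptions: `propose` and `verify` are hypothetical stand-ins, not an API from either paper. The LLM only generates candidates; a sound external verifier (or a human expert) decides what counts as correct.

```python
# A minimal sketch of the LLM-Modulo pattern under our own assumptions:
# `propose` and `verify` are hypothetical stand-ins, not an API from the papers.
from typing import Callable, Optional, Tuple

def llm_modulo(
    problem: str,
    propose: Callable[[str, str], str],              # LLM: (problem, feedback) -> candidate
    verify: Callable[[str, str], Tuple[bool, str]],  # sound verifier: -> (ok, critique)
    max_rounds: int = 5,
) -> Optional[str]:
    """Generate-and-test loop: back-prompt the LLM with the verifier's critique."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = propose(problem, feedback)
        ok, critique = verify(problem, candidate)
        if ok:
            return candidate   # only verified candidates are accepted
        feedback = critique    # the critique comes from the verifier, not the LLM
    return None                # no verified solution within the budget
```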
At the same time, there were also dissenting voices: "The reasoning ability of convolutional networks is even more limited, but that did not prevent AlphaZero from working. It is all about building a reasoning process and an (RL) feedback loop. I think model capabilities allow for extremely deep reasoning (e.g., research-level mathematics)."
To this, LeCun responded: "AlphaZero "really" performs planning. This is done via Monte Carlo tree search, using one convolutional network to propose good moves and another to evaluate positions. The time spent exploring the tree can be unbounded; that is all reasoning and planning."
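LeCun's point, in other words, is that the deliberation happens in the search loop, not in a single forward pass of a network. The sketch below is a rough illustration only (a simple beam-limited depth search, not AlphaZero's actual Monte Carlo tree search, and all names are hypothetical): a learned policy proposes moves, a learned value function scores positions, and the search can spend as much time as it likes combining them.

```python
# Toy sketch of the division of labor LeCun describes: a policy proposes
# candidate moves, a value function scores positions, and a search procedure
# spends extra compute combining them. AlphaZero itself uses Monte Carlo tree
# search; a small beam-limited depth search stands in for it here.
from typing import Callable, Hashable, List, Tuple

State = Hashable
Move = str

def plan_with_search(
    state: State,
    legal_moves: Callable[[State], List[Move]],
    apply_move: Callable[[State, Move], State],
    policy: Callable[[State], List[Tuple[Move, float]]],  # (move, prior) proposals
    value: Callable[[State], float],                       # position evaluation
    depth: int = 3,
    beam: int = 4,
) -> Tuple[List[Move], float]:
    """Return the best move sequence found within the search budget."""
    if depth == 0 or not legal_moves(state):
        return [], value(state)
    # Only explore the top-`beam` moves suggested by the policy.
    proposals = sorted(policy(state), key=lambda mp: mp[1], reverse=True)[:beam]
    if not proposals:
        return [], value(state)
    best_plan, best_score = [], float("-inf")
    for move, _prior in proposals:
        sub_plan, sub_score = plan_with_search(
            apply_move(state, move), legal_moves, apply_move,
            policy, value, depth - 1, beam)
        if sub_score > best_score:
            best_plan, best_score = [move] + sub_plan, sub_score
    return best_plan, best_score
```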
The debate over whether autoregressive LLMs have reasoning and planning capabilities may not be settled any time soon.
Next, we can take a look at what these two new papers talk about.
Paper 1: GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems
The first paper raises doubts about the self-critique ability of state-of-the-art LLMs, including GPT-4.
Paper address: https://arxiv.org/pdf/2310.12397.pdf
Next, let's take a look at the paper's introduction.
There has long been considerable disagreement about the reasoning capabilities of large language models (LLMs). Initially, researchers were optimistic that reasoning capabilities would emerge automatically as models scaled up; however, as more failures surfaced, that optimism cooled. More recently, the view that LLMs can self-critique and iteratively improve their own solutions has become widespread.
But is this really the case?
Researchers from Arizona State University examined the reasoning capabilities of LLM in a new study. Specifically, they focused on the effectiveness of iterative prompting in the graph coloring problem, one of the most famous NP-complete problems.
The study shows that (i) LLMs are not good at solving graph coloring instances, and (ii) LLMs are not good at verifying solutions and are therefore ineffective in an iterative, self-critique mode. These results cast doubt on the self-critique capabilities of state-of-the-art LLMs.
The paper gives several experimental results. For example, in direct mode, LLMs are very bad at solving graph coloring instances. The study also found that LLMs are not good at verifying solutions; worse still, the system can fail to recognize a correct coloring and end up with an incorrect one.
The figure below shows the evaluation setup for the graph coloring problem. In this setup, GPT-4 guesses colorings either on its own (direct mode) or in a self-critique mode; outside the self-critique loop there is an external, sound verifier.
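For comparison, this is roughly what such an external, sound verifier has to do: checking a candidate coloring takes a few lines of ordinary code, even though finding one is NP-complete. The data format below is our own assumption, not the paper's.

```python
# A sketch of a sound verifier for graph coloring: checking a candidate
# coloring is trivial code, even though finding one is NP-complete.
from typing import Dict, List, Tuple

def verify_coloring(
    edges: List[Tuple[int, int]],
    coloring: Dict[int, str],
    num_colors: int,
) -> Tuple[bool, List[str]]:
    """Return (is_valid, human-readable list of violations)."""
    violations = []
    if len(set(coloring.values())) > num_colors:
        violations.append(f"uses more than {num_colors} colors")
    for u, v in edges:
        if u not in coloring or v not in coloring:
            violations.append(f"edge ({u}, {v}) has an uncolored endpoint")
        elif coloring[u] == coloring[v]:
            violations.append(f"adjacent vertices {u} and {v} share color {coloring[u]}")
    return (len(violations) == 0, violations)

# Example: a triangle cannot be 2-colored.
print(verify_coloring([(0, 1), (1, 2), (0, 2)], {0: "red", 1: "blue", 2: "red"}, 2))
# (False, ['adjacent vertices 0 and 2 share color red'])
```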
The results show that GPT-4 guesses a correct coloring in fewer than 20% of cases, and, even more surprisingly, the self-critique mode (second column in the figure below) has the lowest accuracy. The paper also examines the related question of whether GPT-4 improves its solutions when an external, sound verifier provides provably correct critiques of the colorings it guesses. In that case, back-prompting does indeed improve performance.
Even when GPT-4 happens to guess a valid coloring, its self-critique may hallucinate violations that are not actually there.
Finally, the authors summarize their findings on the graph coloring problem.
Paper 2: Can Large Language Models Really Improve by Self-critiquing Their Own Plans?
In the paper "Can Large Language Models Really Improve by Self-critiquing Their Own Plans?", the research team explored the ability of LLM to self-verify/criticize in the context of planning.
This paper provides a systematic study of the ability of LLMs to critique their own outputs, particularly in the context of classical planning problems. While recent research has been optimistic about the self-critical potential of LLMs, especially in iterative settings, this study suggests a different perspective.
Paper address: https://arxiv.org/abs/2310.08118
Unexpectedly, the results show that self-critique degrades plan-generation performance, especially when compared with a system that uses an external, sound verifier: the LLM verifier produces a large number of false positives, compromising the reliability of the system.
The researchers' empirical evaluation on Blocksworld, a classic AI planning domain, shows that LLM self-critique is not effective on planning problems. The LLM verifier can generate a large number of false positives, which hurts the reliability of the whole system, especially in domains where the correctness of a plan is critical.
Interestingly, the nature of the feedback (binary or detailed feedback) has no significant impact on plan generation performance, suggesting that the core issue lies in the binary verification capabilities of LLM rather than the granularity of the feedback.
As shown in the figure below, the evaluation architecture of this study involves two LLMs: a generator LLM and a verifier LLM. For a given instance, the generator LLM is responsible for generating candidate plans, while the verifier LLM determines their correctness. If a plan is found to be incorrect, the verifier provides feedback explaining why, and that feedback is sent back to the generator LLM, prompting it to generate a new candidate plan. All experiments in this study used GPT-4 as the default LLM.
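To make the feedback-granularity comparison concrete, here is a hedged sketch (the function and message wording are ours, not the paper's) of how a verifier's output could be turned into back-prompts at different levels of detail.

```python
# A hypothetical helper (ours, not the paper's) converting verifier output
# into back-prompts of different granularity.
from typing import List

def build_backprompt(violations: List[str], level: str) -> str:
    """level is one of: 'none', 'binary', 'detailed'."""
    if level == "none":
        return "Please propose a plan for the problem above."
    if level == "binary" or not violations:
        return "Your previous plan was invalid. Please propose a new plan."
    # Detailed feedback: pass along the verifier's specific reasons.
    return ("Your previous plan was invalid for these reasons:\n- "
            + "\n- ".join(violations)
            + "\nPlease propose a corrected plan.")
```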
The study compares several plan-generation methods on Blocksworld, generating 100 random instances for the evaluation. To reliably assess the correctness of the final LLM plans, the study employs the external validator VAL.
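VAL itself is a full PDDL plan validator; as a much simpler stand-in, the sketch below (with our own state and action encoding, not VAL's input format) shows what sound validation means for Blocksworld: simulate the moves and check the goal.

```python
# A toy stand-in for an external plan validator: simulate a Blocksworld plan
# step by step and check whether the goal configuration is reached.
# The state/action encoding is our own simplification, not VAL's format.
from typing import Dict, List, Tuple

State = Dict[str, str]  # block -> what it rests on ("table" or another block)

def is_clear(state: State, block: str) -> bool:
    """A block is clear when no other block rests on top of it."""
    return all(support != block for support in state.values())

def validate_plan(init: State, plan: List[Tuple[str, str]], goal: State) -> Tuple[bool, str]:
    """Apply each (block, destination) move and report the first problem found."""
    state = dict(init)
    for step, (block, dest) in enumerate(plan, start=1):
        if not is_clear(state, block):
            return False, f"step {step}: {block} is not clear"
        if dest != "table" and (dest == block or not is_clear(state, dest)):
            return False, f"step {step}: cannot place {block} on {dest}"
        state[block] = dest
    for block, support in goal.items():
        if state.get(block) != support:
            return False, f"goal not met: {block} should be on {support}"
    return True, "plan is valid"

# Example (the Sussman anomaly): C starts on A; the goal is A on B on C.
init = {"A": "table", "B": "table", "C": "A"}
plan = [("C", "table"), ("B", "C"), ("A", "B")]
goal = {"A": "B", "B": "C"}
print(validate_plan(init, plan, goal))  # (True, 'plan is valid')
```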
As shown in Table 1, the LLM+LLM back-prompting method is only slightly better than the non-back-prompting method in terms of accuracy.
Out of 100 instances, the LLM verifier correctly judged 61 (61%).
The table below shows the performance of LLM when receiving different levels of feedback, including no feedback.