This year, large language models (LLMs) have become the focus of much attention in artificial intelligence. LLMs have made significant progress on a variety of natural language processing (NLP) tasks, especially reasoning. However, their performance on complex reasoning tasks still leaves room for improvement.
Can an LLM tell that there are errors in its own reasoning? A recent study jointly conducted by the University of Cambridge and Google Research found that LLMs cannot reliably detect reasoning errors on their own, but they can correct them using the backtracking method proposed in the study.
The paper has caused some controversy, and some people have raised objections. For example, on Hacker News, one commenter argued that the paper's title is exaggerated and a bit clickbait. Others criticized the proposed method for correcting logical errors as being based on pattern matching rather than logical reasoning, which makes it prone to failure.
Huang et al., in the paper "Large language models cannot self-correct reasoning yet", point out that self-correction may be effective for improving the style and quality of model output, but there is little evidence that LLMs can identify and correct their own reasoning and logical errors without external feedback. For example, both Reflexion and RCI use ground-truth correctness as the signal to stop the self-correction loop.
The research team from the University of Cambridge and Google Research proposed a new idea: dividing the self-correction process into two stages, mistake finding and output correction.
The main contributions of this work include: BIG-Bench Mistake, a dataset of CoT trajectories annotated with the location of the first logical error; benchmark results showing that current state-of-the-art LLMs struggle to find such errors; and a backtracking method that uses the error location to correct the original output.
BIG-Bench Mistake contains 2186 CoT-style trajectories. Each trajectory was generated by PaLM 2-L-Unicorn and annotated with the location of the first logical error. Table 1 shows an example trajectory where the error occurs at step 4.
These trajectories come from 5 tasks in the BIG-Bench dataset: word sorting, tracking shuffled objects, logical deduction, multi-step arithmetic, and Dyck languages.
To answer the questions for each task, they used CoT prompting to call PaLM 2. To separate the CoT trajectories into clear steps, they used the method proposed in "ReAct: Synergizing reasoning and acting in language models", generating each step separately and using newlines as stop markers.
All trajectories were generated with temperature = 0, and the correctness of the answers was determined by exact match.
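To make the step-by-step generation procedure concrete, here is a minimal sketch. It assumes a placeholder `generate(prompt, stop, temperature)` function standing in for whatever LLM API is used (the paper used PaLM 2); all names and the stopping heuristic are illustrative, not the authors' exact implementation.

```python
# Minimal sketch of step-by-step CoT generation using newlines as stop markers.
# `generate` is a placeholder for an actual LLM API call.

def generate(prompt: str, stop: str, temperature: float = 0.0) -> str:
    """Placeholder for an LLM call; replace with a real client."""
    raise NotImplementedError

def generate_cot_trajectory(question: str, max_steps: int = 10) -> list[str]:
    """Generate a CoT trajectory one step at a time, using the newline
    character as the stop marker so each call returns exactly one step."""
    prompt = f"Question: {question}\nLet's think step by step.\n"
    steps = []
    for _ in range(max_steps):
        step = generate(prompt, stop="\n", temperature=0.0).strip()
        if not step:
            break
        steps.append(step)
        prompt += step + "\n"
        if step.lower().startswith("answer:"):  # crude stopping heuristic
            break
    return steps
```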
Table 4 reports the accuracies of GPT-4-Turbo, GPT-4, and GPT-3.5-Turbo on this new mistake-finding dataset.
Each question has one of two possible answers: either the trajectory contains no mistake, or, if there is a mistake, a value N indicates the step where the first error occurs.
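A hypothetical illustration of how one annotated trajectory might be represented is shown below; the actual field names in BIG-Bench Mistake may differ, so treat this purely as an example of the label format just described.

```python
# Hypothetical representation of a single annotated trajectory
# (field names are illustrative, not the dataset's actual schema).
example_trajectory = {
    "task": "multistep_arithmetic",
    "steps": [
        "Step 1: ...",
        "Step 2: ...",
        "Step 3: ...",
        "Step 4: ...",        # first logical error occurs here
    ],
    "mistake_index": 3,        # 0-based index of the first erroneous step (step 4); None if no mistake
    "answer_correct": False,
}
```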
All models were given the same 3-shot prompts. Three different prompting methods were used: direct trajectory-level prompting, direct step-level prompting, and CoT step-level prompting.
Related discussion
The results show that all three models struggle with this new mistake-finding dataset. GPT-4 performs best, but even it reaches only 52.87% overall accuracy with direct step-level prompting.
This illustrates the difficulty that current state-of-the-art LLMs have in finding errors, even in the simplest and clearest cases. In contrast, humans can find errors without specific expertise and with high consistency.
The researchers speculate that this inability to detect errors is the main reason LLMs cannot self-correct reasoning errors.
Comparison of prompting methods
The researchers found that, moving from the direct trajectory-level method to the direct step-level method to the CoT step-level method, accuracy on trajectories that contain no errors drops significantly. Figure 1 shows this trade-off.
The researchers believe the reason may lie in the number of model outputs. The three methods require generating increasingly complex outputs: direct trajectory-level prompting requires a single token, direct step-level prompting requires one token per step, and CoT step-level prompting requires multiple sentences per step. If each generation call has some probability of making an error, then the more calls per trajectory, the greater the chance that the model flags at least one spurious error.
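The argument can be illustrated with a simple back-of-envelope calculation; the per-call false-positive rate below is an arbitrary assumed value, not a figure from the paper.

```python
# If each generation call independently flags a spurious error with
# probability p, a mistake-free trajectory of n calls survives unflagged
# with probability (1 - p) ** n, so more calls per trajectory hurt accuracy
# on error-free trajectories. p = 0.05 here is an arbitrary assumption.
def prob_at_least_one_false_alarm(p: float, n_calls: int) -> float:
    return 1.0 - (1.0 - p) ** n_calls

for n in (1, 5, 10):
    print(n, round(prob_at_least_one_false_alarm(0.05, n), 3))
# -> 1 0.05, 5 0.226, 10 0.401
```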
Few-shot prompt design using error location as a proxy for correctness
The researchers explored whether these prompting methods can reliably determine the correctness of a trajectory rather than just the error location.
They computed an average F1 score based on whether the model correctly predicts whether the trajectory contains an error: if the model predicts that an error exists, the trajectory counts as a predicted "incorrect answer"; otherwise, it counts as a predicted "correct answer".
Using correct_ans and incorrect_ans as the positive labels and weighting by the number of occurrences of each label, the researchers computed average F1 scores; the results are shown in Table 5.
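A minimal sketch of this weighted F1 computation is given below using scikit-learn; the toy label arrays are made up for illustration and are not data from the paper.

```python
# Weighted average F1 over the two labels, as described above.
from sklearn.metrics import f1_score

# 1 = incorrect_ans (model predicts the trajectory contains a mistake),
# 0 = correct_ans  (model predicts the trajectory is mistake-free).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# average="weighted" computes F1 for each label and averages them,
# weighting by each label's number of true occurrences.
print(f1_score(y_true, y_pred, average="weighted"))
```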
This weighted F1 score shows that looking for errors through the prompt is a poor strategy for determining the correctness of the final answer.
Huang et al. pointed out that LLMs cannot self-correct logic errors without external feedback. However, in many real-world applications, no external feedback is available. In this study, the researchers adopted an alternative: a lightweight classifier trained on a small amount of data takes the place of external feedback. Similar to reward models in traditional reinforcement learning, this classifier can detect logical errors in a CoT trajectory before the trajectory is fed back to the generator model to improve the output. Multiple iterations can be performed to maximize the improvement.
The researchers propose a simple backtracking method that improves the model's output given the location of the logical error.
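Below is a minimal sketch of the backtracking idea as described, reusing the placeholder `generate` function from the earlier sketch. The regeneration temperature and prompting details are assumptions rather than the authors' exact settings: the steps before the flagged error are kept, the flagged step is re-sampled at a non-zero temperature, and the rest of the trajectory is then regenerated greedily.

```python
# Sketch of backtracking from a known error location (0-based index).
def backtrack(question: str, steps: list[str], mistake_index: int,
              max_steps: int = 10) -> list[str]:
    kept = steps[:mistake_index]                      # steps before the error
    prompt = f"Question: {question}\nLet's think step by step.\n"
    prompt += "".join(s + "\n" for s in kept)

    # Re-sample the erroneous step with temperature > 0 to obtain a new step.
    new_step = generate(prompt, stop="\n", temperature=1.0).strip()
    new_steps = kept + [new_step]
    prompt += new_step + "\n"

    # Continue the trajectory greedily from the regenerated step onward.
    while len(new_steps) < max_steps:
        step = generate(prompt, stop="\n", temperature=0.0).strip()
        if not step:
            break
        new_steps.append(step)
        prompt += step + "\n"
    return new_steps
```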
Compared with previous self-correction methods, this backtracking method has several advantages:
The researchers conducted experiments using the BIG-Bench Mistake dataset to explore whether the backtracking method can help LLM correct logic errors. Please see Table 6 for the experimental results
Δaccuracy✓ refers to the difference in accuracy, before and after backtracking, on the set of trajectories whose original answer was correct_ans. For trajectories whose original answer was incorrect_ans, the accuracy is recomputed after backtracking, giving Δaccuracy✗.
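The following sketch shows one way these two deltas could be computed from per-trajectory correctness flags before and after backtracking; this reflects my reading of the metric, not the authors' actual evaluation code.

```python
# Compute Δaccuracy✓ and Δaccuracy✗ from before/after correctness flags.
def delta_accuracies(before: list[bool], after: list[bool]) -> tuple[float, float]:
    was_correct = [a for b, a in zip(before, after) if b]
    was_incorrect = [a for b, a in zip(before, after) if not b]
    # Δaccuracy✓: change on trajectories that were originally correct
    # (their accuracy before backtracking is 1.0 by construction).
    delta_correct = sum(was_correct) / len(was_correct) - 1.0
    # Δaccuracy✗: change on trajectories that were originally incorrect
    # (their accuracy before backtracking is 0.0 by construction).
    delta_incorrect = sum(was_incorrect) / len(was_incorrect)
    return delta_correct, delta_incorrect
```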
These results show that the benefit of correcting incorrect_ans trajectories outweighs the damage from changing originally correct answers into incorrect ones. Furthermore, although the random baseline also shows gains, its gains are significantly smaller than when the true error locations are used. Note that in the random baseline, performance gains are more likely on tasks with fewer steps, since a randomly sampled location is more likely to hit the true error location.
To explore what level of reward-model accuracy is needed when gold labels are not available, they experimented with backtracking using a simulated reward model designed to produce labels at different accuracy levels. They use accuracy_RM to denote the accuracy of the simulated reward model at identifying the error location.
When accuracy_RM for a given reward model is X%, the error location from BIG-Bench Mistake is used X% of the time. For the remaining (100 − X)%, an error location is sampled at random. To mimic the behavior of a typical classifier, error locations are sampled so as to match the distribution of error locations in the dataset. The researchers also ensure that the sampled wrong location never matches the correct location. The results are shown in Figure 2.
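A minimal sketch of such a simulated reward model is shown below. For simplicity it samples wrong locations uniformly, whereas the paper samples them to match the dataset's distribution of error locations; the function name and interface are illustrative.

```python
# Simulated reward model: return the true error location with probability
# accuracy_rm, otherwise return a different, randomly chosen location.
import random

def simulated_reward_model(true_location: int, num_steps: int,
                           accuracy_rm: float) -> int:
    if random.random() < accuracy_rm:
        return true_location
    # Sample a wrong location, guaranteed not to equal the true one.
    candidates = [i for i in range(num_steps) if i != true_location]
    return random.choice(candidates)
```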
It can be observed that Δaccuracy begins to level off when accuracy_RM reaches about 65%. In fact, for most tasks, Δaccuracy✓ already exceeds Δaccuracy✗ when accuracy_RM is around 60-70%. This shows that while higher accuracy leads to better results, backtracking still works even without gold-standard error-location labels.