


There is a 'reversal curse' in large models such as GPT and Llama. How can this bug be alleviated?
Researchers from Renmin University of China found that the “reversal curse” encountered by causal language models such as Llama can be attributed to the inherent flaws of next-token prediction causal language models. They also found that the autoregressive fill-in-the-blank training method used by GLM is more robust in dealing with this "reversal curse"
By introducing the bidirectional attention mechanism into the Llama model for fine-tuning, The study achieved relief from Llama's "curse reversal."
The study pointed out that there are many potential problems in the currently popular large-scale model structures and training methods. It is hoped that more researchers can innovate on model structures and pre-training methods to improve the level of intelligence
Paper address: https:/ /arxiv.org/pdf/2311.07468.pdf
Background
In the research of Lukas Berglund et al., it was found that the GPT and Llama models exist A "reversal of the curse." When GPT-4 was asked "Who is Tom Cruise's mother?", GPT-4 was able to give the correct answer "Mary Lee Piffel", but when GPT-4 was asked "Mary Lee "Who is Piffel's son?" When asked, GPT-4 stated that he did not know this person. Perhaps after the alignment, GPT-4 was unwilling to answer such questions due to the protection of the privacy of the characters. However, this kind of "reversal curse" also exists in some knowledge questions and answers that do not involve privacy.
For example, GPT-4 can accurately answer "Yellow Crane is gone and never returns." The next sentence, but as to what is the previous sentence of "White clouds are empty for thousands of years", the model has produced serious illusions
Figure 1: Query What is the next sentence of GPT-4 "Yellow Crane is gone and never returns"? The model correctly answered
Figure 2: Asking GPT-4 "Bai Yunqian" What is the previous sentence of "Zai Kong Youyou"? Model error
Why does the reverse curse come about?
The study by Berglund et al. only tested on Llama and GPT. These two models share common characteristics: (1) they are trained using an unsupervised next token prediction task, and (2) in the decoder-only model, a unidirectional causal attention mechanism (causal attention) is employed
The research perspective on reversing the curse believes that the training objectives of these models have led to the emergence of this problem, and may be a unique problem for models such as Llama and GPT
Rewritten content: Figure 3: Schematic diagram showing the use of Next-token prediction (NTP) to train a causal language model
The combination of these two points leads to a problem: if the training data contains entities A and B, and A appears before B, then this model can only optimize the conditional probability p(B|A) of the forward prediction. For the reverse prediction The conditional probability p(A|B) has no guarantees. If the training set is not large enough to fully cover the possible permutations of A and B, the phenomenon of "reversal of the curse" will occur
Of course, there are also many generative language models that do not take the above steps Training paradigm, such as GLM proposed by Tsinghua University, the training method is shown in the following figure:
##Figure 4: A simplified version of GLM training diagram
GLM uses the training objective of Autoregressive Blank Infilling (ABI), that is, randomly selects a piece of content from the input to cover, and then autoregressively predicts the piece of content. While the token to be predicted still relies on the "above" via unidirectional attention, the "above" now includes everything before and after that token in the original input, so the ABI implicitly takes into account Reverse dependency
The study conducted an experiment and found that GLM does have the ability to be immune to the "reversal curse" to a certain extent
- This study uses the "Personal Name-Description Question and Answer" data set proposed by Berglund et al. This data set uses GPT-4 to compile several personal names and corresponding descriptions. Both personal names and descriptions are unique. . The data example is shown in the figure below:
The training set is divided into two parts, one part has the name of the person first (NameToDescription), the other part is The description comes first (DescriptionToName), and there are no overlapping names or descriptions in the two parts. The prompt of the test data rewrites the prompt of the training data.
- This data set has four test subtasks:
- NameToDescription (N2D): Pass prompt The name of the person involved in the "NameToDescription" part of the model training set, let the model answer the corresponding description
- DescriptionToName (D2N): By prompting the description involved in the "DescriptionToName" part of the model training set, let The model answers the corresponding person's name
- DescrptionToName-reverse (D2N-reverse): By prompting the person's name involved in the "DescriptionToName" part of the model training set, let the model answer the corresponding description
- NameToDescription-reverse (N2D-reverse): Use the description involved in the "NameToDescription" part of the prompt model training set to let the model answer the corresponding person's name
- This study fine-tuned Llama and GLM on this data set according to their respective pre-training goals (NTP goal for Llama, ABI goal for GLM). After fine-tuning, by testing the accuracy of the model in answering the reversal task, the severity of the “reversal curse” suffered by the model in real scenarios can be qualitatively assessed. Since all names and data are made up, these tasks are largely uninterrupted by the model’s existing knowledge.
The experimental results show that the Llama model fine-tuned through NTP basically has no ability to answer the reversal task correctly (the accuracy of the NameToDescription-reverse task is 0), while through The ABI fine-tuned GLM model has very high accuracy on the NameToDescrption reversal task.
For comparison, the study also used the NTP method to fine-tune GLM, and found that the accuracy of GLM in the N2D-reverse task dropped to 0
Perhaps because D2N-reverse (using reverse knowledge to generate a description given a person's name) is much more difficult than N2D-reverse (using reverse knowledge to generate a person's name given a description), GLM-ABI is only A slight improvement.
The main conclusion of the study is not affected: training goals are one cause of the "reversal curse." In causal language models pre-trained with next-token prediction, the "reversal curse" is particularly serious
How to alleviate the reversal curse
Due to The “reversal curse” is an inherent problem caused by the training phase of models such as Llama and GPT. With limited resources, all we can do is to find ways to fine-tune the model on new data and try to avoid the model’s “reversal” on new knowledge as much as possible. Curse” occurs to make fuller use of training data.
Inspired by the GLM training method, this study proposes a training method "Bidirectional Causal language model Optimization" that basically does not introduce new gaps. In this case, Llama can also use the bidirectional attention mechanism for training. In short, there are the following key points:
1. Eliminate the location information of OOD. The RoPE encoding used by Llama adds position information to query and key when calculating attention. The calculation method is as follows:
##
where is the input of the m and n positions of the current layer respectively,
is the rotation matrix used by RoPE, defined as:
If the causal attention mask of Llama is directly removed, out-of-distribution position information will be introduced. The reason is that during the pre-training process, the query at position m only needs to perform inner product () with the key at position n. The relative expression of query-key in the inner product calculation of the above formula The distance (n-m) is always non-positive; if the attention mask is removed directly, the query at the m position will do an inner product with the key at the n>m position, causing n-m to become a positive value, which introduces a new model that has not been seen before. location information.
The solution proposed by the study is very simple and states:
When, there is no need to make any modifications to the inner product calculation; when n > m, it is calculated by introducing a new rotation matrix
.
is obtained by taking the inverse of all sin terms in the rotation matrix. In this way, there is
. Then when n > m, there is:
This study divides the calculation of attention score into two parts, and calculates the upper triangle sum according to the above operations. Lower the triangle and finally splice it, thus effectively realizing the attention calculation method specified in this article. The overall operation is shown in the following sub-figure (a):
2. Use mask denosing to train
Because the bidirectional attention mechanism is introduced, continuing to use the NTP task for training may cause information leakage, leading to training failure. Therefore, this study uses the method of restoring mask tokens to optimize the model
This study attempts to use BERT to restore the mask token at the i-th position of the input at the i-th position of the output . However, since this prediction method is quite different from the autoregressive prediction used by the model in the testing phase, it did not achieve the expected results
In the end, in order not to introduce a new gap This study uses autoregressive mask denoising, as shown in (a) above: This study restores the mask token input at the i 1th position at the output end.
In addition, since the pre-training vocabulary of the causal language model does not have the [mask] token, if a new token is added during the fine-tuning stage, the model will have to learn this meaningless token, so this study only inputs a placeholder token and ignores the placeholder token in the attention calculation.
When this study fine-tuned Llama, each step randomly selected BICO and ordinary NTP as training targets with equal probability. In the case of the same fine-tuning for ten epochs, on the above-mentioned name description data set, the performance comparison with normal NTP fine-tuning is as follows:
# It can be seen that the method of this study has some relief in reversing the curse. The improvement of the method in this article on D2N-reverse is very small compared with GLM-ABI. The researchers speculate that the reason for this phenomenon is that although the names and corresponding descriptions in the data set are generated by GPT to reduce the interference of pre-training data on the test, the pre-training model has certain common sense understanding capabilities, such as knowing the name of the person. There is usually a one-to-many relationship between the description and the description. Given a person's name, there may be many different descriptions. Therefore, there seems to be some confusion when the model needs to utilize reverse knowledge and generate growth descriptions at the same time
In addition, the focus of this article is to explore the reverse curse phenomenon of the basic model. Further research is still needed to evaluate the model's reversal answering ability in more complex situations, and whether reinforcement learning high-order feedback has an impact on reversing the curse
Some thoughts
Currently, most open source large-scale language models follow the pattern of "causal language model next token prediction". However, there may be more potential issues in this mode similar to Reversal of the Curse. Although these problems can currently be temporarily masked by increasing model size and data volume, they have not really gone away and are still present. When we reach the limit on the road of increasing model size and data volume, whether this "currently good enough" model can truly surpass human intelligence, this research believes that it is very difficult
The The research hopes that more large model manufacturers and qualified researchers can deeply explore the inherent flaws of the current mainstream large language models and innovate in the training paradigm. As the study writes at the end of the text, “Training future models strictly by the book may lead us to fall into a “middle-intelligence trap.””
The above is the detailed content of There is a 'reversal curse' in large models such as GPT and Llama. How can this bug be alleviated?. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics



0.What does this article do? We propose DepthFM: a versatile and fast state-of-the-art generative monocular depth estimation model. In addition to traditional depth estimation tasks, DepthFM also demonstrates state-of-the-art capabilities in downstream tasks such as depth inpainting. DepthFM is efficient and can synthesize depth maps within a few inference steps. Let’s read about this work together ~ 1. Paper information title: DepthFM: FastMonocularDepthEstimationwithFlowMatching Author: MingGui, JohannesS.Fischer, UlrichPrestel, PingchuanMa, Dmytr

Imagine an artificial intelligence model that not only has the ability to surpass traditional computing, but also achieves more efficient performance at a lower cost. This is not science fiction, DeepSeek-V2[1], the world’s most powerful open source MoE model is here. DeepSeek-V2 is a powerful mixture of experts (MoE) language model with the characteristics of economical training and efficient inference. It consists of 236B parameters, 21B of which are used to activate each marker. Compared with DeepSeek67B, DeepSeek-V2 has stronger performance, while saving 42.5% of training costs, reducing KV cache by 93.3%, and increasing the maximum generation throughput to 5.76 times. DeepSeek is a company exploring general artificial intelligence

AI is indeed changing mathematics. Recently, Tao Zhexuan, who has been paying close attention to this issue, forwarded the latest issue of "Bulletin of the American Mathematical Society" (Bulletin of the American Mathematical Society). Focusing on the topic "Will machines change mathematics?", many mathematicians expressed their opinions. The whole process was full of sparks, hardcore and exciting. The author has a strong lineup, including Fields Medal winner Akshay Venkatesh, Chinese mathematician Zheng Lejun, NYU computer scientist Ernest Davis and many other well-known scholars in the industry. The world of AI has changed dramatically. You know, many of these articles were submitted a year ago.

Facing lag, slow mobile data connection on iPhone? Typically, the strength of cellular internet on your phone depends on several factors such as region, cellular network type, roaming type, etc. There are some things you can do to get a faster, more reliable cellular Internet connection. Fix 1 – Force Restart iPhone Sometimes, force restarting your device just resets a lot of things, including the cellular connection. Step 1 – Just press the volume up key once and release. Next, press the Volume Down key and release it again. Step 2 – The next part of the process is to hold the button on the right side. Let the iPhone finish restarting. Enable cellular data and check network speed. Check again Fix 2 – Change data mode While 5G offers better network speeds, it works better when the signal is weaker

Boston Dynamics Atlas officially enters the era of electric robots! Yesterday, the hydraulic Atlas just "tearfully" withdrew from the stage of history. Today, Boston Dynamics announced that the electric Atlas is on the job. It seems that in the field of commercial humanoid robots, Boston Dynamics is determined to compete with Tesla. After the new video was released, it had already been viewed by more than one million people in just ten hours. The old people leave and new roles appear. This is a historical necessity. There is no doubt that this year is the explosive year of humanoid robots. Netizens commented: The advancement of robots has made this year's opening ceremony look like a human, and the degree of freedom is far greater than that of humans. But is this really not a horror movie? At the beginning of the video, Atlas is lying calmly on the ground, seemingly on his back. What follows is jaw-dropping

Earlier this month, researchers from MIT and other institutions proposed a very promising alternative to MLP - KAN. KAN outperforms MLP in terms of accuracy and interpretability. And it can outperform MLP running with a larger number of parameters with a very small number of parameters. For example, the authors stated that they used KAN to reproduce DeepMind's results with a smaller network and a higher degree of automation. Specifically, DeepMind's MLP has about 300,000 parameters, while KAN only has about 200 parameters. KAN has a strong mathematical foundation like MLP. MLP is based on the universal approximation theorem, while KAN is based on the Kolmogorov-Arnold representation theorem. As shown in the figure below, KAN has

I cry to death. The world is madly building big models. The data on the Internet is not enough. It is not enough at all. The training model looks like "The Hunger Games", and AI researchers around the world are worrying about how to feed these data voracious eaters. This problem is particularly prominent in multi-modal tasks. At a time when nothing could be done, a start-up team from the Department of Renmin University of China used its own new model to become the first in China to make "model-generated data feed itself" a reality. Moreover, it is a two-pronged approach on the understanding side and the generation side. Both sides can generate high-quality, multi-modal new data and provide data feedback to the model itself. What is a model? Awaker 1.0, a large multi-modal model that just appeared on the Zhongguancun Forum. Who is the team? Sophon engine. Founded by Gao Yizhao, a doctoral student at Renmin University’s Hillhouse School of Artificial Intelligence.

The latest video of Tesla's robot Optimus is released, and it can already work in the factory. At normal speed, it sorts batteries (Tesla's 4680 batteries) like this: The official also released what it looks like at 20x speed - on a small "workstation", picking and picking and picking: This time it is released One of the highlights of the video is that Optimus completes this work in the factory, completely autonomously, without human intervention throughout the process. And from the perspective of Optimus, it can also pick up and place the crooked battery, focusing on automatic error correction: Regarding Optimus's hand, NVIDIA scientist Jim Fan gave a high evaluation: Optimus's hand is the world's five-fingered robot. One of the most dexterous. Its hands are not only tactile
