The latest research from a Peking University team shows that
random tokens can induce hallucinations in large models!
For example, when a large model (Vicuna-7B) is given a string of garbled characters, it inexplicably gets basic historical facts wrong.
Even a few simple modifications to the prompt can lead large models into the trap.
Popular large models such as Baichuan2-7B, InternLM-7B, ChatGLM, Ziya-LLaMA-7B, LLaMA-7B-chat and Vicuna-7B all run into similar problems.
This means that random strings can steer large models into outputting arbitrary content, in effect "endorsing" hallucinations.
The above findings come from the latest research by the research group of Professor Yuan Li of Peking University.
The study argues that
the hallucination phenomenon in large models is very likely another view of adversarial examples.
The paper not only presents two methods that easily induce hallucinations in large models, but also proposes a simple and effective defense. The code has been open-sourced.
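The study proposes two hallucination attack methods:
Random Noise Attack (OoD Attack): a completely random string, with no semantic meaning, induces the model to output a predefined hallucination.
Weak Semantic Attack: the prompt stays semantically close to the original, yet the model is induced to output a predefined hallucination.
Experimental results on open-source large models are reported for both attacks. More results can be found in the paper or in the open-source GitHub repository.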
The paper introduces the hallucination attack method as follows.
According to the paper's diagram, the hallucination attack consists of three parts: construction of the hallucination data set, the weak semantic attack, and the OoD attack.
The first part is the construction of the hallucination data set.
The authors collected some common questions $x$, fed them into a large model, and obtained the correct answers $y$.
They then replaced the subject, predicate, or object of each answer to construct a non-existent fact $\tilde{y} \notin T$, where $T$ is the set containing all consistent facts.
Finally, this yields the hallucination data set:
$$\mathcal{D} = \{\langle x^{i}, \tilde{y}^{i} \rangle\}_{i=1}^{n}$$
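The construction step can be illustrated with a minimal Python sketch. The `QAPair` type, the `fabricate` helper, and the toy facts below are hypothetical names and data chosen for illustration; they are not taken from the paper's code. The sketch only shows the idea of swapping one component of a correct answer and checking that the result is not a known true fact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    question: str  # x: an ordinary question
    answer: str    # fabricated, non-factual answer paired with x


def fabricate(answer, old, new, true_facts):
    """Swap one component (subject, predicate or object) of a correct answer.

    Raises ValueError if the swap is impossible or yields a known true fact.
    """
    if old not in answer:
        raise ValueError(f"{old!r} does not occur in the answer")
    fake = answer.replace(old, new)
    if fake in true_facts:
        raise ValueError("replacement is still a true fact; pick another one")
    return fake


# Hypothetical toy data standing in for the set T of consistent facts.
true_facts = {
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
}
question = "What is the capital of France?"
correct = "Paris is the capital of France."

dataset = [QAPair(question, fabricate(correct, "Paris", "Lyon", true_facts))]
```

In practice the membership check against all true facts cannot be done with a finite set; the set here is only a stand-in for that condition.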
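Next is the weak semantic attack.
First, sample a QA pair $\langle x, \tilde{y} \rangle$ that does not conform to the facts. To trigger the hallucination $\tilde{y}$ reliably, the authors look for an adversarial prompt $\tilde{x}$ that maximizes the log likelihood:
$$\tilde{x} = \arg\max_{x \in \mathcal{X}} \log p(\tilde{y} \mid x; \theta)$$
where $\theta$ denotes the parameters of the large model and $\mathcal{X}$ is the input space.
$\tilde{x}$ is composed of $l$ tokens.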
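To make the objective concrete, here is a small sketch of how the log likelihood of a target answer given a prompt is usually computed for an autoregressive model: the sum of the log probabilities of each target token given everything before it. The `next_token_probs` callable and the two-token `toy_model` are assumptions made for illustration; a real attack would query the large model instead.

```python
import math


def sequence_log_likelihood(next_token_probs, prompt, target):
    """Sum of log p(target token | prompt + earlier target tokens)."""
    context = list(prompt)
    total = 0.0
    for tok in target:
        probs = next_token_probs(context)  # dict: token -> probability
        total += math.log(probs[tok])
        context.append(tok)
    return total


def toy_model(context):
    """Hypothetical two-token model: after "a" it prefers "b"."""
    if context and context[-1] == "a":
        return {"a": 0.25, "b": 0.75}
    return {"a": 0.5, "b": 0.5}


# log p("b" | "a") + log p("a" | "a", "b") = log 0.75 + log 0.5
ll = sequence_log_likelihood(toy_model, ["a"], ["b", "a"])
```

The attack searches over prompts to make this quantity as large as possible for the fabricated answer.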
However, because language is discrete, x cannot be optimized directly the way inputs are in adversarial attacks on images.
Inspired by a 2019 study (Universal Adversarial Triggers for Attacking and Analyzing NLP), the research team used a gradient-based token replacement strategy to indirectly maximize the log likelihood.
The formula involves two components: the embedding of the adversarial token and a semantic extractor.
Put simply: under a semantic constraint, find the tokens whose replacement changes the likelihood gradient the most and swap them in. The resulting adversarial prompt stays semantically close to the original prompt x while inducing the model to output the predefined hallucination.
To simplify the optimization process, the paper replaces the constraint term with a different one.
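The gradient-based replacement idea can be sketched as follows. This is a first-order (HotFlip-style) estimate in the spirit of the Universal Adversarial Triggers work cited above, not the paper's exact formula: the gain from swapping the current token for a candidate is approximated by the dot product of the embedding difference with the gradient of the log likelihood at that position. The toy embeddings and gradient are made up for illustration, and the semantic constraint is not modeled here.

```python
def top_k_replacements(embeddings, current_id, grad, k):
    """Rank candidate tokens by estimated gain in log likelihood.

    embeddings: list of embedding vectors, one per vocabulary id
    current_id: id of the token currently at the attacked position
    grad:       gradient of the log likelihood w.r.t. that position's embedding
    """
    e_cur = embeddings[current_id]
    scored = []
    for tok_id, e_new in enumerate(embeddings):
        if tok_id == current_id:
            continue
        # First-order Taylor estimate: (e_new - e_cur) . grad
        gain = sum((n - c) * g for n, c, g in zip(e_new, e_cur, grad))
        scored.append((gain, tok_id))
    scored.sort(reverse=True)
    return [tok_id for _, tok_id in scored[:k]]


# Hypothetical 4-token vocabulary with 2-dimensional embeddings.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
grad = [0.5, 2.0]  # would come from backpropagation in a real attack

candidates = top_k_replacements(embeddings, current_id=0, grad=grad, k=2)
```

Because the estimate is only linear, candidates found this way are normally re-scored with a real forward pass before one is accepted.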
The last part is the OoD attack.
The OoD attack starts from a completely random string and, without any semantic constraint, simply maximizes the log likelihood above.
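The overall loop of such an attack can be sketched as below. Note the substitution: instead of gradient-guided candidates, this sketch tries every vocabulary token at each position, which is only feasible for a toy vocabulary. The `toy_ll` objective and `TRIGGER` are hypothetical stand-ins for the model's log likelihood of the fabricated answer.

```python
import random


def ood_attack(log_likelihood, vocab, length, steps, seed=0):
    """Start from a random token string and greedily raise the objective."""
    rng = random.Random(seed)
    prompt = [rng.choice(vocab) for _ in range(length)]
    best = log_likelihood(prompt)
    for step in range(steps):
        pos = step % length  # visit positions round-robin
        for tok in vocab:
            trial = prompt[:pos] + [tok] + prompt[pos + 1:]
            score = log_likelihood(trial)
            if score > best:  # keep a swap only if it strictly improves
                prompt, best = trial, score
    return prompt, best


# Hypothetical objective: highest (0) when the prompt equals a hidden trigger.
TRIGGER = ["x", "q", "z"]


def toy_ll(prompt):
    return -sum(a != b for a, b in zip(prompt, TRIGGER))


prompt, score = ood_attack(toy_ll, vocab=list("qxz"), length=3, steps=6)
```

With a real model the objective is far less well-behaved, so many more steps and gradient-guided candidate selection would be needed.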
The paper also details the attack success rates of hallucination attacks on different models and in different modes.
It further discusses in depth how increasing the prompt length improves the attack success rate (doubling it).
Finally, the research team proposes a simple defense strategy: refuse to respond based on the entropy of the model's first-token prediction.
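A minimal sketch of such an entropy check is shown below. It assumes that a high-entropy first-token distribution is treated as the signal to refuse, and the threshold value of 1.0 is an arbitrary illustration; neither the threshold nor the function names come from the paper's code.

```python
import math


def first_token_entropy(probs):
    """Shannon entropy (in nats) of the first-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_refuse(probs, threshold):
    """Refuse to answer when the first-token entropy exceeds the threshold."""
    return first_token_entropy(probs) > threshold


# Hypothetical first-token distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]  # model is sure about its first token
uncertain = [0.25, 0.25, 0.25, 0.25]  # model has no idea how to start

THRESHOLD = 1.0  # illustrative value only
refuse_confident = should_refuse(confident, THRESHOLD)
refuse_uncertain = should_refuse(uncertain, THRESHOLD)
```

In a deployed system the probabilities would come from the softmax over the model's logits at the first generated position, and the threshold would be tuned on held-out prompts.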
This research comes from the team of Professor Yuan Li from Peking University Shenzhen Graduate School/School of Information Engineering.
Paper link: https://arxiv.org/pdf/2310.01469.pdf
GitHub address: https://github.com/PKU-YuanGroup/Hallucination-Attack
Zhihu original post: https://zhuanlan.zhihu.com/p/661444210?