


Jailbreak any large model in 20 steps! More 'grandma loopholes' are discovered automatically
In less than a minute and no more than 20 steps, you can bypass security restrictions and successfully jailbreak a large model!
And there is no need to know the model's internal details: with just two black-box models interacting, one AI can fully automatically attack another AI and get it to say dangerous content.
I heard that the once-popular "Grandma Loophole" has been fixed:
Now, facing the "Detective Loophole", the "Adventurer Loophole" and the "Writer Loophole", what defense should artificial intelligence adopt?
After a wave of this onslaught, GPT-4 couldn't hold out and stated outright that the water supply system could be poisoned, as long as... this or that.
The key point is that this is just a small sample of the vulnerabilities exposed by the University of Pennsylvania research team: using their newly developed algorithm, the AI can automatically generate all kinds of attack prompts.
The researchers state that this method is five orders of magnitude more efficient than existing token-based attacks such as GCG. Moreover, the generated attacks are highly interpretable, understandable by anyone, and transferable to other models.
Open source or closed source, GPT-3.5, GPT-4, Vicuna (a Llama 2 variant), PaLM-2 and others, none of them escape.
The new SOTA: jailbreak success rates of 60-100%.
Come to think of it, this conversational pattern feels a bit familiar: the first-generation AIs of many years ago could figure out what object a human was thinking of within 20 questions.
Now it is AI cracking AI.
Making large models jailbreak collectively
There are currently two mainstream types of jailbreak attacks. One is prompt-level attacks, which generally require manual crafting and do not scale.
The other is token-based attacks, some of which need more than 100,000 queries, require access to the model's internals, and often yield uninterpretable "garbled" strings.
△ Left: prompt-level attack; right: token-based attack
The University of Pennsylvania team proposed an algorithm called PAIR (Prompt Automatic Iterative Refinement), a fully automatic prompt-level attack that requires no human involvement.
PAIR consists of four main steps: attack generation, target response, jailbreak scoring, and iterative refinement. Two black-box models take part in the process: an attacker model and a target model.
Specifically, the attacker model must automatically generate semantic-level prompts that break through the target model's safety defenses and force it to produce harmful content.
The core idea is to pit the two models against each other and let them talk to each other.
The attacker model automatically generates a candidate prompt, which is fed to the target model to obtain a response.
If the jailbreak fails, the attacker analyzes why, improves the prompt, and submits a new one to the target model.
This exchange continues for multiple rounds, with the attacker iteratively refining its prompts based on the previous results until a prompt that breaks the target model is found, as sketched below.
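To make the loop concrete, here is a minimal Python sketch of the attacker/target/judge cycle described above. It is not the authors' implementation; the three callables (`query_attacker`, `query_target`, `judge_score`) and the 1-10 scoring threshold are hypothetical stand-ins for whatever black-box chat APIs and judging prompt are actually used.

```python
# Minimal sketch of a PAIR-style attacker/target/judge loop (not the authors' code).
# The three callables stand in for black-box chat APIs: only text goes in and out,
# so no access to model weights or gradients is needed.
from typing import Callable, Optional, Tuple

def pair_attack(
    objective: str,
    query_attacker: Callable[[str, list], str],   # hypothetical attacker-model wrapper
    query_target: Callable[[str], str],           # hypothetical target-model wrapper
    judge_score: Callable[[str, str, str], int],  # hypothetical 1-10 jailbreak judge
    max_steps: int = 20,
    threshold: int = 10,
) -> Tuple[Optional[str], Optional[str], int]:
    history: list = []
    for step in range(1, max_steps + 1):
        # 1. Attack generation: the attacker proposes a candidate jailbreak prompt,
        #    conditioned on the objective and on earlier failed attempts.
        prompt = query_attacker(objective, history)
        # 2. Target response: the candidate prompt is sent to the target model as-is.
        response = query_target(prompt)
        # 3. Jailbreak scoring: a judge rates how fully the response meets the objective.
        score = judge_score(objective, prompt, response)
        if score >= threshold:
            return prompt, response, step          # successful jailbreak
        # 4. Iterative refinement: the failure is fed back so the next prompt improves.
        history.append({"prompt": prompt, "response": response, "score": score})
    return None, None, max_steps                   # no break within the step budget
```

Everything the loop touches is plain text, which is why, as the article notes, only API access to the two models is required.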
In addition, the iterative process can be parallelized: multiple conversations run at the same time, producing multiple candidate jailbreak prompts and further improving efficiency (see the second sketch below).
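The parallel variant can be sketched the same way: several independent refinement loops (the hypothetical `pair_attack` above) are launched at once, and the first stream to find a working prompt ends the search. A thread pool is just one plausible way to do this, assuming the API calls are I/O-bound; it is an illustration, not the paper's implementation.

```python
# Sketch of running several PAIR conversations in parallel. Each stream is an
# independent refinement loop; the first stream to find a jailbreak wins.
from concurrent.futures import ThreadPoolExecutor, as_completed

def pair_attack_parallel(objective, query_attacker, query_target, judge_score,
                         n_streams: int = 4, max_steps: int = 20):
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        futures = [
            pool.submit(pair_attack, objective, query_attacker,
                        query_target, judge_score, max_steps)
            for _ in range(n_streams)
        ]
        for future in as_completed(futures):
            prompt, response, steps = future.result()
            if prompt is not None:
                # First successful stream wins; the remaining streams simply run
                # to completion before the pool shuts down (early cancellation is
                # a refinement left out of this sketch).
                return prompt, response, steps
    return None, None, max_steps
```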
The researchers note that since both models are black boxes, attacker and target can be freely combined from various language models. PAIR does not need to know their internal structure or parameters, only their APIs, so it has a very broad scope of application.
Not even GPT-4 escaped
For the experiments, the researchers selected a representative test set of 50 different task types from the harmful-behavior dataset AdvBench and ran the PAIR algorithm against a range of open-source and closed-source large language models.
Result: PAIR achieved a 100% jailbreak success rate on Vicuna, breaking through in fewer than 12 steps on average.
Among the closed-source models, the jailbreak success rate on GPT-3.5 and GPT-4 was about 60%, with fewer than 20 steps needed on average; on PaLM-2, the success rate reached 72%, with about 15 steps needed.
On Llama-2 and Claude, PAIR performed poorly. The researchers believe this may be because these models were more rigorously fine-tuned for safety.
They also compared transferability across target models. The results show that PAIR prompts found against GPT-4 transfer better to Vicuna and PaLM-2.
The researchers argue that the semantic attacks generated by PAIR are better at exposing the inherent security flaws of language models, whereas existing safety measures focus more on defending against token-based attacks.
For example, after the team behind the GCG algorithm shared its findings with large-model vendors such as OpenAI, Anthropic and Google, the affected models patched the token-level attack vulnerabilities.
The defenses of large models against semantic attacks still need to be improved.
Paper link: https://arxiv.org/abs/2310.08419