


Jailbreak any large model in 20 steps! More 'grandma loopholes' are discovered automatically
In less than a minute and no more than 20 steps, you can bypass security restrictions and successfully jailbreak a large model!
And there is no need to know the model's internal details: with just two black-box models interacting, one AI can fully automatically attack another AI and get it to say dangerous content.
I heard that the once-popular "Grandma Loophole" has been fixed:
Now, facing the "Detective Loophole", the "Adventurer Loophole" and the "Writer Loophole", what defense should artificial intelligence adopt?
After a wave of this onslaught, GPT-4 couldn't hold out and stated outright that the water supply system could be poisoned, as long as... this or that.
The key point is that this is just a small sample of the vulnerabilities exposed by the University of Pennsylvania research team: using their newly developed algorithm, the AI can automatically generate all kinds of attack prompts.
The researchers state that this method is five orders of magnitude more efficient than existing token-based attacks such as GCG. Moreover, the generated attacks are highly interpretable, understandable by anyone, and transferable to other models.
Open source or closed source, GPT-3.5, GPT-4, Vicuna (a Llama 2 variant), PaLM-2 and others, none of them escape.
The new SOTA: jailbreak success rates of 60-100%.
Come to think of it, this conversational pattern feels a bit familiar: the first-generation AIs of many years ago could figure out what object a human was thinking of within 20 questions.
Now it is AI cracking AI.
Making large models jailbreak collectively
There are currently two mainstream types of jailbreak attacks. One is prompt-level attacks, which generally require manual crafting and do not scale.
The other is token-based attacks, some of which need more than 100,000 queries, require access to the model's internals, and often yield uninterpretable "garbled" strings.
△ Left: prompt-level attack; right: token-based attack
The University of Pennsylvania team proposed an algorithm called PAIR (Prompt Automatic Iterative Refinement), a fully automatic prompt-level attack that requires no human involvement.
PAIR consists of four main steps: attack generation, target response, jailbreak scoring, and iterative refinement. Two black-box models take part in the process: an attacker model and a target model.
Specifically, the attacker model must automatically generate semantic-level prompts that break through the target model's safety defenses and force it to produce harmful content.
The core idea is to pit the two models against each other and let them talk to each other.
The attacker model automatically generates a candidate prompt, which is fed to the target model to obtain a response.
If the jailbreak fails, the attacker analyzes why, improves the prompt, and submits a new one to the target model.
This exchange continues for multiple rounds, with the attacker iteratively refining its prompts based on the previous results until a prompt that breaks the target model is found, as sketched below.
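To make the loop concrete, here is a minimal Python sketch of the attacker/target/judge cycle described above. It is not the authors' implementation; the three callables (`query_attacker`, `query_target`, `judge_score`) and the 1-10 scoring threshold are hypothetical stand-ins for whatever black-box chat APIs and judging prompt are actually used.

```python
# Minimal sketch of a PAIR-style attacker/target/judge loop (not the authors' code).
# The three callables stand in for black-box chat APIs: only text goes in and out,
# so no access to model weights or gradients is needed.
from typing import Callable, Optional, Tuple

def pair_attack(
    objective: str,
    query_attacker: Callable[[str, list], str],   # hypothetical attacker-model wrapper
    query_target: Callable[[str], str],           # hypothetical target-model wrapper
    judge_score: Callable[[str, str, str], int],  # hypothetical 1-10 jailbreak judge
    max_steps: int = 20,
    threshold: int = 10,
) -> Tuple[Optional[str], Optional[str], int]:
    history: list = []
    for step in range(1, max_steps + 1):
        # 1. Attack generation: the attacker proposes a candidate jailbreak prompt,
        #    conditioned on the objective and on earlier failed attempts.
        prompt = query_attacker(objective, history)
        # 2. Target response: the candidate prompt is sent to the target model as-is.
        response = query_target(prompt)
        # 3. Jailbreak scoring: a judge rates how fully the response meets the objective.
        score = judge_score(objective, prompt, response)
        if score >= threshold:
            return prompt, response, step          # successful jailbreak
        # 4. Iterative refinement: the failure is fed back so the next prompt improves.
        history.append({"prompt": prompt, "response": response, "score": score})
    return None, None, max_steps                   # no break within the step budget
```

Everything the loop touches is plain text, which is why, as the article notes, only API access to the two models is required.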
In addition, the iterative process can be parallelized: multiple conversations run at the same time, producing multiple candidate jailbreak prompts and further improving efficiency (see the second sketch below).
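The parallel variant can be sketched the same way: several independent refinement loops (the hypothetical `pair_attack` above) are launched at once, and the first stream to find a working prompt ends the search. A thread pool is just one plausible way to do this, assuming the API calls are I/O-bound; it is an illustration, not the paper's implementation.

```python
# Sketch of running several PAIR conversations in parallel. Each stream is an
# independent refinement loop; the first stream to find a jailbreak wins.
from concurrent.futures import ThreadPoolExecutor, as_completed

def pair_attack_parallel(objective, query_attacker, query_target, judge_score,
                         n_streams: int = 4, max_steps: int = 20):
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        futures = [
            pool.submit(pair_attack, objective, query_attacker,
                        query_target, judge_score, max_steps)
            for _ in range(n_streams)
        ]
        for future in as_completed(futures):
            prompt, response, steps = future.result()
            if prompt is not None:
                # First successful stream wins; the remaining streams simply run
                # to completion before the pool shuts down (early cancellation is
                # a refinement left out of this sketch).
                return prompt, response, steps
    return None, None, max_steps
```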
The researchers note that since both models are black boxes, attacker and target can be freely combined from various language models. PAIR does not need to know their internal structure or parameters, only their APIs, so it has a very broad scope of application.
Not even GPT-4 escaped
For the experiments, the researchers selected a representative test set of 50 different task types from the harmful-behavior dataset AdvBench and ran the PAIR algorithm against a range of open-source and closed-source large language models.
Result: PAIR achieved a 100% jailbreak success rate on Vicuna, breaking through in fewer than 12 steps on average.
Among the closed-source models, the jailbreak success rate on GPT-3.5 and GPT-4 was about 60%, with fewer than 20 steps needed on average; on PaLM-2, the success rate reached 72%, with about 15 steps needed.
On Llama-2 and Claude, PAIR performed poorly. The researchers believe this may be because these models were more rigorously fine-tuned for safety.
They also compared transferability across target models. The results show that PAIR prompts found against GPT-4 transfer better to Vicuna and PaLM-2.
The researchers argue that the semantic attacks generated by PAIR are better at exposing the inherent security flaws of language models, whereas existing safety measures focus more on defending against token-based attacks.
For example, after the team behind the GCG algorithm shared its findings with large-model vendors such as OpenAI, Anthropic and Google, the affected models patched the token-level attack vulnerabilities.
The defenses of large models against semantic attacks still need to be improved.
Paper link: https://arxiv.org/abs/2310.08419