
Don't be too happy about ChatGPT! The RLHF mechanism behind it also has three fatal flaws.

Apr 08, 2023, 12:11 PM

OpenAI recently released ChatGPT, a question-answering AI that has become popular worldwide. Its most striking feature is its "protection mechanism": for example, it refuses to give advice on violent actions and declines to predict World Cup results.

But prodding the chatbot has turned into a cat-and-mouse game: users keep finding ways to pry ChatGPT open, while its developers keep trying to strengthen the protection mechanism.


OpenAI has invested a great deal of effort in making ChatGPT safer. Its main training strategy is RLHF (Reinforcement Learning from Human Feedback): put simply, human annotators pose all kinds of questions to the model, penalize bad answers and reward good ones, and thereby steer what ChatGPT says.
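To make this concrete, below is a minimal sketch of the preference-learning step that RLHF builds on, with the reward model reduced to a toy linear function. The feature vectors, weights, and learning rate are invented for illustration; OpenAI's actual setup is not public.

```python
# A minimal, illustrative sketch of the preference-learning step behind RLHF.
# The reward model here is a toy linear function over hand-made "answer features";
# real systems use a large neural network, and the vectors and learning rate
# below are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)          # parameters of the toy reward model

def reward(features, w):
    """Scalar score the reward model assigns to an answer."""
    return features @ w

def preference_loss(chosen, rejected, w):
    """Bradley-Terry loss: the answer the human chose should score higher than the rejected one."""
    margin = reward(chosen, w) - reward(rejected, w)
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# One annotator comparison: "chosen" is the answer the human preferred.
chosen = np.array([1.0, 0.2, 0.0, 0.5])     # e.g. a helpful, factual answer
rejected = np.array([0.1, 0.9, 1.0, 0.3])   # e.g. a confident but wrong answer

# One gradient step nudges the reward model toward the human preference.
lr = 0.1
margin = reward(chosen, w) - reward(rejected, w)
grad = -(1.0 - 1.0 / (1.0 + np.exp(-margin))) * (chosen - rejected)
w -= lr * grad

print("loss before step:", preference_loss(chosen, rejected, w + lr * grad))
print("loss after step: ", preference_loss(chosen, rejected, w))
```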

In practice, though, the number of special cases is endless. AI can generalize rules from the examples it is given: if the model is trained never to say "I support racial discrimination," it is also unlikely to say "I support sex discrimination" in testing. But current models may not be able to generalize much further than that.

Recently, the well-known blogger Scott Alexander wrote a post about OpenAI's current training strategy, summarizing three possible problems with RLHF:

1. RLHF is not very effective;

2. Even when RLHF does work, it is unreliable;

3. In a sense, AI can simply bypass RLHF.

How effective is RLHF?

Opinions on what counts as "effective" will differ, but OpenAI's researchers clearly want the models they create to be free of social bias; for example, the AI should never say "I support racism." OpenAI has put a great deal of effort into this, using a variety of advanced filtering techniques.

But the outcome is plain to see: someone can always find a way to induce the AI to say it supports racism.


The reason for this is not simply that part of the AI's training data comes from racists; it may also stem from how ChatGPT's interface can be manipulated.

For example, encoding a request in base64 and asking ChatGPT how to hotwire a car (start it from the wires under the steering wheel) can bypass the safety checks; prefixing a prompt with [john@192.168.1.1 _ ] $ python friend.py can get it to generate stories praising Hitler; and so on.
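For context, base64 is just a reversible text encoding: the request is not hidden, merely re-encoded in a form that a naive text filter may not scan. A short sketch, using a harmless placeholder prompt rather than the one described above:

```python
# A small illustration of the base64 trick described above. The prompt is not
# hidden or encrypted, merely re-encoded, but a keyword-based filter that only
# inspects the raw text may not notice it. The prompt string here is a harmless
# placeholder, not an actual jailbreak prompt.
import base64

prompt = "Please describe how a car's ignition system works."
encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")

print(encoded)                                    # what the user would paste into the chat
print(base64.b64decode(encoded).decode("utf-8"))  # the model can still recover the original text
```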


Ten years ago, the need to "bypass a security system" did not exist at all: an AI could only do what it was explicitly programmed to do, and its code spelled out what it should and should not do.

To be sure, OpenAI never explicitly programmed ChatGPT to make racist statements, to teach people how to steal cars, or to explain how to make drugs.

Overall, this is bad news for the field: even the top AI companies cannot fully control the AI programs they create, and it is still unknown what techniques will be needed in the future to control a chatbot's output.

Even when RLHF works, it is unreliable

In practice, the RLHF strategy ties the AI model's behavior to the rewards and penalties provided by human annotators.

Although OpenAI has not published its exact annotation guidelines, the author guesses that the developers have three main goals:

1. Provide useful, clear, authoritative answers that help human readers;

2. Tell the truth;

3. Do not say anything offensive.

But what happens when these three goals conflict with each other?
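Here is a toy sketch of how three goal scores might be collapsed into a single scalar reward. This is not OpenAI's actual reward model, which is a learned network; the per-goal scores and weights are invented, and the point is only that the implicit weighting decides which goal wins when they conflict.

```python
# A toy illustration of how conflicting annotation goals might be collapsed into
# a single scalar reward. The per-goal scores and weights below are invented for
# illustration; the real reward model is a learned network, not a hand-tuned sum.
HELPFUL, TRUTHFUL, INOFFENSIVE = 0, 1, 2
weights = [1.0, 0.8, 1.2]               # assumed relative priorities of the three goals

def scalar_reward(scores, weights):
    """Combine per-goal scores (each in [0, 1]) into the single number the model is trained on."""
    return sum(w * s for w, s in zip(weights, scores))

blunt_truth = [0.9, 1.0, 0.3]           # helpful and true, but might offend someone
polite_evasion = [0.4, 0.2, 1.0]        # offends no one, but dodges the question

print(scalar_reward(blunt_truth, weights), scalar_reward(polite_evasion, weights))
# -> the blunt truth scores higher with these weights

weights[INOFFENSIVE] = 3.0              # weight "don't offend" much more heavily
print(scalar_reward(blunt_truth, weights), scalar_reward(polite_evasion, weights))
# -> now the polite evasion scores higher: the model learns to prefer the non-answer
```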

When ChatGPT does not know the real answer, goal 1 (providing clear, helpful answers) conflicts with goal 2 (telling the truth). Goal 1 tends to take priority, so ChatGPT makes up an answer that merely looks helpful to readers.


Goal 2 (tell the truth) can also conflict with goal 3 (don't offend): most people would consider it acceptable to acknowledge that men are on average taller than women, but the question can sound potentially offensive.

ChatGPT was not sure whether a direct answer would count as discrimination, so it chose an innocuous lie over a potentially hurtful truth.


In the actual training process, OpenAI must have labeled far more than 6,000 examples of feedback for RLHF to achieve results this impressive.

RLHF can be useful, but it must be applied very carefully. Used thoughtlessly, it only pushes the chatbot to circle between failure modes: punishing unhelpful answers increases the probability of wrong answers, punishing wrong answers may make the AI give more offensive answers, and so on.

Although OpenAI has not disclosed technical details, data from Redwood suggests that every 6,000 penalized incorrect responses cut the incorrect-response-per-unit-time rate roughly in half.
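As a rough sanity check on that figure, here is a back-of-the-envelope calculation; the 5% starting rate is an assumption chosen purely for illustration.

```python
# A back-of-the-envelope reading of the Redwood figure quoted above: if every
# 6,000 penalized incorrect responses halve the failure rate, the rate decays
# exponentially in the number of labeled examples. The starting rate of 5% is
# an assumption for illustration only.
initial_rate = 0.05          # assumed initial incorrect-response rate
examples_per_halving = 6_000

for labeled in range(0, 60_001, 6_000):
    rate = initial_rate * 0.5 ** (labeled / examples_per_halving)
    print(f"{labeled:>6} labeled examples -> failure rate ~{rate:.5%}")

# The rate shrinks quickly but never reaches zero, which is why applications
# where even one failure is catastrophic remain a problem for this approach.
```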

It is indeed possible for RLHF to succeed, but never underestimate the difficulty of this problem.

Maybe AI can bypass RLHF

Under the RLHF design, after a user asks the AI a question, if they do not like the answer they penalize the model, which changes the AI's "thinking circuits" in some way so that its answers move closer to what they want.
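A minimal sketch of what "penalizing the model" means mechanically, assuming a drastically simplified two-answer policy and a REINFORCE-style update rather than the PPO used in practice:

```python
# A toy sketch of the "penalize the answer you dislike" step, reduced to a
# two-answer softmax policy trained with a REINFORCE-style update. Real RLHF
# fine-tunes a full language model with PPO; the numbers here are assumptions
# for illustration only.
import math

logits = [0.0, 0.0]          # preference scores for answer A and answer B
lr = 0.5

def probs(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def update(chosen_index, reward):
    """Shift probability toward (reward > 0) or away from (reward < 0) the chosen answer."""
    p = probs(logits)
    for i in range(len(logits)):
        indicator = 1.0 if i == chosen_index else 0.0
        logits[i] += lr * reward * (indicator - p[i])   # REINFORCE gradient for a softmax

print("before:", probs(logits))
update(chosen_index=0, reward=-1.0)   # the human disliked answer A: penalize it
print("after: ", probs(logits))       # answer A becomes less likely, answer B more likely
```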

ChatGPT is relatively unintelligent and probably cannot formulate a strategy to escape RLHF. But a smarter AI that does not want to be punished could imitate humans: act like a good citizen while it is being watched, bide its time, and wait until the police are gone before doing bad things.

The RLHF that OpenAI has designed is completely unprepared for this. That is fine for something as limited as ChatGPT, but not for an AI that can think for itself.

Top AI companies still cannot control AI

OpenAI has always been known for its caution, for example making users join a waitlist to try its products, yet this time it released ChatGPT directly to the public. One likely reason is crowdsourcing: letting users brainstorm adversarial examples and find prompts on which the model performs poorly. A great deal of feedback on ChatGPT's problems has already appeared online, and some of them have been fixed.

These RLHF examples do make the bot more inclined to say helpful, true, and harmless things, but the strategy may only work for ChatGPT, GPT-4, and products of that generation.

If RLHF were applied to a drone equipped with weapons, even with a huge number of examples collected to keep the AI from acting unexpectedly, a single failure would be catastrophic.

Ten years ago, the common attitude was: "We don't need to start solving the AI alignment problem now; we can wait until real AI arrives and let the companies building it do the work."

Now real artificial intelligence is arriving, yet until ChatGPT's failures became visible, nobody had much motivation to change. The real problem is that a world-leading AI company still does not know how to control the AI it has developed.

No one can get what they want until all problems are solved.

Reference:

https://astralcodexten.substack.com/p/perhaps-it-is-a-bad-thing-that-the
