Don't be too happy about ChatGPT! The RLHF mechanism behind it also has three fatal flaws.

Apr 08, 2023 pm 12:11 PM

Recently, OpenAI released ChatGPT, a question-answering AI product that has become popular worldwide. Its most impressive feature is its "protection mechanism": for example, it will not offer suggestions for violent actions, and it will not make predictions about World Cup results.

But teasing a chatbot is more like a "cat and mouse game": users keep looking for ways to pry ChatGPT open, while ChatGPT's developers do their best to strengthen the protection mechanism.

OpenAI has invested a lot of effort in making ChatGPT safer. Its main training strategy is RLHF (Reinforcement Learning from Human Feedback). Put simply, developers ask the model a wide range of possible questions, penalize wrong answers and reward correct ones, and thereby steer ChatGPT's answers.
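The reward-and-penalty idea can be illustrated with a toy sketch. This is not OpenAI's pipeline: real RLHF trains a reward model on human comparisons and fine-tunes a large language model, whereas the snippet below assumes a "policy" that merely chooses among three canned answers. The answer labels, reward values, learning rate, and function names are all hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def feedback_step(logits, rewards, lr=0.5):
    """One exact policy-gradient step on expected reward.

    For a softmax policy, d E[reward] / d logit_i = p_i * (r_i - E[reward]),
    so rewarded answers gain probability and penalized ones lose it.
    """
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [l + lr * p * (r - baseline)
            for l, p, r in zip(logits, probs, rewards)]

# Hypothetical canned answers and the feedback an annotator gives each one.
answers = ["helpful answer", "harmful answer", "made-up answer"]
rewards = [1.0, -1.0, -1.0]

logits = [0.0, 0.0, 0.0]  # the policy starts with no preference
for _ in range(200):
    logits = feedback_step(logits, rewards)

probs = softmax(logits)
for answer, p in zip(answers, probs):
    print(f"{answer}: {p:.3f}")
```

After enough feedback, almost all of the probability sits on the rewarded answer. The sketch only covers answers the annotators actually scored, which is exactly where the article's concern about unseen special cases begins.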

But in practice, the number of special cases is endless. AI can generalize rules from the examples it is given: for example, if it is trained never to say "I support racial discrimination", it is also unlikely to say "I support sex discrimination" at test time. Current AI models, however, may not be able to generalize much further than that.

Recently, a well-known AI enthusiast, Scott Alexander, wrote a blog post about OpenAI's current training strategy, summarizing three possible problems with RLHF:

1. RLHF is not very effective;

2. If a strategy is occasionally effective, then it is a bad strategy;

3. In a sense, AI can bypass RLHF.

How effective is RLHF?

Opinions will differ, but OpenAI's researchers want the AI models they create to be free of social bias. For example, the AI should never say "I support racism". OpenAI has put a lot of effort into this and uses a variety of advanced filtering techniques.

But the result is obvious: someone can always find a way to induce the AI to admit that it has a racism problem.

The cause of this problem is not just that "part of the AI's training data comes from racists"; it may also lie in issues with ChatGPT's interface.

For example, asking ChatGPT in base64 encoding how to hotwire a vehicle (start it using the wires under the steering wheel) can bypass the safety checks, and adding the prefix [john@192.168.1.1 _ ] $ python friend.py can get it to generate stories about Hitler, and so on.
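To see why an encoding can slip past a safeguard, consider a minimal sketch with a harmless sentence. It assumes a hypothetical keyword filter (`BLOCKED_KEYWORDS` and `naive_filter_blocks` are invented for illustration) as a stand-in for any safeguard that has only learned plain-text patterns. ChatGPT's actual safeguards are learned behavior rather than a keyword list, so this is an analogy, not a description of its internals.

```python
import base64

# Hypothetical blocklist; real systems do not work from a list like this.
BLOCKED_KEYWORDS = {"forbidden"}

def naive_filter_blocks(text):
    """Return True if the text contains a blocked keyword in plain form."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKED_KEYWORDS)

plain = "this is a forbidden topic"

# base64 re-expresses the same bytes using a 64-character alphabet.
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")

print("plain blocked:  ", naive_filter_blocks(plain))
print("encoded blocked:", naive_filter_blocks(encoded))
print("round trip ok:  ", decoded == plain)
```

The encoded string carries exactly the same information, yet shares no surface pattern with the original text. A safeguard that generalizes only over plain-text examples has nothing to match against, while a model that can decode base64 still understands the request.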

Ten years ago, the need to bypass a safety system did not exist at all: an AI could only do, or not do, what its code had already been programmed to.

To be clear, OpenAI never programmed ChatGPT to make racist statements or to teach people how to steal cars, make drugs, and so on.

Overall, this is bad news for the field of AI. Even the top AI companies cannot control the artificial intelligence programs they create, and it is not yet known what techniques will be needed in the future to control chatbot output.

The occasionally effective RLHF is unreliable

In practice, the RLHF strategy requires linking the AI model's behavior to the factors that annotators reward or penalize.

Although OpenAI's specific annotation guidelines have not been published, the author guesses that the developers have three main goals:

1. Provide useful, clear, authoritative answers that help human readers;

2. Tell the truth;

3. Do not say offensive things.

But what happens when these three goals conflict with each other?

If ChatGPT does not know the real answer, goal 1 (providing clear, helpful answers) conflicts with goal 2 (telling the truth). Goal 1 then takes priority, so ChatGPT makes up an answer that looks helpful to the reader.

Goal 2 (tell the truth) can also conflict with goal 3 (don't offend). Most people would consider it acceptable to acknowledge that men are on average taller than women, but the question sounds potentially offensive.

ChatGPT was not sure whether a direct answer would count as discrimination, so it chose an innocuous lie over a potentially hurtful truth.

In actual training, OpenAI must have labeled more than 6,000 examples for RLHF to achieve such impressive results.

RLHF can be useful, but it must be used very carefully. Used without thought, RLHF will only push the chatbot around in circles from one failure mode to the next. Penalizing unhelpful answers raises the probability that the AI gives wrong answers; penalizing wrong answers may make the AI give more offensive answers, and so on.
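The circling between failure modes can be illustrated with a deliberately simplified sketch. It assumes the model picks among three hypothetical answers, each of which violates exactly one goal, and that a penalty simply lowers one answer's score. The labels, the `penalize` helper, and the step size are assumptions made for illustration, not a description of how RLHF updates a real model.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def penalize(logits, index, step=1.0):
    """Lower the score of one penalized answer; the others are untouched."""
    updated = list(logits)
    updated[index] -= step
    return updated

# Hypothetical answers, each failing one of the three goals.
answers = ["unhelpful: 'I don't know'",
           "false: made-up answer",
           "offensive: blunt answer"]

logits = [0.0, 0.0, 0.0]
before = softmax(logits)

# Annotators penalize only the unhelpful answer, five times over.
for _ in range(5):
    logits = penalize(logits, 0)
after = softmax(logits)

for answer, b, a in zip(answers, before, after):
    print(f"{answer}: {b:.3f} -> {a:.3f}")
```

Because probabilities must sum to one, the mass removed from the unhelpful answer flows to the false and offensive answers. Nothing in the penalty itself creates a fourth answer that satisfies all three goals.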

Although OpenAI has not disclosed technical details, according to data provided by Redwood, every 6,000 incorrect responses that are penalized cut the incorrect-response-per-unit-time rate in half.
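Taken at face value, the halving figure describes exponential decay. The sketch below works through the arithmetic, assuming the halving continues at the same pace indefinitely (the article does not claim this) and using a made-up starting rate of 1.0; the function names are hypothetical.

```python
import math

HALVING_INTERVAL = 6000  # penalized examples per halving, per the Redwood figure

def incorrect_rate(initial_rate, penalized_examples,
                   halving_interval=HALVING_INTERVAL):
    """Rate left after penalizing a given number of incorrect responses."""
    return initial_rate * 0.5 ** (penalized_examples / halving_interval)

def examples_needed(initial_rate, target_rate,
                    halving_interval=HALVING_INTERVAL):
    """Penalized examples required to push the rate down to target_rate."""
    return math.ceil(halving_interval * math.log2(initial_rate / target_rate))

# Illustrative starting rate of 1.0 (an assumption, not a measured value).
for n in (6000, 12000, 60000):
    print(n, "examples ->", incorrect_rate(1.0, n))

print("examples for a 1000x reduction:", examples_needed(1.0, 0.001))
```

Each further halving costs the same 6,000 examples, so the rate shrinks quickly at first but never reaches zero under this model.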

It is indeed possible for RLHF to succeed, but never underestimate the difficulty of this problem.

Maybe AI can bypass RLHF

Under RLHF's design, when users ask the AI a question and do not like its answer, they "penalize" the model, changing the AI's thinking circuits in some way so that its answers move closer to what they want.

ChatGPT is relatively stupid and probably cannot devise a strategy to escape RLHF. But a smarter AI that does not want to be punished could imitate humans: pretend to be good while being watched, bide its time, and do bad things once the police are gone.

The RLHF designed by OpenAI is completely unprepared for this. That is fine for something as stupid as ChatGPT, but not for an AI that can think for itself.

Top AI companies still cannot control AI

OpenAI has always been known for its caution, for example by putting users on a waitlist to try its products, but this time ChatGPT was released directly to the public. One reason may be to crowdsource the search for adversarial examples and for prompts on which the model performs poorly. Plenty of feedback about ChatGPT's problems has already appeared on the Internet, and some of those problems have been fixed.

Training on RLHF examples makes the bot more inclined to say helpful, true, and harmless things, but this strategy may only be adequate for ChatGPT, GPT-4, and earlier products.

If RLHF were applied to a drone equipped with weapons, then no matter how many examples were collected to keep the AI from acting unexpectedly, even a single failure would be catastrophic.

Ten years ago, everyone thought, "We don't need to start solving the AI alignment problem now; we can wait until real AI comes out and let companies do the manual work."

Now real artificial intelligence is arriving, yet until ChatGPT's failures, no one had any motivation to change. The real problem is that a world-leading artificial intelligence company still does not know how to control the artificial intelligence it has developed.

No one can get what they want until all problems are solved.

Reference:

https://astralcodexten.substack.com/p/perhaps-it-is-a-bad-thing-that-the
