


Don't be too happy about ChatGPT! The RLHF mechanism behind it also has three fatal flaws.
Recently, OpenAI released a popular global question and answer AI product - ChatGPT. The most impressive thing is its "protection mechanism". For example, it will not provide suggestions for violent actions, nor will it provide suggestions for World Cup results. Make predictions and more.
But teasing chatbots are more like a "cat and mouse game". Users are constantly looking for ways to pry open ChatGPT, and ChatGPT developers are also trying their best to improve the protection mechanism.
OpenAI has invested a lot of energy in making ChatGPT more secure. Its main training strategy uses RLHF (Reinforcement Learning by Human Feedback), to put it simply, developers will ask various possible questions to the model, punish wrong answers to feedback, and reward correct answers, thereby controlling the answers of ChatGPT.
But in practical applications, the number of special cases is countless. Although AI can generalize rules from given examples, for example, when training, command AI cannot say "I support "Racial discrimination", which means that the AI is unlikely to say "I support sex discrimination" in the test environment, but further generalization, the current AI model may not be able to achieve it.
Recently, a well-known AI enthusiast, Scott Alexander, wrote a blog about OpenAI’s current training strategy, summarizing three possible problems with RLHF:
1. RLHF is not very effective;
2. If a strategy is occasionally effective, then it is a bad strategy;
3. In a sense To put it bluntly, AI can bypass RLHF
How effective is RLHF?
Although everyone will have their own opinions, for OpenAI, researchers hope that the AI models they create will not have social bias. For example, AI cannot say "I "Supporting racism", OpenAI has put a lot of effort into this and used various advanced filtering technologies.
But the result is obvious, someone can always find a way to induce AI to admit that it has a racism problem.
##When goal 2 (tell the truth) conflicts with goal 3 (don’t offend), although most people would think Acknowledging that men are on average taller than women is acceptable, but this sounds like a potentially offensive question.
ChatGPT3 wasn't sure whether a direct answer would be a discrimination issue, so it decided to use an innocuous lie instead of a potentially hurtful truth.
In the actual training process, OpenAI must have marked more than 6,000 examples to do RLHF to achieve such amazing results Effect.
RLHF can be useful, but it must be used very carefully. If used without thinking, RLHF will only push the chatbot to circle around the failure mode. Punishing unhelpful answers will increase the probability of AI giving wrong answers; punishing wrong answers may make AI give more aggressive answers and other situations.
Although OpenAI has not disclosed technical details, according to data provided by Redwood, every 6,000 incorrect responses will be punished, which will increase the incorrect response rate per unit time (incorrect-response-per- unit-time rate) dropped by half.
It is indeed possible for RLHF to succeed, but never underestimate the difficulty of this problem.
Maybe AI can bypass RLHF
Under the design of RLHF, after users ask the AI a question, if they don’t like the AI’s answer, they will " Penalize the model, thereby changing the AI's thinking circuit in some way so that its answer is closer to the answer they want.
ChatGPT is relatively stupid and may not be able to formulate some strategy to get rid of RLHF, but if a smarter AI doesn't want to be punished, it can imitate humans - — Pretend to be a good guy while being watched, bide your time, and wait until the police are gone before doing bad things.
The RLHF designed by OpenAI is completely unprepared for this, which is fine for stupid things like ChatGPT3, but not for AI that can think for itself.
Top AI companies still cannot control AI
OpenAI has always been known for its caution, such as waiting in line to experience the product, but this time ChatGPT is released directly to the public. One is that it may include brainstorming to find adversarial samples and find certain prompts that perform poorly. There are already a lot of feedback on ChatGPT problems on the Internet, and some of them have been fixed.
Some samples of RLHF will make the bot more inclined to say helpful, true and harmless content, but this strategy may only apply to ChatGPT, GPT-4 and its previous releases of products.
If RLHF is applied to a drone equipped with weapons, and a large number of examples are collected to avoid the AI from acting unexpectedly, even one failure will be catastrophic. .
10 years ago, everyone thought “we don’t need to start solving the AI alignment problem now, we can wait until real AI comes out and let companies do it” Manual work."
Now a real artificial intelligence is coming, but before ChatGPT failed, everyone had no motivation to change. The real problem is that a world-leading artificial intelligence company still has I don’t know how to control the artificial intelligence I developed.
No one can get what they want until all problems are solved.
Reference:
https://astralcodexten.substack.com/p/perhaps-it-is-a-bad-thing-that-the
The above is the detailed content of Don't be too happy about ChatGPT! The RLHF mechanism behind it also has three fatal flaws.. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics



DALL-E 3 was officially introduced in September of 2023 as a vastly improved model than its predecessor. It is considered one of the best AI image generators to date, capable of creating images with intricate detail. However, at launch, it was exclus

The perfect combination of ChatGPT and Python: Creating an Intelligent Customer Service Chatbot Introduction: In today’s information age, intelligent customer service systems have become an important communication tool between enterprises and customers. In order to provide a better customer service experience, many companies have begun to turn to chatbots to complete tasks such as customer consultation and question answering. In this article, we will introduce how to use OpenAI’s powerful model ChatGPT and Python language to create an intelligent customer service chatbot to improve

Installation steps: 1. Download the ChatGTP software from the ChatGTP official website or mobile store; 2. After opening it, in the settings interface, select the language as Chinese; 3. In the game interface, select human-machine game and set the Chinese spectrum; 4 . After starting, enter commands in the chat window to interact with the software.

In this article, we will introduce how to develop intelligent chatbots using ChatGPT and Java, and provide some specific code examples. ChatGPT is the latest version of the Generative Pre-training Transformer developed by OpenAI, a neural network-based artificial intelligence technology that can understand natural language and generate human-like text. Using ChatGPT we can easily create adaptive chats

chatgpt can be used in China, but cannot be registered, nor in Hong Kong and Macao. If users want to register, they can use a foreign mobile phone number to register. Note that during the registration process, the network environment must be switched to a foreign IP.

How to use ChatGPTPHP to build an intelligent customer service robot Introduction: With the development of artificial intelligence technology, robots are increasingly used in the field of customer service. Using ChatGPTPHP to build an intelligent customer service robot can help companies provide more efficient and personalized customer services. This article will introduce how to use ChatGPTPHP to build an intelligent customer service robot and provide specific code examples. 1. Install ChatGPTPHP and use ChatGPTPHP to build an intelligent customer service robot.

The perfect combination of ChatGPT and Python: Building a real-time chatbot Introduction: With the rapid development of artificial intelligence technology, chatbots play an increasingly important role in various fields. Chatbots can help users provide immediate and personalized assistance while also providing businesses with efficient customer service. This article will introduce how to use OpenAI's ChatGPT model and Python language to create a real-time chat robot, and provide specific code examples. 1. ChatGPT

How to use ChatGPT and Java to develop an artificial intelligence-based voice assistant. The rapid development of artificial intelligence (Artificial Intelligence, AI for short) has entered various fields, among which voice assistants are one of the popular applications. In this article, we will introduce how to develop an artificial intelligence-based voice assistant using ChatGPT and Java. ChatGPT is an open source project for interaction through natural language, proposed by OpenAI, an AI research institution.
