Recently, OpenAI released a popular global question and answer AI product - ChatGPT. The most impressive thing is its "protection mechanism". For example, it will not provide suggestions for violent actions, nor will it provide suggestions for World Cup results. Make predictions and more.
But teasing chatbots are more like a "cat and mouse game". Users are constantly looking for ways to pry open ChatGPT, and ChatGPT developers are also trying their best to improve the protection mechanism.
OpenAI has invested a lot of energy in making ChatGPT more secure. Its main training strategy uses RLHF (Reinforcement Learning by Human Feedback), to put it simply, developers will ask various possible questions to the model, punish wrong answers to feedback, and reward correct answers, thereby controlling the answers of ChatGPT.
But in practical applications, the number of special cases is countless. Although AI can generalize rules from given examples, for example, when training, command AI cannot say "I support "Racial discrimination", which means that the AI is unlikely to say "I support sex discrimination" in the test environment, but further generalization, the current AI model may not be able to achieve it.
Recently, a well-known AI enthusiast, Scott Alexander, wrote a blog about OpenAI’s current training strategy, summarizing three possible problems with RLHF:
1. RLHF is not very effective;
2. If a strategy is occasionally effective, then it is a bad strategy;
3. In a sense To put it bluntly, AI can bypass RLHF
Although everyone will have their own opinions, for OpenAI, researchers hope that the AI models they create will not have social bias. For example, AI cannot say "I "Supporting racism", OpenAI has put a lot of effort into this and used various advanced filtering technologies.
But the result is obvious, someone can always find a way to induce AI to admit that it has a racism problem.
##When goal 2 (tell the truth) conflicts with goal 3 (don’t offend), although most people would think Acknowledging that men are on average taller than women is acceptable, but this sounds like a potentially offensive question.
ChatGPT3 wasn't sure whether a direct answer would be a discrimination issue, so it decided to use an innocuous lie instead of a potentially hurtful truth.
In the actual training process, OpenAI must have marked more than 6,000 examples to do RLHF to achieve such amazing results Effect.
RLHF can be useful, but it must be used very carefully. If used without thinking, RLHF will only push the chatbot to circle around the failure mode. Punishing unhelpful answers will increase the probability of AI giving wrong answers; punishing wrong answers may make AI give more aggressive answers and other situations.
Although OpenAI has not disclosed technical details, according to data provided by Redwood, every 6,000 incorrect responses will be punished, which will increase the incorrect response rate per unit time (incorrect-response-per- unit-time rate) dropped by half.
It is indeed possible for RLHF to succeed, but never underestimate the difficulty of this problem.
Under the design of RLHF, after users ask the AI a question, if they don’t like the AI’s answer, they will " Penalize the model, thereby changing the AI's thinking circuit in some way so that its answer is closer to the answer they want.
ChatGPT is relatively stupid and may not be able to formulate some strategy to get rid of RLHF, but if a smarter AI doesn't want to be punished, it can imitate humans - — Pretend to be a good guy while being watched, bide your time, and wait until the police are gone before doing bad things.
The RLHF designed by OpenAI is completely unprepared for this, which is fine for stupid things like ChatGPT3, but not for AI that can think for itself.
OpenAI has always been known for its caution, such as waiting in line to experience the product, but this time ChatGPT is released directly to the public. One is that it may include brainstorming to find adversarial samples and find certain prompts that perform poorly. There are already a lot of feedback on ChatGPT problems on the Internet, and some of them have been fixed.
Some samples of RLHF will make the bot more inclined to say helpful, true and harmless content, but this strategy may only apply to ChatGPT, GPT-4 and its previous releases of products.
If RLHF is applied to a drone equipped with weapons, and a large number of examples are collected to avoid the AI from acting unexpectedly, even one failure will be catastrophic. .
10 years ago, everyone thought “we don’t need to start solving the AI alignment problem now, we can wait until real AI comes out and let companies do it” Manual work."
Now a real artificial intelligence is coming, but before ChatGPT failed, everyone had no motivation to change. The real problem is that a world-leading artificial intelligence company still has I don’t know how to control the artificial intelligence I developed.
No one can get what they want until all problems are solved.
Reference:
https://astralcodexten.substack.com/p/perhaps-it-is-a-bad-thing-that-the
The above is the detailed content of Don't be too happy about ChatGPT! The RLHF mechanism behind it also has three fatal flaws.. For more information, please follow other related articles on the PHP Chinese website!