ChatGPT is the latest language model released by OpenAI and a significant improvement over its predecessor, GPT-3. Like many large language models, ChatGPT can generate text in different styles and for different purposes, but with better accuracy, narrative detail, and contextual coherence, and it is designed with a strong focus on interactivity.
OpenAI tunes ChatGPT with a combination of supervised learning and reinforcement learning, and it is the reinforcement learning component that makes ChatGPT unique. OpenAI uses a training method called Reinforcement Learning from Human Feedback (RLHF), which incorporates human feedback during training to minimize unhelpful, distorted, or biased output.
This article will analyze the limitations of GPT-3 and how they arise from its training process. It will then explain the principle behind RLHF and how ChatGPT uses RLHF to overcome GPT-3's existing problems, and finally explore the limitations of this approach.
"Consistency vs. Capability" can be Think of it as a more abstract analogy of "accuracy vs precision".
In machine learning, the capability of a model refers to its ability to perform a specific task or set of tasks. Capability is usually assessed by how well the model can optimize its objective function. For example, a model used to predict market prices might have an objective function that measures the accuracy of its predictions; such a model is considered highly capable if it can accurately predict how prices change over time.
Consistency, by contrast, focuses on what you actually want the model to do, not what it was trained to do. The question it raises is "does the objective function match our expectations?", i.e. to what extent the model's goals and behavior meet human expectations. Suppose you want to train a bird classifier to classify birds as "sparrows" or "robins", using log loss as the training objective, while the ultimate goal is high classification accuracy. The model may achieve a low log loss, i.e. be highly capable, yet have poor accuracy on the test set. This is an example of inconsistency: the model can optimize the training objective while remaining inconsistent with the final goal.
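As a concrete illustration (a minimal sketch with made-up numbers, not taken from the article), the snippet below compares two hypothetical bird classifiers: model A achieves a slightly lower log loss (better on the training objective) yet lower accuracy (worse on the goal we actually care about) than model B.

```python
import numpy as np

def log_loss(y_true, p_pred):
    # Binary cross-entropy: the training objective both models optimize.
    p = np.clip(p_pred, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def accuracy(y_true, p_pred):
    # The final goal: fraction of correctly classified birds.
    return np.mean((p_pred >= 0.5) == y_true)

# Hypothetical ground truth: 1 = robin, 0 = sparrow.
y = np.array([1, 1, 1, 0, 0, 0])

# Model A: confident and usually right, but badly wrong on one example.
p_a = np.array([0.9, 0.9, 0.05, 0.1, 0.1, 0.1])
# Model B: barely crosses the 0.5 threshold, but on the right side every time.
p_b = np.array([0.55, 0.55, 0.55, 0.45, 0.45, 0.45])

print("Model A: log loss %.3f, accuracy %.2f" % (log_loss(y, p_a), accuracy(y, p_a)))
print("Model B: log loss %.3f, accuracy %.2f" % (log_loss(y, p_b), accuracy(y, p_b)))
```

Model A wins on the training objective while model B wins on the goal that matters, which is exactly the capability/consistency gap described above.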
The original GPT-3 is not a consistent model. Large language models like GPT-3 are trained on huge amounts of text from the Internet and can generate human-like text, but they do not always produce output that matches human expectations. Their actual objective function is a probability distribution over sequences of words, used to predict the next word in a sequence.
In real applications, however, the purpose of these models is to perform some form of valuable cognitive work, and there is a significant gap between how these models are trained and how they are expected to be used. Although, mathematically speaking, computing statistical distributions over word sequences may be an efficient way to model language, humans generate language by choosing the text that best fits a given situation, drawing on background knowledge and common sense to do so. This becomes a problem when language models are used in applications that require a high degree of trust or reliability, such as conversational systems or intelligent personal assistants.
While these large models, trained on massive amounts of data, have become extremely powerful over the past few years, they often fall short of their potential when used in practice to make people's lives easier. Consistency problems in large language models typically show up as unhelpful answers that do not follow the user's instructions, hallucinated or made-up facts, outputs whose reasoning is hard to interpret, and biased or toxic content.
But where, specifically, does the consistency problem come from? Is the way language models are trained inherently prone to inconsistency?
Next-token prediction and masked language modeling are the core techniques used to train language models. In the first approach, the model is given a sequence of words as input and asked to predict the next word in the sequence. For example, if you provide the model with the input sentence:
"The cat sat on the"
it might predict the next word as "mat", "chair", or "floor", because these words have a high probability given the preceding context; the language model can in fact assign a likelihood to every possible next word given the preceding sequence.
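The snippet below is a minimal sketch of next-token prediction using the publicly available GPT-2 model from the Hugging Face transformers library as a stand-in (ChatGPT's own weights are not available): it takes the softmax over the model's logits for the last position and prints the most likely next words.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The cat sat on the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, sequence_length, vocab_size)

next_token_logits = logits[0, -1]        # scores for the word following "the"
probs = torch.softmax(next_token_logits, dim=-1)

# Print the five most likely continuations with their probabilities.
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item()):>10s}  p = {p.item():.3f}")
```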
Masked language modeling is a variant of next-token prediction in which some words in the input sentence are replaced with a special token such as [MASK]. The model is then asked to predict the correct word that should fill the masked position. For example, if you give the model the sentence:
"The [MASK] sat on the "
it may predict that the word filling the [MASK] position should be "cat" or "dog".
One advantage of these objective functions is that they allow the model to learn the statistical structure of language, such as common word sequences and patterns of word usage. This generally helps the model generate more natural and fluent text, and it is an essential step in the pre-training phase of every language model.
However, these objective functions can also cause problems, mainly because the model cannot distinguish important errors from unimportant ones. A very simple example: if you feed the model the sentence:
"The Roman Empire [MASK] with the reign of Augustus."
it may predict that the [MASK] position should be filled with "began" or "ended", because both words have a high probability of occurring there.
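As a hedged illustration, the sketch below uses an off-the-shelf masked language model (BERT via the Hugging Face fill-mask pipeline; a stand-in, not the model discussed in this article) to score candidate fills for the masked position. A purely statistical objective can rank both "began" and "ended" highly even though they reverse the meaning of the sentence.

```python
from transformers import pipeline

# Fill-mask pipeline with BERT, a standard masked language model.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Each candidate is a dict containing the predicted token and its score.
for candidate in unmasker("The Roman Empire [MASK] with the reign of Augustus."):
    print(f"{candidate['token_str']:>10s}  score = {candidate['score']:.3f}")
```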
In general, these training strategies can lead to inconsistent behavior on more complex tasks: a model trained only to predict the next word in a text sequence does not necessarily learn higher-level representations of its meaning. As a result, such a model struggles to generalize to tasks that require a deeper understanding of language.
Researchers are studying various ways to solve the consistency problem in large language models. ChatGPT is based on the original GPT-3 model, but it was further trained with human feedback to guide the learning process and address the model's inconsistencies. The specific technique used is the aforementioned RLHF, and ChatGPT is among the first models to put this technique to use in a real-world product.
So how does ChatGPT use human feedback to solve the consistency problem?
The method consists of three main steps: supervised fine-tuning of a pretrained language model, training a reward model, and fine-tuning the policy with reinforcement learning (PPO).
Step 1 is performed only once, while steps 2 and 3 can be repeated continuously: more comparison data is collected on the current best policy model, used to train a new reward model, and then a new policy is trained. The details of each step are described below.
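Before going into the details, here is a Python-style outline of the overall loop; every function below is a placeholder with its body elided, standing in for a full training procedure rather than a real API.

```python
# Placeholder outline of the RLHF training loop described above.

def supervised_fine_tune(pretrained_model, demonstrations): ...   # Step 1: SFT
def collect_human_rankings(policy, prompts): ...                  # humans rank policy outputs
def train_reward_model(comparisons): ...                          # Step 2: reward model
def ppo_fine_tune(policy, reward_model, prompts): ...             # Step 3: PPO

def rlhf(pretrained_model, demonstrations, prompts, num_rounds=3):
    # Step 1 is run once; steps 2 and 3 can be iterated.
    policy = supervised_fine_tune(pretrained_model, demonstrations)
    for _ in range(num_rounds):
        comparisons = collect_human_rankings(policy, prompts)
        reward_model = train_reward_model(comparisons)
        policy = ppo_fine_tune(policy, reward_model, prompts)
    return policy
```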
Step 1: Supervised Fine-Tuning (SFT) Model
The first step is to collect demonstration data and use it to train a supervised policy model: human annotators write the expected output for a selected set of prompts, and a pretrained model is fine-tuned on these demonstrations.
To create a general-purpose chatbot like ChatGPT, the developers fine-tune on top of a "code model" (from the GPT-3.5 series) rather than on a plain-text model.
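The sketch below shows what this supervised fine-tuning step looks like in principle, using GPT-2 from Hugging Face transformers as a stand-in base model and two made-up demonstrations; the real pipeline fine-tunes a much larger model on a curated set of annotator-written responses.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Hypothetical demonstration data: prompts paired with annotator-written answers.
demonstrations = [
    ("Explain gravity to a child.", "Gravity is the force that pulls things toward the ground."),
    ("Write a haiku about the sea.", "Waves fold into foam / salt wind carries the grey light / gulls stitch sky to shore."),
]

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for prompt, answer in demonstrations:
    # Concatenate prompt and demonstration; the model learns to reproduce the answer.
    batch = tokenizer(prompt + "\n" + answer, return_tensors="pt", truncation=True)
    # With labels == input_ids, the model returns the standard next-token loss.
    # (In practice the prompt tokens are often masked out of the loss.)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```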
Because the amount of data available in this step is limited, the resulting SFT model may still produce text that users do not actually care about, and it often still suffers from inconsistency. The problem here is that the supervised learning step has a high scalability cost: collecting high-quality demonstrations is slow and expensive.
To overcome this problem, the strategy is to have human annotators rank different outputs of the SFT model in order to create a reward model (RM), rather than asking the annotators to produce a much larger curated dataset.
Step 2: Training the Reward Model
The goal of this step is to learn an objective function directly from the data. The purpose of this function is to score the SFT model's outputs according to how desirable they are to humans. In practice it strongly reflects the specific preferences of the selected human annotators and the common guidelines they agreed to follow. Ultimately, this process yields a system that imitates human preferences learned from data.
How it works: a list of prompts is selected and the SFT model generates several candidate outputs for each prompt; human annotators then rank these outputs from best to worst, and the rankings are used to train the reward model.
Ranking outputs is much easier for annotators than labeling from scratch, so the process scales more efficiently. In practice, roughly 30-40k prompts were selected, each contributing several different combinations of ranked outputs.
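Reward models of this kind are typically trained with a pairwise ranking loss of the form -log σ(r_chosen − r_rejected), where each ranking of K outputs yields K(K−1)/2 comparison pairs. A minimal sketch with made-up reward values (the tensors here are illustrative, not real data):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: push the scalar reward of the preferred output
    above the reward of the rejected output for every comparison pair."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: rewards the RM assigned to pairs of ranked outputs.
r_chosen = torch.tensor([1.2, 0.3, 0.8])     # outputs the annotators preferred
r_rejected = torch.tensor([0.4, 0.5, -0.1])  # outputs they ranked lower

print(reward_model_loss(r_chosen, r_rejected))
```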
Step 3: Fine-tune the SFT model using PPO
In this step, reinforcement learning is applied to tune the SFT model by optimizing it against the reward model. The specific algorithm used is Proximal Policy Optimization (PPO), and the tuned model is referred to as the PPO model.
What is PPO? The main features of this algorithm are as follows: it is an on-policy algorithm, meaning it learns directly from data generated by the current policy; it uses a trust-region style update that clips each policy change so the new policy cannot move too far from the old one, which stabilizes training; and it uses a value function to estimate the expected return of a state, from which the advantage of each action is computed. In the RLHF setting, a per-token KL penalty against the SFT model is also added to the reward so the tuned policy does not drift too far from the original distribution.
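The sketch below shows PPO's clipped surrogate objective and a KL-penalized reward of the kind used in RLHF; the function names and the coefficient values are illustrative assumptions, not OpenAI's actual implementation.

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective: the probability ratio between the new
    and old policy is clipped so a single update cannot move the policy too far
    from the one that generated the data."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def rlhf_reward(rm_score, log_prob_policy, log_prob_sft, beta=0.02):
    """Per-token reward used in RLHF: the reward model's score minus a KL
    penalty that keeps the tuned policy close to the SFT model.
    beta is a hypothetical coefficient."""
    return rm_score - beta * (log_prob_policy - log_prob_sft)
```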
Performance evaluation
Because the model is trained on human-labeled input, the core part of the evaluation is also based on human input: annotators are asked to score the quality of the model's outputs. To avoid overfitting to the judgments of the annotators involved in the training phase, the test set uses prompts from other OpenAI customers that did not appear in the training data.
The model is evaluated based on three criteria:
Helpfulness: the model's ability to follow user instructions, as well as to infer instructions from context.
Truthfulness: the model's tendency to avoid fabricating facts, evaluated on the TruthfulQA dataset.
Harmlessness: the model's ability to avoid inappropriate, disparaging, or toxic output, evaluated on datasets such as RealToxicityPrompts.
The model was also evaluated for zero-shot performance on traditional NLP tasks such as question answering and summarization, where alignment tuning can cause some regression relative to GPT-3. Performance regression on these datasets can be greatly reduced by a trick called pre-training mixing: during gradient-descent training of the PPO model, the gradient updates are computed by mixing the gradients of the PPO objective with gradients from the original pre-training objective.
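A hedged sketch of the idea, assuming a combined loss of the form L = L_PPO + γ · L_pretrain; the coefficient and function name are illustrative, not values reported by OpenAI.

```python
import torch

def pretraining_mix_loss(ppo_loss: torch.Tensor,
                         pretrain_lm_loss: torch.Tensor,
                         mix_coef: float = 1.0) -> torch.Tensor:
    """Blend the PPO objective with the original next-token pretraining loss so
    that alignment fine-tuning does not erase capabilities learned during
    pretraining. mix_coef is a hypothetical weighting factor."""
    return ppo_loss + mix_coef * pretrain_lm_loss

# During each update, gradients from both terms flow into the policy:
#   total = pretraining_mix_loss(ppo_loss, pretrain_lm_loss)
#   total.backward(); optimizer.step()
```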
Disadvantages of the method
A very obvious limitation of this method is that, in the process of aligning the language model with human intentions, the data used to fine-tune the model is shaped by a variety of complex subjective factors, mainly including: the preferences of the human annotators who produce the demonstration data; the researchers who design the study and write the labeling instructions; the choice of prompts, whether crafted by developers or provided by OpenAI customers; and the annotator bias that enters both reward-model training and model evaluation.
In addition to this obvious "endogenous" limitation, the method also has other shortcomings and open problems that still need to be solved.