With the emergence of ChatGPT and the widespread discussion that followed, previously obscure acronyms such as RLHF, SFT, IFT, and CoT have entered the public eye. What are these acronyms, and why do they matter? The author of this article reviewed the important papers on these topics and classified and summarized them.
ChatGPT is not the first dialogue agent based on a language model (LM). In fact, many organizations released language-model dialogue agents before OpenAI did, including Meta's BlenderBot, Google's LaMDA, DeepMind's Sparrow, and Anthropic's Assistant. Some organizations have also announced plans to build open-source chatbots and have made their roadmaps public (such as LAION's Open-Assistant). There are certainly other organizations doing the same thing that simply have not announced it yet.
The table below compares these AI chatbots in terms of public accessibility, training data, model architecture, and evaluation details. Since no such details are available for ChatGPT itself, the information shown is for InstructGPT, an instruction-fine-tuned model from OpenAI that is considered the basis of ChatGPT.
While these chatbots differ in training data, model architecture, and fine-tuning, they share some commonalities. One of them is instruction following: responding to instructions specified by the user. For example, one can instruct ChatGPT to write a poem about nudges.
In general, the language-modeling objective of the base model alone is not enough for the model to learn to follow a user's instructions effectively. Model creators therefore use instruction fine-tuning (IFT), which fine-tunes the base model on a diverse set of tasks written as instructions, in addition to classic NLP tasks such as sentiment analysis, text classification, and summarization.
IFT data mainly consists of three parts: an instruction, an input, and an output. The input is optional; some tasks require only an instruction, like the ChatGPT example above. An input and an output together constitute an instance, and a given instruction can have multiple instances of inputs and outputs. A relevant example is shown below (Wang et al., '22).
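In the same spirit, a hypothetical IFT record might look like the following sketch; the field names and contents are illustrative only and are not taken from any specific dataset.

```python
# A hypothetical IFT instance: one instruction paired with (input, output) instances.
# Field names and content are illustrative, not from a specific dataset.
ift_instance = {
    "instruction": "Classify the sentiment of the given movie review as positive or negative.",
    "instances": [
        {"input": "The plot was predictable, but the acting carried the film.", "output": "positive"},
        {"input": "Two hours of my life I will never get back.", "output": "negative"},
    ],
}

# Tasks like open-ended generation need no input at all:
instruction_only = {
    "instruction": "Write a short poem about autumn.",
    "instances": [{"input": "", "output": "..."}],
}
```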
IFT data is typically bootstrapped from a collection of human-written instructions together with a language model: the LM is prompted with seed examples and asked to generate new instructions, inputs, and outputs. In each round, the model is prompted with samples chosen from both the human-written and the model-generated examples. The contribution of humans and models to a dataset can be placed on a spectrum, as shown in the figure below.
At one end of this spectrum are purely model-generated IFT datasets such as Unnatural Instructions; at the other end are large community efforts of hand-written instructions such as Super-Natural Instructions. In between are approaches that start from a small, high-quality seed dataset and then bootstrap from it, such as Self-Instruct. Another way to collect a dataset for IFT is to take existing high-quality crowdsourced NLP datasets for various tasks (including prompting datasets) and cast them as instructions using a unified schema or a variety of templates. Related work includes T0, the Natural Instructions dataset, the FLAN LM, and OPT-IML.
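Returning to the bootstrapping idea described above, a minimal sketch in the spirit of Self-Instruct might look like the following. The `generate` callable and the `parse_new_examples` helper are placeholders for whatever LM API and quality filtering are actually used.

```python
import random

def bootstrap_ift_data(seed_examples, generate, num_rounds=3, prompts_per_round=10, k_shot=4):
    """Iteratively grow an instruction dataset from a small human-written seed pool.

    `generate` is a placeholder for a call to a language model that takes a
    few-shot prompt string and returns generated text; `parse_new_examples`
    is likewise a stand-in for real parsing and quality filtering.
    """
    pool = list(seed_examples)  # mix of human-written and model-generated examples
    for _ in range(num_rounds):
        for _ in range(prompts_per_round):
            # Prompt the LM with demonstrations drawn from the current pool.
            demos = random.sample(pool, k=min(k_shot, len(pool)))
            prompt = "\n\n".join(
                f"Instruction: {d['instruction']}\nInput: {d['input']}\nOutput: {d['output']}"
                for d in demos
            ) + "\n\nInstruction:"
            completion = generate(prompt)
            pool.extend(parse_new_examples(completion))  # filter and dedupe in practice
    return pool

def parse_new_examples(text):
    # Stand-in: real pipelines parse the completion back into
    # instruction/input/output triples and drop low-quality ones.
    return []
```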
However, a language model fine-tuned on instructions may not always generate responses that are helpful and safe. Examples of this kind of behavior include being evasive by always giving an unhelpful response such as "Sorry, I don't understand," or generating an unsafe response to user input on a sensitive topic.
To solve this problem, model developers use supervised fine-tuning (SFT), fine-tuning the base language model on high-quality human-annotated data to make its responses helpful and safe.
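As a rough sketch of what SFT looks like in practice, assuming a Hugging Face-style causal LM (the model name, the tiny inline dataset, and the prompt template are placeholders), the human demonstrations are formatted into prompt-response text and the base model is trained on them with the usual language-modeling loss:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for whatever base model you start from
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder demonstrations; real SFT data is curated and annotated by humans.
demonstrations = [
    {"prompt": "How do I politely decline a meeting?", "response": "You could say ..."},
    {"prompt": "Explain photosynthesis to a child.", "response": "Plants use sunlight ..."},
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for example in demonstrations:
    text = f"User: {example['prompt']}\nAssistant: {example['response']}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # Standard language-modeling loss over the demonstration text
    # (real pipelines often mask the prompt tokens so only the response is learned).
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```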
SFT and IFT are closely linked: instruction tuning can be thought of as a subset of supervised fine-tuning. In the recent literature, the SFT phase is generally used for safety topics rather than for the instruction-specific training done in IFT. In the future, this classification and terminology should mature into clearer use cases and methods.
Google's LaMDA is likewise fine-tuned on a dialogue dataset with safety annotations based on a set of rules. These rules are predefined by the model creator and cover a broad range of topics such as harm, discrimination, and misinformation.
OpenAI's InstructGPT, DeepMind's Sparrow, and Anthropic's Constitutional AI, on the other hand, use reinforcement learning from human feedback (RLHF). In RLHF, model responses are first ranked based on human feedback (for example, choosing the preferred of two answers); a preference model is then trained on these annotated responses to return a scalar reward for the RL optimizer; finally, the dialogue agent is trained via reinforcement learning to optimize against this preference model.
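The preference model at the heart of this pipeline is usually trained with a pairwise ranking loss over human comparisons. A minimal sketch follows; the scalar-headed `reward_model` and the comparison batches are placeholders, not any particular system's implementation.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_batch, rejected_batch):
    """Bradley-Terry style ranking loss: push the score of the human-preferred
    response above the score of the rejected one."""
    r_chosen = reward_model(chosen_batch)      # shape: (batch_size,)
    r_rejected = reward_model(rejected_batch)  # shape: (batch_size,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# During RL fine-tuning, the frozen reward model scores the dialogue agent's
# sampled responses, and an RL optimizer (e.g. PPO) maximizes that scalar reward,
# typically with a KL penalty toward the original model to avoid drifting too far.
```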
Chain-of-thought (CoT) prompting is a special case of instruction demonstration that generates output by eliciting step-by-step reasoning from the dialogue agent. Models fine-tuned with CoT use a dataset of instructions annotated by humans with step-by-step reasoning. As the example below shows, orange highlights mark the instructions, pink marks the inputs and outputs, and blue marks the CoT reasoning.
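To make the format concrete, a hypothetical CoT fine-tuning example might look like the following; the content is invented for illustration and mirrors the instruction/input/output structure above.

```python
# A hypothetical CoT training example: the target output contains the
# step-by-step reasoning followed by the final answer. Content is illustrative.
cot_example = {
    "instruction": "Answer the following word problem, showing your reasoning.",
    "input": "A shop sells pens in packs of 4. Ben buys 3 packs and gives away 5 pens. How many are left?",
    "output": (
        "3 packs of 4 pens is 3 * 4 = 12 pens. "   # chain-of-thought reasoning
        "After giving away 5 pens, 12 - 5 = 7 pens remain. "
        "The answer is 7."                          # final answer
    ),
}
```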
Models fine-tuned with CoT perform better on tasks involving common sense, arithmetic, and symbolic reasoning. CoT fine-tuning has also been shown to be very effective for harmlessness (sometimes better than RLHF), and it achieves this without the model becoming evasive and responding with "Sorry, I can't answer this question" to sensitive prompts.
## Summary of key points
The main points of this article are summarized as follows:
1. Compared with the pre-training data, only a very small amount of data is needed for instruction fine-tuning.
2. Supervised fine-tuning uses human annotations to make model outputs safer and more helpful.
3. CoT fine-tuning improves model performance on tasks that require step-by-step thinking and makes models less evasive on sensitive topics.
## Thoughts on further work on dialogue agents
Finally, the author offers some thoughts on the future development of dialogue agents.
1. How important is RL in learning from human feedback? Can the performance of RLHF be matched by training on higher-quality data with IFT or SFT?
2. How does Sparrow's use of SFT plus RLHF compare, in terms of safety, to using only SFT as in LaMDA?
3. How much pre-training is required before IFT, SFT, CoT, and RLHF? What are the trade-offs? What are the best base models to use?
4. Many of the models introduced in this article were carefully engineered: engineers deliberately search for patterns that lead to failure and shape future training (prompts and methods) based on the problems uncovered. How can the effects of these methods be systematically documented and reproduced?