From Google's T5 models to OpenAI's GPT series, large language models (LLMs) have demonstrated impressive generalization capabilities, such as in-context learning and chain-of-thought reasoning. At the same time, to make LLMs follow natural-language instructions and complete real-world tasks, researchers have been exploring instruction-tuning methods. These take two forms: fine-tuning models on a wide range of tasks using human-annotated prompts and feedback, or supervised fine-tuning on public benchmarks and datasets augmented with manually or automatically generated instructions.
Among these methods, Self-Instruct tuning is a simple and effective approach: the model learns from instruction-following data generated by state-of-the-art instruction-tuned teacher LLMs, aligning it with human intent. Instruction tuning has proven to be an effective means of improving the zero-shot and few-shot generalization capabilities of LLMs.
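The Self-Instruct loop can be sketched as: sample a few tasks from a seed pool, ask the teacher LLM to propose a new one, and add it back to the pool. The sketch below is a minimal illustration; `query_teacher` is a hypothetical stub standing in for a real API call to the teacher model, and the prompt format is an assumption, not the method's exact template.

```python
import random

def query_teacher(prompt):
    # Hypothetical stub for a teacher-LLM call (e.g. GPT-4's API);
    # a real pipeline would send `prompt` to the model and parse its reply.
    return "New instruction: Summarize the following paragraph in one sentence."

def self_instruct(seed_tasks, rounds=2):
    """Grow an instruction pool by asking a teacher LLM to propose new tasks."""
    pool = list(seed_tasks)
    for _ in range(rounds):
        examples = random.sample(pool, k=min(3, len(pool)))
        prompt = ("Here are some tasks:\n" + "\n".join(examples)
                  + "\nPropose one more task.")
        new_task = query_teacher(prompt).removeprefix("New instruction: ")
        if new_task not in pool:  # crude de-duplication; the real method filters by ROUGE similarity
            pool.append(new_task)
    return pool

pool = self_instruct(["Translate 'hello' into French.", "List three prime numbers."])
```

The resulting instruction pool, paired with teacher-generated responses, becomes the supervised fine-tuning data.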
The recent success of ChatGPT and GPT-4 offers a huge opportunity to improve open-source LLMs through instruction tuning. Meta's LLaMA is a family of open-source LLMs with performance comparable to proprietary LLMs such as GPT-3. To teach LLaMA to follow instructions, Self-Instruct was quickly adopted for its strong performance and low cost. For example, Stanford's Alpaca model was trained on 52K instruction-following samples generated by GPT-3.5, and the Vicuna model on roughly 70K instruction-following samples from ShareGPT.
To advance the state of the art in LLM instruction tuning, Microsoft Research used GPT-4 as a teacher model for Self-Instruct fine-tuning for the first time in its paper "Instruction Tuning with GPT-4".
On the one hand, the researchers released the data generated by GPT-4, including a 52K instruction-following dataset in both English and Chinese, as well as GPT-4-generated feedback data rating the outputs of three instruction-tuned models.
On the other hand, they developed an instruction-tuned LLaMA model and a reward model based on the GPT-4-generated data. To assess the quality of instruction-tuned LLMs, the researchers evaluated test samples with three metrics: human evaluation on three alignment criteria, automatic evaluation based on GPT-4 feedback, and ROUGE-L (a longest-common-subsequence-based automatic summarization metric) on Unnatural Instructions.
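ROUGE-L scores a candidate response against a reference by the length of their longest common subsequence (LCS) of words, combined into an F-measure. A minimal sketch (the `beta` weighting follows common ROUGE implementations and is an assumption here):

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-measure over whitespace tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

score = rouge_l("the cat sat on the mat", "the cat is on the mat")
```

Here the LCS is "the cat on the mat" (5 of 6 tokens on each side), so precision and recall are both 5/6.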
The experimental results confirm the effectiveness of fine-tuning LLMs with GPT-4-generated data: the 52K English and Chinese instruction-following data generated by GPT-4 achieves better zero-shot performance on new tasks than previous SOTA models. The researchers have released the GPT-4-generated data and the accompanying code.
Datasets

The study uses GPT-4 to generate four datasets: 52K English instruction-following data, 52K Chinese instruction-following data, comparison data in which GPT-4 rates model responses (used to train the reward model), and GPT-4 answers on Unnatural Instructions.
Figure 1 compares the sets of English output responses from GPT-4 and GPT-3.5. Figures 1(a) and (b) show the verb-noun pairs with frequency above 10 in each output set, and Figure 1(c) compares the 25 most frequent word pairs across the two sets. Figure 1(d) compares the frequency distributions of sequence lengths; the results show that GPT-4 tends to generate longer sequences than GPT-3.5.
Starting from the LLaMA 7B checkpoint, the study uses supervised fine-tuning to train two models: (i) LLaMA-GPT4, trained on the 52K English instruction-following data generated by GPT-4, and (ii) LLaMA-GPT4-CN, trained on the 52K Chinese instruction-following data generated by GPT-4.
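A common detail in this kind of supervised fine-tuning is formatting each example into a prompt template and masking the prompt tokens out of the loss, so the model is only trained to produce the response. The sketch below uses Alpaca-style formatting as an assumption about the template, and whitespace splitting as a stand-in for a real tokenizer:

```python
PROMPT = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n\n"
          "### Instruction:\n{instruction}\n\n### Response:\n")

IGNORE_INDEX = -100  # conventional "skip this position" label in cross-entropy losses

def build_example(instruction, response):
    """Format one instruction pair and mask prompt tokens out of the loss."""
    prompt = PROMPT.format(instruction=instruction)
    tokens = (prompt + response).split()      # word-level stand-in for a real tokenizer
    n_prompt = len(prompt.split())
    labels = [IGNORE_INDEX] * n_prompt + tokens[n_prompt:]
    return tokens, labels

tokens, labels = build_example("Name a primary color.", "Red is a primary color.")
```

Positions labeled `IGNORE_INDEX` contribute nothing to the loss, so gradients come only from the response tokens.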
Reward Model
Reinforcement learning from human feedback (RLHF) aims to align LLM behavior with human preferences. Reward modeling is one of its key components, and the problem is often formulated as a regression task that predicts a reward score for a given prompt and response. This approach usually requires large-scale comparison data, however, and existing open-source models such as Alpaca, Vicuna, and Dolly do not involve RLHF due to the high cost of annotating such data. Meanwhile, recent research shows that GPT-4 is able to identify and repair its own errors and accurately judge the quality of responses. Therefore, to facilitate research on RLHF, this study created comparison data using GPT-4, as described above.
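Reward models trained on comparison data are commonly optimized with a pairwise ranking loss, -log σ(r_chosen − r_rejected), which pushes the score of the preferred response above the rejected one. A minimal sketch of that loss (the specific scores below are illustrative, not from the paper):

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the model already
    scores the preferred response higher than the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the margin between the two scores grows.
confident = pairwise_reward_loss(2.0, -1.0)
uncertain = pairwise_reward_loss(0.1, 0.0)
```

When the two scores are equal, the sigmoid is 0.5 and the loss is log 2; it decays toward zero as the preferred response is scored increasingly higher.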
To evaluate data quality, the study also trained a reward model based on OPT 1.3B on this dataset. The distribution of the comparison data is shown in Figure 2.
The study used three types of evaluation: human evaluation, GPT-4-based evaluation, and evaluation on Unnatural Instructions. The results confirm that, compared with other machine-generated data, GPT-4-generated data is an efficient and effective resource for instruction-tuning LLMs. Next we look at the experimental details.
Human evaluation
Figure 3(a) compares LLaMA-GPT4 with Alpaca: on the Helpfulness criterion, LLaMA-GPT4 wins 54.12% of the comparisons. Figure 3(b) compares LLaMA-GPT4 with GPT-4, showing that LLaMA fine-tuned on GPT-4 instructions performs similarly to the original GPT-4.
Comparison with SOTA using automatic evaluation
The study uses GPT-4 to automatically evaluate the responses of different models on 80 unseen questions. It first collects answers from the two chatbots LLaMA-GPT-4 (7B) and GPT-4, then uses published answers from other chatbots, including LLaMA (13B), Alpaca (13B), Vicuna (13B), Bard (Google, 2023), and ChatGPT. For each evaluation, GPT-4 is asked to rate the quality of the responses from the two models on a scale of 1 to 10. The results are shown in Figure 4.
Figure 4(c, d) compares all chatbots. LLaMA-GPT4 performs well: the 7B LLaMA-GPT4 outperforms the 13B Alpaca and LLaMA. However, a gap remains between LLaMA-GPT4 and large commercial chatbots such as GPT-4.
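The mechanics of this GPT-4-as-judge protocol amount to building a comparison prompt and parsing two scores out of the judge's reply. The sketch below is a hypothetical illustration; the prompt wording and the convention that scores appear on the first line are assumptions, not the paper's exact template.

```python
import re

def build_judge_prompt(question, answer_a, answer_b):
    # Hypothetical prompt shape, not the paper's exact template.
    return (f"Question: {question}\n\nAssistant A: {answer_a}\n\n"
            f"Assistant B: {answer_b}\n\n"
            "Rate each assistant from 1 to 10. Reply with two numbers "
            "on the first line, then an explanation.")

def parse_scores(reply):
    """Pull the two 1-10 scores from the first line of the judge's reply."""
    nums = re.findall(r"\b(?:10|[1-9])\b", reply.splitlines()[0])
    if len(nums) < 2:
        raise ValueError("judge reply did not contain two scores")
    return int(nums[0]), int(nums[1])

a, b = parse_scores("8 9\nAssistant B's answer is more detailed and accurate.")
```

Aggregating these per-question score pairs over the 80 questions yields the relative rankings plotted in Figure 4.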
The researchers further studied the performance of all chatbots in Figure 5 below. They first used GPT-4 to translate the chatbots' English responses into Chinese, and separately used GPT-4 to translate the English questions into Chinese and obtain answers to them. Comparisons against GPT-4's translated and generated Chinese responses are shown in Figures 5(a) and 5(b), and results for all models asked to answer in Chinese are shown in Figure 5(c).
In Figure 6 below, the researchers compare LLaMA-GPT4 with GPT-4 and Alpaca on Unnatural Instructions. The results show that LLaMA-GPT4 and GPT-4 perform better as the ground-truth response length increases, meaning they follow instructions better in more creative scenarios. When the sequence length is short, both LLaMA-GPT4 and GPT-4 generate responses containing the simple ground-truth answer, and adding extra words makes the responses more conversational.
Please refer to the original paper for more technical and experimental details.