Since the ChatGPT API was opened up, many studies have used the outputs of large foundation models (LFMs) such as ChatGPT and GPT-4 as training data to improve the capabilities of small models through imitation learning.
However, because of shallow imitation signals, insufficient training data, and the lack of rigorous evaluation standards, the actual performance of these small models has been overestimated.
In practice, small models tend to imitate the output style of LFMs rather than their reasoning process.
Paper link: https://arxiv.org/pdf/2306.02707.pdf
To address these challenges, Microsoft recently released a 51-page paper proposing Orca, a 13-billion-parameter model that learns to imitate the reasoning process of LFMs.
The researchers designed rich training signals so that Orca can learn explanation traces, step-by-step thought processes, and complex instructions from GPT-4, with ChatGPT assisting as a teaching assistant; large-scale, diverse imitation data mined through sampling and selection further strengthens this progressive learning.
In experimental evaluation, Orca outperformed other SOTA instruction-tuned models, more than doubling the performance of Vicuna-13B on complex zero-shot reasoning benchmarks such as BigBench Hard (BBH) and achieving a 42% improvement on AGIEval.
In addition, Orca performed on par with ChatGPT on the BBH benchmark and trailed it by only 4% on professional and academic exams such as the SAT, LSAT, GRE, and GMAT, all measured in a zero-shot setting without chain-of-thought prompting.
The findings show that having models learn from step-by-step explanations, whether generated by humans or by more advanced AI models, is a promising research direction for improving model capabilities and skills.
Explanation Tuning
Dataset construction
In the training data, each instance consists of three parts: a system message, a user query, and an LFM response.
The system message is placed at the beginning of the prompt and provides the LFM with basic context, guidelines, and other relevant details.
System messages can be used to change the length of responses, describe the personality of the AI assistant, establish acceptable and unacceptable LFM behavior, and determine the response structure of the AI model.
The researchers hand-crafted 16 system messages to elicit different types of LFM responses, such as generating creative content or answering information-seeking queries; most importantly, the messages can prompt the LFM to produce explanations and step-by-step reasoning in its answers.
The user query defines the actual task the LFM is asked to perform.
To obtain a large number of diverse user queries, the researchers sampled 5 million user queries from the FLAN-v2 collection (FLAN-5M) and collected ChatGPT responses for them; they then sampled 1 million of those 5 million instructions (FLAN-1M) and collected GPT-4 responses.
The FLAN-v2 collection consists of five sub-collections, namely CoT, NiV2, T0, Flan 2021, and Dialogue; each sub-collection contains multiple tasks, and each task is a collection of queries.
Each sub-collection is associated with multiple academic datasets, and each dataset has one or more tasks that focus mainly on zero-shot and few-shot queries.
In this work, the researchers sampled only zero-shot queries for training Orca and did not sample from the Dialogue sub-collection, because those queries often lack the context needed to elicit useful responses from ChatGPT.
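The selection rules and the three-part instance layout described above can be sketched roughly as follows. This is only an illustration: the `Query` and `TrainingInstance` types, the two placeholder system messages, the `teacher` callable, and the uniform random choice of system message are all assumptions made for the example, not details taken from the paper.

```python
import random
from dataclasses import dataclass

# Placeholder system messages (hypothetical); the 16 hand-crafted
# messages from the paper are not reproduced here.
SYSTEM_MESSAGES = [
    "",
    "You are an AI assistant. Think step by step and justify your answer.",
]

# Sub-collections that are sampled; Dialogue is deliberately left out.
SAMPLED_SUBCOLLECTIONS = {"CoT", "NiV2", "T0", "Flan 2021"}


@dataclass
class Query:
    subcollection: str
    text: str
    zero_shot: bool


@dataclass
class TrainingInstance:
    system_message: str
    user_query: str
    lfm_response: str


def select_queries(queries):
    """Keep only zero-shot queries from the sampled sub-collections."""
    return [
        q for q in queries
        if q.zero_shot and q.subcollection in SAMPLED_SUBCOLLECTIONS
    ]


def build_instance(query, teacher, rng):
    """Pair a query with a system message and the teacher's response.

    `teacher` is a hypothetical callable (system_message, user_query) -> str
    standing in for a ChatGPT or GPT-4 request. Picking the system message
    uniformly at random is an assumption of this sketch.
    """
    system_message = rng.choice(SYSTEM_MESSAGES)
    response = teacher(system_message, query.text)
    return TrainingInstance(system_message, query.text, response)
```

In this sketch a real run would replace `teacher` with an API client and store the resulting instances for the two training stages.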
ChatGPT as a teaching assistant
Orca is first trained on the FLAN-5M data (ChatGPT augmentation), followed by a second stage of training on FLAN-1M (GPT-4 augmentation).
There are two main reasons for using ChatGPT as an intermediate teaching assistant:
1. Capability gap
Although GPT-4's parameter count has not been disclosed, Orca's 13 billion parameters are certainly many times fewer. The capability gap between ChatGPT and Orca is smaller, which makes ChatGPT better suited as an intermediate teacher, and this approach has been shown in knowledge distillation to improve the imitation learning performance of smaller student models.
The approach can also be viewed as a form of progressive learning or curriculum learning, in which the student first learns from easier examples and then moves on to harder ones, on the assumption that longer responses are harder to imitate than shorter ones; this lets the student improve its reasoning and step-by-step explanation skills by learning from the larger teacher model.
2. Cost and time
Large-scale data collection through the Azure OpenAI API is subject to several constraints: a rate limit on requests per minute to prevent excessive traffic; a limit on available tokens per minute due to service latency; and the monetary cost of prompt length and token completion.
By comparison, the ChatGPT API is faster and cheaper than the GPT-4 endpoint, so five times as much data was collected from ChatGPT as from GPT-4.
The distribution of ChatGPT and GPT-4 response lengths across the different system messages shows that GPT-4 responses are on average 1.5 times longer than ChatGPT responses, which lets Orca learn progressively from increasingly complex teacher explanations; ablation experiments demonstrate the impact of the teaching assistant.
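The two-stage schedule can be written down as a small curriculum, shown here as a minimal sketch. The dataset sizes and teachers come from the text, and the 4 epochs per stage come from the training details reported in the article; the `Stage` type and the `train_one_epoch` callback are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str          # dataset used in this stage
    teacher: str       # model whose responses are imitated
    num_examples: int
    epochs: int


# Easier, shorter ChatGPT responses first; longer GPT-4 responses second.
CURRICULUM = [
    Stage("FLAN-5M", "ChatGPT", 5_000_000, 4),
    Stage("FLAN-1M", "GPT-4", 1_000_000, 4),
]


def run_curriculum(train_one_epoch, stages=CURRICULUM):
    """Run the stages strictly in order and return a log of what was run.

    `train_one_epoch(stage, epoch)` is a hypothetical callback that would
    perform one pass over the stage's data with the current model weights.
    """
    log = []
    for stage in stages:
        for epoch in range(stage.epochs):
            train_one_epoch(stage, epoch)
            log.append((stage.name, epoch))
    return log
```

The key design point is that the model weights carry over from the first stage to the second, so the GPT-4 data refines a student that has already learned from the intermediate teacher.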
Training
In the tokenization stage, the researchers used LLaMA's byte pair encoding (BPE) tokenizer to process input samples; it splits multi-digit numbers into individual digits and falls back to bytes to decompose unknown UTF-8 characters.
To handle variable-length sequences, a padding token [[PAD]] is added to the LLaMA tokenizer's vocabulary, so the final vocabulary contains 32,001 tokens.
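To make the role of the padding token concrete, here is a toy sketch that pads a batch of token-id lists to a common length. It does not use the real LLaMA tokenizer; the assumption that the new [[PAD]] token receives the next free id (32000) is made for illustration, and `pad_batch` is a hypothetical helper.

```python
BASE_VOCAB_SIZE = 32000   # LLaMA tokenizer vocabulary before the extra token
PAD_TOKEN = "[[PAD]]"
PAD_ID = BASE_VOCAB_SIZE  # assumption: the new token takes the next free id
FINAL_VOCAB_SIZE = BASE_VOCAB_SIZE + 1  # 32001 tokens in total


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-id lists to the length of the longest one.

    Returns the padded ids and an attention mask in which 1 marks a real
    token and 0 marks padding.
    """
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask
```

Because the vocabulary grows by one entry, the model's embedding table would also need one extra row for the new token.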
To optimize training and make effective use of the available computing resources, the researchers used a packing technique that concatenates multiple input instances into a single sequence before training the model.
During packing, the total length of a concatenated sequence does not exceed max_len=2048 tokens. The input samples are randomly shuffled and divided into groups such that the concatenated length of each group is at most max_len.
Given the length distribution of the augmented instructions in the training data, the packing factor is 2.7 examples per sequence.
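A minimal sketch of such a packing step is shown below, assuming a simple greedy grouping after shuffling. The article does not describe the exact grouping algorithm, so the greedy strategy, the truncation of overlong examples, and the helper names `pack_examples` and `packing_factor` are assumptions of this example.

```python
import random

MAX_LEN = 2048  # maximum number of tokens in one packed sequence


def pack_examples(examples, max_len=MAX_LEN, seed=0):
    """Shuffle examples and greedily group them into packs of <= max_len tokens.

    `examples` is a list of token-id lists. Each returned pack is a list of
    examples that would be concatenated into one training sequence.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)

    packs, current, current_len = [], [], 0
    for example in shuffled:
        if len(example) > max_len:
            # Assumption: overlong examples are truncated rather than dropped.
            example = example[:max_len]
        if current and current_len + len(example) > max_len:
            packs.append(current)
            current, current_len = [], 0
        current.append(example)
        current_len += len(example)
    if current:
        packs.append(current)
    return packs


def packing_factor(packs):
    """Average number of examples per packed sequence."""
    return sum(len(pack) for pack in packs) / len(packs)
```

With this definition, a packing factor of 2.7 would mean that each 2048-token sequence holds 2.7 training examples on average.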
To train Orca, the researchers computed the loss only on tokens generated by the teacher model; that is, the model learns to generate responses conditioned on the system message and task instruction. This keeps the model focused on learning from the most relevant and informative tokens and improves the overall efficiency and effectiveness of training.
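The idea of restricting the loss to teacher-generated tokens can be illustrated with a small pure-Python sketch. It assumes the per-token log-probabilities of the target tokens are already available; `build_loss_mask` and `masked_cross_entropy` are hypothetical helpers for this example, not functions from the paper or from any library.

```python
import math


def build_loss_mask(prompt_len, response_len):
    """0 for system-message and user-query tokens, 1 for response tokens."""
    return [0] * prompt_len + [1] * response_len


def masked_cross_entropy(token_log_probs, loss_mask):
    """Average negative log-likelihood over positions where the mask is 1.

    token_log_probs[i] is the log-probability the model assigns to the
    correct target token at position i.
    """
    total = sum(-lp for lp, m in zip(token_log_probs, loss_mask) if m)
    count = sum(loss_mask)
    return total / count if count else 0.0


# Example: 3 prompt tokens followed by 2 response tokens.
mask = build_loss_mask(prompt_len=3, response_len=2)
log_probs = [math.log(0.1)] * 3 + [math.log(0.5)] * 2
loss = masked_cross_entropy(log_probs, mask)  # only the last 2 tokens count
```

In the example, the poorly predicted prompt tokens do not affect the loss at all, which is the intended effect of the mask.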
Finally, Orca was trained on 20 NVIDIA A100 GPUs with 80GB of memory each. It was first trained on FLAN-5M (ChatGPT augmentation) for 4 epochs, which took 160 hours, and then trained on FLAN-1M (GPT-4 augmentation) for a further 4 epochs.
Because of rate limits, endpoint load, and response length, collecting data from multiple GPT-3.5-turbo (ChatGPT) and GPT-4 endpoints took 2 and 3 weeks respectively.
The researchers mainly evaluated Orca's reasoning capabilities.
In the AGIEval experiments, Orca performs on par with Text-da-Vinci-003 and retains 88% of ChatGPT's performance, but lags significantly behind GPT-4.
On analysis and reasoning tasks, Vicuna performs significantly worse, retaining only 62% of ChatGPT's quality, which indicates that this open-source language model has weak reasoning ability.
While Orca performs on par with Text-da-Vinci-003 and scores 5 points below ChatGPT, its gap with ChatGPT is larger on math-related tasks (in the SAT, GRE, and GMAT).
Compared to Vicuna, Orca shows stronger performance, outperforming Vicuna in every category, with an average relative improvement of 42%.
GPT-4 far outperforms all other models, but there is still significant room for improvement on this benchmark, as all models currently score well below human level.
Orca's performance varies considerably with the type of system message; for the trained model, an empty system message tends to work well.
Orca outperforms ChatGPT on 325 samples across different tasks (the Orca-beats-ChatGPT examples), most of which come from LogiQA (29%), while the other LSAT tasks and the SAT-English task each account for less than 10%.
The reasoning evaluation on the Big-Bench Hard dataset shows that, aggregated across all tasks, Orca performs slightly better than ChatGPT but lags significantly behind GPT-4, and is 113% higher than Vicuna.
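The percentages quoted in this evaluation are relative figures. The article does not spell out the formulas, so the sketch below assumes the usual definitions: "retains X% of a reference model's quality" is the ratio of the two scores, and "Y% higher" is the score difference divided by the baseline score. The function names are hypothetical, and the scores in the comments are made-up round numbers, not results from the paper.

```python
def retained_quality(score, reference_score):
    """Score expressed as a percentage of a reference model's score."""
    return 100.0 * score / reference_score


def relative_improvement(score, baseline_score):
    """Percentage gain of `score` over `baseline_score`."""
    return 100.0 * (score - baseline_score) / baseline_score


# Made-up scores, for illustration only:
# a model scoring 44 against a reference scoring 50 retains 88% of its quality,
# and a model scoring 50 against a baseline scoring 25 is 100% higher,
# that is, it has doubled the baseline.
```

Under these definitions, an improvement above 100% means the model scores more than twice as high as the baseline.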