In practice, practitioners may struggle to find an AI model suited to their application: should they use an LLM or fine-tune a model? And if they use an LLM, which one should they choose?
Recently, researchers from Amazon, Texas A&M University, Rice University and other institutions surveyed the development of language models such as ChatGPT, and their article was also praised and retweeted by Yann LeCun.
Paper: https://arxiv.org/abs/2304.13712
Related resources: https://github.com/Mooler0410/LLMsPracticalGuide
This article takes a practical perspective: it discusses which tasks suit LLMs and the practical issues of models, data and tasks that need to be considered when selecting a model.
1 Introduction
In recent years, the rapid development of large language models (LLMs) has triggered a revolution in the field of natural language processing (NLP). These models are extremely powerful and promise to solve many different kinds of NLP tasks, from natural language understanding (NLU) to generation tasks, and may even pave the way to artificial general intelligence (AGI). However, to use these models effectively and efficiently, we need a practical understanding of their capabilities and limitations, as well as of the data and tasks involved in NLP.
This paper focuses on the practical application of LLMs in downstream NLP tasks, to provide guidance to practitioners and end-users. The goal of this guide is to give readers practical and useful advice on whether to use an LLM for a given task and how to choose the most suitable one. This takes into account many factors, such as model size, computational requirements, and whether a domain-specific pre-trained model is available. This article also introduces and explains LLMs from a practical application perspective, which can help practitioners and end-users successfully leverage the power of LLMs to solve their own NLP tasks.
The structure of this article is as follows. It first briefly introduces LLMs, focusing on the most important GPT-style and BERT-style architectures. It then gives an in-depth introduction to the key factors affecting model performance in terms of data, including pre-training data, training data/tuning data, and test data. In the last and most important part, it delves into various specific NLP tasks and discusses whether LLMs are suitable for knowledge-intensive tasks, traditional NLU tasks, and generation tasks, as well as the emergent capabilities these models acquire and challenging real-world application scenarios. We provide detailed examples to highlight the usefulness and limitations of LLMs in practice.
To analyze the capabilities of large language models, this article compares them with fine-tuned models. There is not yet a widely accepted standard for defining LLMs and fine-tuned models. To make a practical and effective distinction, this article uses the following definitions: an LLM is a large language model pre-trained on a large-scale data set without tuning on data for specific tasks; fine-tuned models are usually smaller, and after pre-training they are further fine-tuned on smaller task-specific data sets to optimize their performance on that task.
This article summarizes practical guidelines for using LLM in:
Figure 1: This evolutionary tree of modern LLMs traces the development of language models in recent years, highlighting some of the best-known models. Models on the same branch are more closely related. Transformer-based models are shown in colors other than gray: decoder-only models are the blue branch, encoder-only models are the pink branch, and encoder-decoder models are the green branch. A model's vertical position on the timeline indicates when it was released. Solid squares represent open source models, and empty squares represent closed source models. The stacked bar chart in the lower right corner shows the number of models from each company and institution.
This section briefly introduces the current best-performing LLMs. These models have different training strategies, model architectures and use cases. To understand the overall picture of LLMs more clearly, we can divide them into two broad categories: encoder-decoder or encoder-only language models, and decoder-only language models. Figure 1 shows the evolution of language models in detail. Based on this evolutionary tree, we can make some interesting observations:
a) Decoder-only models are gradually becoming dominant in LLM development. In the early stages of LLM development, decoder-only models were not as popular as encoder-only and encoder-decoder models. But after 2021, the emergence of GPT-3 changed the industry picture, and decoder-only models experienced explosive growth. Meanwhile, BERT brought an initial burst of growth for encoder-only models, but after that they gradually faded from view.
b) OpenAI continues to maintain its lead in LLMs, now and likely in the future. Other companies and institutions are playing catch-up to develop models comparable to GPT-3 and GPT-4. OpenAI's leading position may be attributed to its continued investment in the technology, even when the technology was not widely recognized in its early days.
c) Meta has made outstanding contributions to open source LLM and promoting LLM research. Meta stands out as one of the most generous commercial companies when it comes to its contributions to the open source community, especially related to LLMs, as it open sourced all LLMs it developed.
d) There is a trend towards closed source development in LLMs. In the early stages of LLM development (before 2020), the vast majority of models were open source. However, since the launch of GPT-3, companies have increasingly chosen to keep their models closed source, as with PaLM, LaMDA, and GPT-4. As a result, it is increasingly difficult for academic researchers to run LLM training experiments, and API-based research may become the dominant approach in academia.
e) Encoder-decoder models still have development prospects, because companies and institutions are still actively exploring this type of architecture and most of these models are open source. Google has made significant contributions to open source encoder-decoder models. However, given the flexibility and versatility of decoder-only models, Google's chances of success in persisting in this direction seem slimmer.
Table 1 briefly summarizes the characteristics of various representative LLMs.
Table 1: Characteristics of large language models
2.1 BERT-style language model: encoder-decoder or encoder-only
Unsupervised learning of natural language has made great progress recently, because natural language data is easy to obtain and unsupervised training paradigms can make better use of extremely large data sets. A common approach is to predict masked words in a sentence from the surrounding context. This training paradigm is called masked language modeling. It allows the model to gain a deeper understanding of the relationship between words and their context. These models are trained on large text corpora, using techniques such as the Transformer architecture, and have achieved state-of-the-art performance on many NLP tasks, such as sentiment analysis and named entity recognition. Well-known masked language models include BERT, RoBERTa and T5. Owing to their success on a wide variety of tasks, masked language models have become an important tool in the field of natural language processing.
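To make the masked-word objective concrete, here is a minimal sketch of how training examples could be prepared. It is a simplification for illustration only: the function name `mask_tokens`, the `[MASK]` placeholder string and the masking probability are assumptions of this sketch, and real models such as BERT use a more elaborate replacement scheme on subword tokens.

```python
import random

MASK = "[MASK]"  # placeholder symbol; an assumption of this sketch


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns (masked_tokens, labels). labels holds the original token at
    each masked position and None elsewhere, so a model would only be
    scored on the positions it has to reconstruct.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model sees the placeholder
            labels.append(tok)    # and must predict the original token
        else:
            masked.append(tok)
            labels.append(None)   # position is not part of the loss
    return masked, labels


sentence = "the model predicts hidden words from their context".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because each hidden word is predicted from the words on both sides of it, this objective encourages the model to use the full context of the sentence.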
2.2 GPT-style language model: decoder only
Although the architecture of language models is generally task-agnostic, these methods still require fine-tuning on data sets for specific downstream tasks. Researchers have found that scaling up a language model can significantly improve its few-shot and zero-shot performance. The most successful models in this respect are autoregressive language models, which are trained to generate the next word given the preceding words in a sequence. These models have been widely used in downstream tasks such as text generation and question answering. Autoregressive language models include GPT-3, OPT, PaLM, and BLOOM. The revolutionary GPT-3 showed for the first time that prompting and in-context learning can give reasonable results in few-shot and zero-shot settings, and thus demonstrated the superiority of autoregressive language models.
There are also models optimized for specific tasks, such as Codex for code generation and BloombergGPT for the financial field. A major recent breakthrough is ChatGPT, a version of GPT-3 optimized for conversational tasks that generates more interactive, coherent, and context-aware conversations for a variety of real-world applications.
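The idea of generating the next word from the previous words can be illustrated with a toy example. The sketch below uses simple bigram counts in place of a neural network, so it is only an analogy for how GPT-style models generate text; the corpus, `train_bigram` and `generate` are hypothetical names invented for this illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_new_tokens=5):
    """Greedy autoregressive decoding: repeatedly append the most
    frequent follower of the last generated word."""
    tokens = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # no known continuation for this word
        tokens.append(followers.most_common(1)[0][0])
    return tokens


corpus = [
    "a model predicts the next word",
    "the next word follows the previous words",
]
counts = train_bigram(corpus)
print(" ".join(generate(counts, "a")))  # prints "a model predicts the next word"
```

A real autoregressive model conditions on the whole preceding sequence rather than only the last word, but the generation loop has the same shape: predict one token, append it, and repeat.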
This section explains the critical role of data in choosing the right model for downstream tasks. The impact of data on model effectiveness begins in the pre-training phase and continues through the training and inference phases.
Key Point 1
(1) When the downstream task involves out-of-distribution data, such as adversarial examples or a shift in data domain, LLMs generalize better than fine-tuned models.
(2) When labeled data is limited, LLMs are better than fine-tuned models; when labeled data is abundant, both are reasonable choices, depending on the needs of the specific task.
(3) It is recommended to choose a model whose data domain used for pre-training is similar to the data domain of the downstream task.
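Points (1) and (2) of Key Point 1 can be read as a simple decision rule. The sketch below is only one possible encoding of that rule: the function `suggest_model` and in particular the threshold of 1000 labeled examples are hypothetical choices for illustration and do not come from the paper. Point (3), on matching the pre-training domain, is not modeled here.

```python
def suggest_model(out_of_distribution, num_labeled_examples,
                  few_label_threshold=1000):
    """Rough rule of thumb derived from Key Point 1.

    few_label_threshold is a hypothetical cut-off for "limited labeled
    data"; a sensible value depends on the task.
    """
    if out_of_distribution:
        # (1) LLMs generalize better under distribution shift.
        return "LLM"
    if num_labeled_examples < few_label_threshold:
        # (2) With little labeled data, LLMs are the better choice.
        return "LLM"
    # (2) With abundant labeled data, both options are reasonable.
    return "either (depends on the task)"


print(suggest_model(out_of_distribution=True, num_labeled_examples=50000))
print(suggest_model(out_of_distribution=False, num_labeled_examples=200))
print(suggest_model(out_of_distribution=False, num_labeled_examples=50000))
```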
This section will discuss in detail whether LLM is useful on various downstream NLP tasks and the corresponding model capabilities. Figure 2 is a decision flow diagram summarizing all discussions. When faced with a certain task, quick decisions can be made based on this process.
Figure 2: The decision-making process when a user chooses an LLM or a fine-tuned model for an NLP application. This decision flow chart helps users evaluate whether the downstream NLP task at hand meets specific criteria and determine whether an LLM or a fine-tuned model is best suited for their application based on the evaluation results. In the decision-making process in the figure, Y indicates that the conditions are met and N indicates that the conditions are not met. The yellow circle next to Y for the last condition indicates that there is currently no model that is well suited for this type of application.
4.1 Traditional NLU tasks
Traditional NLU tasks are basic tasks in the field of NLP, including text classification, named entity recognition (NER), entailment prediction, etc. Many of these tasks can serve as intermediate steps in larger AI systems, such as using NER for knowledge graph construction.
Not suitable for LLMs: For most natural language understanding tasks, such as the tasks in GLUE and SuperGLUE, if the task already has rich, well-annotated data and the test set contains very little out-of-distribution data, fine-tuned models still perform better. The size of the gap between small fine-tuned models and LLMs varies with the task and dataset.
Suitable for LLMs: However, some NLU tasks are better handled by LLMs. Two representative tasks are complex text classification problems and adversarial natural language inference.
Key Point 2
For traditional natural language understanding tasks, fine-tuned models are usually a better choice than LLMs, but if the task requires strong generalization capabilities, LLMs can help.
4.2 Generation Task
The goal of natural language generation is to create coherent, meaningful and contextually appropriate sequences of symbols. It roughly covers two broad categories of tasks. The first category focuses on converting input text into new sequences of symbols. Examples include paragraph summarization and machine translation. The second category is "open generation," where the goal is to generate text or symbols from scratch so that they accurately match the input description, such as writing an email, writing a news article, creating a fictional story, or writing code.
Suitable for LLMs: Generation tasks require the model to fully understand the input content or requirements, and also require a certain degree of creativity. This is what LLMs excel at.
Not suitable for LLMs: On most resource-rich translation tasks and on very low-resource translation tasks, fine-tuned models such as DeltaLM+Zcode perform better. For resource-rich machine translation, fine-tuned models slightly outperform LLMs. For machine translation with very few resources, such as English-Kazakh translation, fine-tuned models significantly outperform LLMs.
Key Point 3
Thanks to its strong generation ability and creativity, LLM has advantages in most generation tasks.
4.3 Knowledge-intensive tasks
Knowledge-intensive NLP tasks are a category of tasks that rely heavily on background knowledge, domain-specific expertise, or general real-world knowledge. These tasks require more than pattern recognition or syntactic analysis. They rely heavily on memorization and the appropriate use of knowledge about specific entities, events, and common sense in our real world.
Suitable for LLMs: Generally speaking, with billions of training tokens and parameters, LLMs can contain far more real-world knowledge than fine-tuned models.
Not suitable for LLMs: Some other tasks require knowledge different from what LLMs learn, that is, knowledge that is not part of what the LLM has learned about the real world. In such tasks, LLMs have no clear advantage.
Key Point 4
(1) Thanks to their vast real-world knowledge, LLMs are good at handling knowledge-intensive tasks. (2) LLMs struggle when the knowledge required does not match the knowledge they have learned; and when the task only requires contextual knowledge, fine-tuned models can perform as well as LLMs.
4.4 Capabilities Related to Scaling
Scaling up LLMs (for example, in parameters and training compute) can greatly strengthen pre-trained language models. Increasing model size often improves the model's ability to handle multiple tasks. On certain metrics, model performance shows a power-law relationship with model size. For example, the cross-entropy loss used to measure language modeling performance decreases linearly as model size grows exponentially, which is also known as the "scaling law." For some key capabilities, such as reasoning, scaling up the model can gradually raise them from a very low level to a usable level, even close to human level. This subsection discusses the use of LLMs in terms of the impact of scale on their capabilities and behavior.
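As a rough illustration of what a power-law relationship means in practice, the sketch below fits a curve of the assumed form L(N) = a * N^(-alpha) to (model size, loss) pairs by linear regression in log-log space. The functional form, the helper `fit_power_law` and the data points are all assumptions for illustration; the numbers are synthetic and are not measurements from any real model.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size).

    Returns (a, alpha) for the assumed form loss = a * size ** -alpha.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data generated from a known power law (not real measurements).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.08 for n in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

On a log-log plot such a relationship is a straight line, which is why each tenfold increase in model size reduces the loss by the same factor under this assumed form.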
LLM use cases in reasoning: Reasoning involves understanding information, making inferences and making decisions, and is a core ability of human intelligence. For NLP, reasoning is extremely challenging. Many existing reasoning tasks can be divided into two categories: commonsense reasoning and arithmetic reasoning. Scaling up the model can greatly improve the arithmetic reasoning ability of LLMs. Commonsense reasoning requires the LLM not only to remember factual knowledge but also to perform some reasoning steps over those facts. Commonsense reasoning capabilities gradually improve as model size increases. Compared to fine-tuned models, LLMs perform better on most datasets.
LLM use cases in emergent capabilities: Increasing the size of the model can also give it some unprecedented and remarkable capabilities that go beyond power-law behavior. These are called "emergent abilities." As defined in the paper "Emergent Abilities of Large Language Models," an emergent ability of LLMs is an ability that small-scale models do not have but that appears in large-scale models. This means that we cannot infer or predict such an ability from the performance improvements of small-scale models; on some tasks, once the size of the model exceeds a certain level, it may suddenly achieve excellent performance. Emergent capabilities are often unpredictable and surprising, and can give a model the ability to handle tasks that arise unexpectedly.
Not suitable for LLMs, and understanding emergence: Although in most cases larger models perform better, there are exceptions.
On some tasks, as the scale of LLM increases, the model performance will begin to decline. This is also known as the Inverse Scaling Phenomenon. In addition, the researchers also observed another interesting phenomenon related to scale, namely the U-shaped Phenomenon. As the name suggests, this phenomenon means that as the LLM model grows larger, its performance on a specific task will initially improve, then start to decline, and then improve again.
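The different scaling behaviors described here can be told apart by looking at how a score changes across increasing model sizes. The sketch below is a hypothetical helper written for this illustration; `classify_scaling_trend` and the example scores are assumptions, not results from the paper.

```python
def classify_scaling_trend(scores):
    """Classify task scores listed in order of increasing model size.

    Consecutive changes are collapsed into runs of "up" and "down", and
    the resulting pattern is matched against the behaviors in the text.
    """
    directions = []
    for prev, cur in zip(scores, scores[1:]):
        if cur == prev:
            continue  # ignore plateaus
        step = "up" if cur > prev else "down"
        if not directions or directions[-1] != step:
            directions.append(step)
    if directions == ["up"]:
        return "improves with scale"
    if directions == ["down"]:
        return "inverse scaling"
    if directions == ["up", "down", "up"]:
        return "U-shaped"  # improves, declines, then improves again
    return "other"


print(classify_scaling_trend([0.20, 0.40, 0.60]))
print(classify_scaling_trend([0.60, 0.50, 0.30]))
print(classify_scaling_trend([0.30, 0.50, 0.40, 0.70]))
```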
To advance research in this area, we must gain a deeper understanding of emergent capabilities, the inverse scaling phenomenon, and the U-shaped phenomenon.
Key Point 5
(1) As model size increases exponentially, the arithmetic reasoning and commonsense reasoning capabilities of LLMs also increase. (2) As LLMs scale up, emergent capabilities can bring serendipitous new uses, such as word manipulation and logical abilities. (3) Model capabilities do not always increase with scale, and our understanding of the relationship between the capabilities of large language models and scale is still limited.
4.5 Miscellaneous Tasks
To better understand the strengths and weaknesses of LLMs, this section discusses other tasks not covered above.
Not suitable for LLMs: LLMs often struggle with tasks whose objectives and data differ from those they were trained on.
Suitable for LLMs: LLMs are especially suitable for certain specific tasks. For example, LLMs are very good at imitating humans. LLMs can also be used to evaluate output quality on certain NLG tasks such as summarization and translation. Some capabilities of LLMs also bring benefits beyond performance improvements, such as interpretability.
Key Point 6
(1) For tasks that are far from the pre-training objectives and data of LLMs, fine-tuned models and domain-specific models still have their place. (2) LLMs are good at imitating humans, and at data annotation and generation. They can also be used for quality assessment of NLP tasks and bring benefits such as interpretability.
4.6 Real-world "Tasks"
Finally, this section discusses the application of LLMs and fine-tuned models to real-world "tasks." The term "task" is used loosely here because, unlike academic settings, real-world settings often lack well-formed definitions. Many requests made of models cannot even be considered NLP tasks. The real-world challenges faced by the model come from the following three aspects:
Essentially, these real-world challenges posed by user requests arise because the requests deviate from the distribution of any NLP dataset designed for a specific task. Public NLP datasets do not reflect how these models are actually used.
Key Point 7
Compared to fine-tuned models, LLMs are better suited to handling real-world scenarios. However, assessing the effectiveness of models in the real world remains an open question.
Although LLMs are suitable for a variety of downstream tasks, there are other factors to consider, such as efficiency and trustworthiness. Efficiency issues include the training cost of LLMs, inference latency, and parameter-efficient tuning strategies. In terms of trustworthiness, the robustness and calibration of LLMs, fairness and bias, potential spurious correlations, and safety challenges need to be considered.
Key Point 8
(1) If the task is cost-sensitive or has strict latency requirements, lightweight, local fine-tuned models should be prioritized. When deploying and delivering your model, consider parameter-efficient tuning.
(2) The zero-shot approach of LLMs prevents them from learning shortcuts from task-specific data sets, a problem that is common in fine-tuned models. Nonetheless, LLMs still exhibit certain shortcut learning problems.
(3) Since potentially harmful or biased outputs and hallucinations from LLMs may lead to serious consequences, safety issues related to LLMs should receive the greatest attention. Methods such as learning from human feedback show promise for alleviating these problems.
This practical guide provides insights into LLM and best practices for using LLM on a variety of NLP tasks. Hopefully this will help researchers and practitioners harness the potential of LLM and drive innovation in language technology.
Of course, LLM also has some challenges that need to be solved: