Translator: Cui Hao | Reviewer: Sun Shujuan
Most LLMs follow a similar life cycle. First, "upstream", the model is pre-trained. Because of the enormous data and compute requirements, this stage is mostly the prerogative of large technology companies and universities. Recently, there have also been collaborations (such as the BigScience workshop) to jointly advance the LLM field, and a handful of well-funded startups, such as Cohere and AI21 Labs, also offer pre-trained LLMs.
After release, the model is adopted and deployed "downstream" by application-focused developers and enterprises. At this stage, most models require an additional fine-tuning step to fit the specific domain and task. Others, like GPT-3, are more convenient because they can learn a variety of language tasks directly at prediction time (zero- or few-shot learning).
Finally, time knocks on the door and a better model appears around the corner, whether with more parameters, more efficient use of hardware, or a more fundamental improvement in the modeling of human language. Models that bring substantial innovation can spawn entire model families: BERT, for example, lives on in BERT-QA, DistilBERT, and RoBERTa, which are all based on the original architecture.
In the following chapters, we will explore the first two stages of this life cycle - pre-training and fine-tuning for deployment.
Most teams and NLP practitioners will not be involved in pre-training an LLM, but rather in fine-tuning and deploying it. However, to successfully pick and use a model, it is important to understand what is going on "under the hood." In this section, we will look at the basic ingredients of an LLM.
Each of these ingredients affects not only the selection, but also the fine-tuning and deployment of an LLM.
Most of the data used to train LLMs is text covering different styles, such as literature, user-generated content, and news. After seeing a wide variety of text types, the resulting model becomes aware of the fine details of language. In addition to textual data, code is often used as input to teach the model to generate valid programs and code snippets.
As you would expect, the quality of the training data has a direct impact on model performance, and also on the required size of the model. If you prepare the training data in a smarter way, you can improve the quality of a model while reducing its size. One example is the T0 model, which is 16 times smaller than GPT-3 yet outperforms it on a range of benchmark tasks. The trick: instead of using just any text as training data, it works directly with explicit task formulations, making its learning signal much more focused. Figure 3 illustrates some training examples.
Figure 3: T0 trained on a wide range of well-defined language tasks
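To make this concrete, here is a minimal sketch of what such an explicit task formulation might look like for a natural language inference example. The template wording and the example are illustrative assumptions, not the actual prompts used to train T0.

```python
# A minimal sketch of prompted task formatting in the spirit of T0.
# The template wording below is illustrative, not T0's exact prompt.

def to_prompted_example(premise: str, hypothesis: str, label: str) -> dict:
    """Turn a raw NLI example into an explicit (input, target) text pair."""
    input_text = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis? Answer yes, no, or maybe."
    )
    return {"input": input_text, "target": label}

example = to_prompted_example(
    premise="A dog is running through the park.",
    hypothesis="An animal is outside.",
    label="yes",
)
print(example["input"])
print("->", example["target"])
```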
A final note on training data: we often hear that language models are trained in an unsupervised manner. While this makes them appealing, it is technically wrong. Instead, well-formed text already provides the necessary learning signals, sparing us the tedious process of manual data annotation. The labels to be predicted correspond to past and/or future words in a sentence, so the setup is better described as self-supervision: annotation happens automatically and at scale, enabling relatively rapid progress in the field.
Once the training data has been assembled, we need to package it into a form the model can digest. Neural networks are fed algebraic structures (vectors and matrices), and the best algebraic representation of language is an ongoing search, from simple bags of words to representations that carry highly differentiated contextual information. Each new step confronts more of the complexity of natural language and exposes the limitations of the current representation.
The basic unit of language is the word. In the early days of NLP, this gave rise to the bag-of-words representation, which throws all the words of a text together regardless of their ordering. Consider the two example sentences below.
In the bag-of-words world, these sentences would receive exactly the same representation because they are made up of the same words. Obviously, this captures only a small part of their meaning.
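The original example sentences are shown as images in the source article; the sketch below makes the same point with two made-up sentences that contain identical words in a different order.

```python
from collections import Counter

# Two (made-up) sentences that contain exactly the same words in a different order.
s1 = "the dog chased the cat"
s2 = "the cat chased the dog"

bow1 = Counter(s1.split())
bow2 = Counter(s2.split())

print(bow1)          # Counter({'the': 2, 'dog': 1, 'chased': 1, 'cat': 1})
print(bow1 == bow2)  # True: identical bag-of-words despite opposite meanings
```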
Sequence representations accommodate information about word order. In deep learning, the processing of sequences was initially implemented with order-aware recurrent neural networks (RNNs). However, going one step further, the underlying structure of language is not purely sequential but hierarchical. In other words, we are not talking about lists, but about trees. Words that are further apart can actually have stronger syntactic and semantic connections than adjacent words. Consider the example below.
Here, the pronoun she refers to the girl. By the time an RNN reaches the end of the sentence and finally sees the pronoun, its memory of the beginning of the sentence may already be fading, so it cannot recover the relationship.
To resolve these long-range dependencies, more complex neural structures were proposed that build a more selective memory of context. The idea is to keep the words relevant to future predictions in memory and forget the others. This was the contribution of Long Short-Term Memory (LSTM) cells and Gated Recurrent Units (GRUs). However, these models do not optimize for the specific position to be predicted, but rather for a generic future context, and due to their complex structure they are even slower to train than traditional RNNs.
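For readers who prefer code, here is a minimal sketch of how an LSTM processes a sequence, using PyTorch (assumed to be installed); the dimensions are arbitrary toy values.

```python
import torch
import torch.nn as nn

# A single-layer LSTM reading a batch of token-embedding sequences.
embedding_dim, hidden_dim = 32, 64
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)

batch = torch.randn(8, 20, embedding_dim)   # 8 sequences of 20 embedded tokens
outputs, (h_n, c_n) = lstm(batch)

print(outputs.shape)  # torch.Size([8, 20, 64]) - one hidden state per position
print(h_n.shape)      # torch.Size([1, 8, 64]) - final hidden state per sequence
```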
Finally, recurrence was abandoned in favor of the attention mechanism, which was incorporated into the Transformer architecture. Attention allows the model to focus back and forth between different words during prediction. Each word is weighted according to its relevance to the specific position to be predicted. For the sentence above, once the model reaches the position of "she", girl receives a higher weight than at, even though it is much further away in the linear word order.
To date, the attention mechanism comes closest to the biological workings of the human brain during information processing. Research has shown that attention can learn hierarchical syntactic structures, including a range of complex syntactic phenomena. It also allows for parallel computation, and thus faster and more efficient training.
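As a rough illustration of how attention weights every word by its relevance, here is a minimal NumPy sketch of scaled dot-product self-attention; it omits the learned query, key, and value projections and the multiple heads of a real Transformer.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Textbook scaled dot-product attention over a single sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance of every position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V, weights

seq_len, d_model = 6, 8
x = np.random.randn(seq_len, d_model)    # toy "embeddings" for a 6-token sentence
output, weights = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(weights.shape)  # (6, 6): one weight per (query position, key position) pair
```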
With an appropriate representation of the training data, our model can start learning. There are three general objectives used for pre-training language models: sequence-to-sequence transduction, autoregression, and autoencoding. All of them require the model to acquire broad linguistic knowledge.
The original task addressed by the encoder-decoder architecture and the Transformer model is sequence-to-sequence transduction: a sequence is converted into a sequence in a different representation framework. The classic sequence-to-sequence task is machine translation, but other tasks, such as summarization, are also frequently formulated this way. Note that the target sequence does not have to be text; it can also be other unstructured data, such as images, or structured data, such as programming languages. An example of a sequence-to-sequence LLM family is BART.
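As an illustration, a pre-trained BART checkpoint can be used for the classic sequence-to-sequence task of summarization via the Hugging Face transformers library (assuming it is installed and the model weights can be downloaded):

```python
from transformers import pipeline

# Sequence-to-sequence inference with a pre-trained BART checkpoint.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Large language models follow a common life cycle: they are pre-trained "
    "upstream on huge text corpora, then adopted downstream by application "
    "teams, who usually fine-tune them for a specific domain and task."
)
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])
```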
The second task is autoregression, which is also the original language modeling objective. In autoregression, the model learns to predict the next output (token) based on the previous tokens. The learning signal is limited by the unidirectionality of the setup: the model can only use information from the left or from the right of the predicted token. This is a major limitation because words can depend on both past and future positions. As an example, consider how the verb written affects the sentence below in both directions.
Here, the position of the paper is restricted to something writable, and the position of the student is restricted to a human being or, at any rate, another intelligent entity capable of writing.
Many of the LLMs in today’s headlines are autoregressive, including the GPT series, PaLM, and BLOOM.
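Here is a minimal sketch of autoregressive generation with GPT-2 via the transformers library (assumed installed); the model continues the prompt one token at a time, each prediction conditioned only on the tokens to its left.

```python
from transformers import pipeline

# Autoregressive generation: the model repeatedly predicts the next token
# given everything to its left.
generator = pipeline("text-generation", model="gpt2")

prompt = "The student has written"
result = generator(prompt, max_new_tokens=20, do_sample=True)
print(result[0]["generated_text"])
```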
The third task, autoencoding, solves the problem of unidirectionality. Autoencoding is very similar to the learning of classic word embeddings. First, we corrupt the training data by hiding a certain proportion of tokens in the input (typically 10-20%). The model then learns to reconstruct the correct input from its surroundings, taking into account both the preceding and the following tokens. A typical example of an autoencoder is the BERT family, where BERT stands for Bidirectional Encoder Representations from Transformers.
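The same library exposes masked-token reconstruction directly; the sketch below (again assuming transformers is installed) asks BERT to fill in a hidden token using both its left and right context.

```python
from transformers import pipeline

# Autoencoding (masked language modelling): the model reconstructs a hidden
# token from both its left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The student has written a [MASK] about transformers."):
    print(f'{prediction["token_str"]:>10}  {prediction["score"]:.3f}')
```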
The basic components of a language model are the encoder and the decoder. The encoder transforms the raw input into a high-dimensional algebraic representation, also called a "hidden" vector. Wait a minute, hidden? Well, there is no big secret at this point. Sure, you can look at this representation, but a lengthy vector of numbers will not convey anything meaningful to a human; it takes the mathematical intelligence of our model to work with it. The decoder reproduces the hidden representation in an interpretable form, such as another language, programming code, an image, and so on.
Figure 4: Basic pattern of encoder-decoder architecture
The encoder-decoder architecture was originally introduced for recurrent neural networks. Since the introduction of attention-based Transformer models, plain recurrence has lost its popularity, while the encoder-decoder idea has persisted. Most natural language understanding (NLU) tasks rely on the encoder, natural language generation (NLG) tasks need the decoder, and sequence-to-sequence transduction requires both components.
We will not go into the details of the Transformer architecture and the attention mechanism here. For those who want to master these details, be prepared to spend quite some time wrapping your head around them.
Language modeling is a powerful upstream task: if you have a model that successfully generates language, congratulations, it is an intelligent model. In practice, however, NLP is mostly used for more targeted downstream tasks such as sentiment analysis, question answering, and information extraction. This is where transfer learning comes in: existing linguistic knowledge is reused to tackle more specific challenges. During fine-tuning, part of the model is "frozen" and the remaining parts are further trained with domain- or task-specific data.
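A minimal sketch of this "freezing" idea with the Hugging Face transformers library (assumed installed): the pre-trained BERT encoder is frozen and only the newly added classification head remains trainable. In practice, teams often unfreeze some or all of the encoder layers as well.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load a pre-trained encoder with a fresh classification head on top.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# "Freeze" the pre-trained encoder so only the new classification head is trained.
for param in model.bert.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the classifier weights remain trainable

# From here, a standard training loop (or the Trainer API) on labelled
# domain data updates just the unfrozen parameters.
```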
Explicit fine-tuning adds complexity on the path toward LLM deployment. It can also lead to model explosion, where each business task requires its own fine-tuned model, resulting in an unmaintainable variety of models. Hence, efforts have been made to get rid of the fine-tuning step using few-shot or zero-shot learning (for example in GPT-3). This learning happens at prediction time: the model is given a "prompt", a task description and possibly a few training examples, to guide its predictions for new instances.
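A hypothetical few-shot prompt might look like the sketch below; the task wording and the examples are made up for illustration.

```python
# A hypothetical few-shot prompt: the task description and a handful of
# examples are passed at prediction time, with no gradient updates.
prompt = """Classify the sentiment of each review as positive or negative.

Review: The battery dies within an hour.
Sentiment: negative

Review: Setup took two minutes and it works flawlessly.
Sentiment: positive

Review: The screen cracked after one week.
Sentiment:"""

# The prompt would then be sent to a hosted LLM, e.g. via a cloud API;
# the model is expected to continue the text with "negative".
print(prompt)
```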
While much faster to implement, the convenience of zero- and few-shot learning is offset by lower prediction quality. Additionally, many of these models need to be accessed via cloud APIs. Early in development this may be a welcome opportunity; at more advanced stages, however, it can turn into yet another unwanted external dependency.
Given the continuous supply of new language models on the AI market, selecting the right model for a specific downstream task and staying in sync with the state of the art can be tricky.
Research papers typically benchmark each model on specific downstream tasks and datasets. Standardized task suites, such as SuperGLUE and BIG-bench, allow unified benchmarking across numerous NLP tasks and provide a basis for comparison. However, we should keep in mind that these tests are prepared in a highly controlled setting. As of today, the generalization capabilities of language models are rather limited, so transferring to real-life datasets can significantly affect model performance. Evaluating and selecting an appropriate model should therefore include experiments on data that is as close as possible to the production data.
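A minimal sketch of such an experiment: run each shortlisted checkpoint on a small labelled sample of your own data and compare. The checkpoint name and the two-example "sample" below are placeholders, to be replaced with real candidates and real production-like data.

```python
from transformers import pipeline

# Hypothetical shortlist: compare candidate checkpoints on a small sample of
# production-like labelled data rather than on benchmark scores alone.
candidates = ["distilbert-base-uncased-finetuned-sst-2-english"]  # extend as needed
sample = [("The onboarding flow is confusing.", "NEGATIVE"),
          ("Support resolved my issue in minutes.", "POSITIVE")]

for name in candidates:
    clf = pipeline("text-classification", model=name)
    hits = sum(clf(text)[0]["label"] == gold for text, gold in sample)
    print(f"{name}: {hits}/{len(sample)} correct on the sample")
```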
As a rule of thumb, the pre-training objective provides an important hint: autoregressive models perform well on text generation tasks such as conversational AI, question answering, and text summarization, while autoencoders excel at "understanding" and structuring language, for example for sentiment analysis and various information extraction tasks. In theory, models used for zero-shot learning can perform all kinds of tasks as long as they receive appropriate prompts; however, their accuracy is usually lower than that of fine-tuned models.
To make things more concrete, the figure below shows how popular NLP tasks relate to language models prominent in the NLP literature. These associations are calculated based on a variety of similarity and aggregation measures, including embedding similarity and distance-weighted co-occurrence. Higher-scoring model-task pairs, such as BART/Text Summarization and LaMDA/Conversational AI, indicate good matches based on historical data.
Figure 5: Association strength between language model and downstream tasks
In this article, we have covered the basic concepts of LLMs and the main dimensions along which innovation is happening. The table below summarizes the key features of the most popular LLMs.
Table 1: Summary of features of the most popular large-scale language models
Let us summarize the general guidelines for selecting and deploying an LLM.
1. When evaluating potential models, be clear about where you are in the AI journey.
2. To align with your downstream tasks, the AI team should create a shortlist of models based on the following criteria.
Benchmark results for your downstream tasks, as reported in the academic literature.
Alignment between the pre-training objective and the downstream task: consider autoencoding for NLU and autoregression for NLG.
Previously reported experience with this model-task combination.
3. Test the shortlisted models against real-world tasks and datasets that reflect your use case to get an initial feel for their performance.
4. In most cases, it is possible to achieve better quality through specialized fine-tuning. However, if you don’t have the in-house technical capabilities or budget for fine-tuning, or you need to cover a large number of tasks, consider few/zero-shot learning.
5. LLM innovations and trends are short-lived. When working with language models, keep an eye on their life cycle and on the overall activity in the LLM field, and watch for opportunities to step up your game.
Finally, be aware of the limitations of LLMs. While they have an amazing, human-like ability to produce language, their overall cognitive abilities fall short of ours. The world knowledge and reasoning capabilities of these models are strictly limited to the information they find at the surface of language. They also cannot keep facts up to date and may serve you outdated information without blinking an eye. If you are building an application that relies on generating up-to-date or even original knowledge, consider combining your LLM with additional multimodal, structured, or dynamic knowledge sources.
Original link: https://www.topbots.com/choosing-the-right-language-model/
Cui Hao, 51CTO community editor, is a senior architect with 18 years of experience in software development and architecture and 10 years of experience in distributed architecture.