Translator: Cui Hao | Reviewer: Sun Shujuan
Most LLMs follow a similar life cycle. First, "upstream", the model is pre-trained. Because of the enormous data and compute requirements, this stage is mostly the prerogative of large technology companies and universities. Recently, there have also been collaborations (such as the BigScience workshop) to jointly advance the LLM field, and a handful of well-funded startups, such as Cohere and AI21 Labs, also offer pre-trained LLMs.
After release, the model is adopted and deployed "downstream" by application-focused developers and enterprises. At this stage, most models require an additional fine-tuning step to fit the specific domain and task. Others, like GPT-3, are more convenient because they can learn a variety of language tasks directly at prediction time (zero- or few-shot learning).
Finally, time knocks on the door and a better model appears around the corner, whether with more parameters, more efficient use of hardware, or a more fundamental improvement in the modeling of human language. Models that bring substantial innovation can spawn entire model families: BERT, for example, lives on in BERT-QA, DistilBERT, and RoBERTa, which are all based on the original architecture.
In the following chapters, we will explore the first two stages of this life cycle - pre-training and fine-tuning for deployment.
Most teams and NLP practitioners will not be involved in pre-training an LLM, but rather in fine-tuning and deploying it. However, to successfully pick and use a model, it is important to understand what is going on "under the hood." In this section, we will look at the basic ingredients of an LLM.
Each of these ingredients affects not only the selection, but also the fine-tuning and deployment of an LLM.
Most of the data used to train LLMs is text covering different styles, such as literature, user-generated content, and news. After seeing a wide variety of text types, the resulting model becomes aware of the fine details of language. In addition to textual data, code is often used as input to teach the model to generate valid programs and code snippets.
As you would expect, the quality of the training data has a direct impact on model performance, and also on the required size of the model. If you prepare the training data in a smarter way, you can improve the quality of a model while reducing its size. One example is the T0 model, which is 16 times smaller than GPT-3 yet outperforms it on a range of benchmark tasks. The trick: instead of using just any text as training data, it works directly with explicit task formulations, making its learning signal much more focused. Figure 3 illustrates some training examples.
Figure 3: T0 trained on a wide range of well-defined language tasks
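To make this concrete, here is a minimal sketch of what such an explicit task formulation might look like for a natural language inference example. The template wording and the example are illustrative assumptions, not the actual prompts used to train T0.

```python
# A minimal sketch of prompted task formatting in the spirit of T0.
# The template wording below is illustrative, not T0's exact prompt.

def to_prompted_example(premise: str, hypothesis: str, label: str) -> dict:
    """Turn a raw NLI example into an explicit (input, target) text pair."""
    input_text = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis? Answer yes, no, or maybe."
    )
    return {"input": input_text, "target": label}

example = to_prompted_example(
    premise="A dog is running through the park.",
    hypothesis="An animal is outside.",
    label="yes",
)
print(example["input"])
print("->", example["target"])
```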
A final note on training data: we often hear that language models are trained in an unsupervised manner. While this makes them appealing, it is technically wrong. Instead, well-formed text already provides the necessary learning signals, sparing us the tedious process of manual data annotation. The labels to be predicted correspond to past and/or future words in a sentence, so the setup is better described as self-supervision: annotation happens automatically and at scale, enabling relatively rapid progress in the field.
Once the training data has been assembled, we need to package it into a form the model can digest. Neural networks are fed algebraic structures (vectors and matrices), and the best algebraic representation of language is an ongoing search, from simple bags of words to representations that carry highly differentiated contextual information. Each new step confronts more of the complexity of natural language and exposes the limitations of the current representation.
The basic unit of language is the word. In the early days of NLP, this gave rise to the bag-of-words representation, which throws all the words of a text together regardless of their ordering. Consider the two example sentences below.
In the bag-of-words world, these sentences would receive exactly the same representation because they are made up of the same words. Obviously, this captures only a small part of their meaning.
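The original example sentences are shown as images in the source article; the sketch below makes the same point with two made-up sentences that contain identical words in a different order.

```python
from collections import Counter

# Two (made-up) sentences that contain exactly the same words in a different order.
s1 = "the dog chased the cat"
s2 = "the cat chased the dog"

bow1 = Counter(s1.split())
bow2 = Counter(s2.split())

print(bow1)          # Counter({'the': 2, 'dog': 1, 'chased': 1, 'cat': 1})
print(bow1 == bow2)  # True: identical bag-of-words despite opposite meanings
```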
Sequence representations accommodate information about word order. In deep learning, the processing of sequences was initially implemented with order-aware recurrent neural networks (RNNs). However, going one step further, the underlying structure of language is not purely sequential but hierarchical. In other words, we are not talking about lists, but about trees. Words that are further apart can actually have stronger syntactic and semantic connections than adjacent words. Consider the example below.
Here, the pronoun she refers to the girl. By the time an RNN reaches the end of the sentence and finally sees the pronoun, its memory of the beginning of the sentence may already be fading, so it cannot recover the relationship.
To resolve these long-range dependencies, more complex neural structures were proposed that build a more selective memory of context. The idea is to keep the words relevant to future predictions in memory and forget the others. This was the contribution of Long Short-Term Memory (LSTM) cells and Gated Recurrent Units (GRUs). However, these models do not optimize for the specific position to be predicted, but rather for a generic future context, and due to their complex structure they are even slower to train than traditional RNNs.
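For readers who prefer code, here is a minimal sketch of how an LSTM processes a sequence, using PyTorch (assumed to be installed); the dimensions are arbitrary toy values.

```python
import torch
import torch.nn as nn

# A single-layer LSTM reading a batch of token-embedding sequences.
embedding_dim, hidden_dim = 32, 64
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)

batch = torch.randn(8, 20, embedding_dim)   # 8 sequences of 20 embedded tokens
outputs, (h_n, c_n) = lstm(batch)

print(outputs.shape)  # torch.Size([8, 20, 64]) - one hidden state per position
print(h_n.shape)      # torch.Size([1, 8, 64]) - final hidden state per sequence
```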
Finally, recurrence was abandoned in favor of the attention mechanism, which was incorporated into the Transformer architecture. Attention allows the model to focus back and forth between different words during prediction. Each word is weighted according to its relevance to the specific position to be predicted. For the sentence above, once the model reaches the position of "she", girl receives a higher weight than at, even though it is much further away in the linear word order.
To date, the attention mechanism comes closest to the biological workings of the human brain during information processing. Research has shown that attention can learn hierarchical syntactic structures, including a range of complex syntactic phenomena. It also allows for parallel computation, and thus faster and more efficient training.
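As a rough illustration of how attention weights every word by its relevance, here is a minimal NumPy sketch of scaled dot-product self-attention; it omits the learned query, key, and value projections and the multiple heads of a real Transformer.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Textbook scaled dot-product attention over a single sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance of every position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V, weights

seq_len, d_model = 6, 8
x = np.random.randn(seq_len, d_model)    # toy "embeddings" for a 6-token sentence
output, weights = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(weights.shape)  # (6, 6): one weight per (query position, key position) pair
```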
With an appropriate representation of the training data, our model can start learning. There are three general objectives used for pre-training language models: sequence-to-sequence transduction, autoregression, and autoencoding. All of them require the model to acquire broad linguistic knowledge.
The original task addressed by the encoder-decoder architecture and the Transformer model is sequence-to-sequence transduction: a sequence is converted into a sequence in a different representation framework. The classic sequence-to-sequence task is machine translation, but other tasks, such as summarization, are also frequently formulated this way. Note that the target sequence does not have to be text; it can also be other unstructured data, such as images, or structured data, such as programming languages. An example of a sequence-to-sequence LLM family is BART.
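As an illustration, a pre-trained BART checkpoint can be used for the classic sequence-to-sequence task of summarization via the Hugging Face transformers library (assuming it is installed and the model weights can be downloaded):

```python
from transformers import pipeline

# Sequence-to-sequence inference with a pre-trained BART checkpoint.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Large language models follow a common life cycle: they are pre-trained "
    "upstream on huge text corpora, then adopted downstream by application "
    "teams, who usually fine-tune them for a specific domain and task."
)
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])
```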
The second task is autoregression, which is also the original language modeling objective. In autoregression, the model learns to predict the next output (token) based on the previous tokens. The learning signal is limited by the unidirectionality of the setup: the model can only use information from the left or from the right of the predicted token. This is a major limitation because words can depend on both past and future positions. As an example, consider how the verb written affects the sentence below in both directions.
Here, the position of the paper is restricted to something writable, and the position of the student is restricted to a human being or, at any rate, another intelligent entity capable of writing.
Many of the LLMs in today’s headlines are autoregressive, including the GPT series, PaLM, and BLOOM.
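Here is a minimal sketch of autoregressive generation with GPT-2 via the transformers library (assumed installed); the model continues the prompt one token at a time, each prediction conditioned only on the tokens to its left.

```python
from transformers import pipeline

# Autoregressive generation: the model repeatedly predicts the next token
# given everything to its left.
generator = pipeline("text-generation", model="gpt2")

prompt = "The student has written"
result = generator(prompt, max_new_tokens=20, do_sample=True)
print(result[0]["generated_text"])
```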
The third task, autoencoding, solves the problem of unidirectionality. Autoencoding is very similar to the learning of classic word embeddings. First, we corrupt the training data by hiding a certain proportion of tokens in the input (typically 10-20%). The model then learns to reconstruct the correct input from its surroundings, taking into account both the preceding and the following tokens. A typical example of an autoencoder is the BERT family, where BERT stands for Bidirectional Encoder Representations from Transformers.
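The same library exposes masked-token reconstruction directly; the sketch below (again assuming transformers is installed) asks BERT to fill in a hidden token using both its left and right context.

```python
from transformers import pipeline

# Autoencoding (masked language modelling): the model reconstructs a hidden
# token from both its left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The student has written a [MASK] about transformers."):
    print(f'{prediction["token_str"]:>10}  {prediction["score"]:.3f}')
```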
The basic components of a language model are the encoder and the decoder. The encoder transforms the raw input into a high-dimensional algebraic representation, also called a "hidden" vector. Wait a minute, hidden? Well, there is no big secret at this point. Sure, you can look at this representation, but a lengthy vector of numbers will not convey anything meaningful to a human; it takes the mathematical intelligence of our model to work with it. The decoder reproduces the hidden representation in an interpretable form, such as another language, programming code, an image, and so on.
Figure 4: Basic pattern of encoder-decoder architecture
The encoder-decoder architecture was originally introduced for recurrent neural networks. Since the introduction of attention-based Transformer models, plain recurrence has lost its popularity, while the encoder-decoder idea has persisted. Most natural language understanding (NLU) tasks rely on the encoder, natural language generation (NLG) tasks need the decoder, and sequence-to-sequence transduction requires both components.
We will not go into the details of the Transformer architecture and the attention mechanism here. For those who want to master these details, be prepared to spend quite some time wrapping your head around them.
Language modeling is a powerful upstream task: if you have a model that successfully generates language, congratulations, it is an intelligent model. In practice, however, NLP is mostly used for more targeted downstream tasks such as sentiment analysis, question answering, and information extraction. This is where transfer learning comes in: existing linguistic knowledge is reused to tackle more specific challenges. During fine-tuning, part of the model is "frozen" and the remaining parts are further trained with domain- or task-specific data.
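A minimal sketch of this "freezing" idea with the Hugging Face transformers library (assumed installed): the pre-trained BERT encoder is frozen and only the newly added classification head remains trainable. In practice, teams often unfreeze some or all of the encoder layers as well.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load a pre-trained encoder with a fresh classification head on top.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# "Freeze" the pre-trained encoder so only the new classification head is trained.
for param in model.bert.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the classifier weights remain trainable

# From here, a standard training loop (or the Trainer API) on labelled
# domain data updates just the unfrozen parameters.
```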
Explicit fine-tuning adds complexity on the path toward LLM deployment. It can also lead to model explosion, where each business task requires its own fine-tuned model, resulting in an unmaintainable variety of models. Hence, efforts have been made to get rid of the fine-tuning step using few-shot or zero-shot learning (for example in GPT-3). This learning happens at prediction time: the model is given a "prompt", a task description and possibly a few training examples, to guide its predictions for new instances.
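A hypothetical few-shot prompt might look like the sketch below; the task wording and the examples are made up for illustration.

```python
# A hypothetical few-shot prompt: the task description and a handful of
# examples are passed at prediction time, with no gradient updates.
prompt = """Classify the sentiment of each review as positive or negative.

Review: The battery dies within an hour.
Sentiment: negative

Review: Setup took two minutes and it works flawlessly.
Sentiment: positive

Review: The screen cracked after one week.
Sentiment:"""

# The prompt would then be sent to a hosted LLM, e.g. via a cloud API;
# the model is expected to continue the text with "negative".
print(prompt)
```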
While much faster to implement, the convenience of zero- and few-shot learning is offset by lower prediction quality. Additionally, many of these models need to be accessed via cloud APIs. Early in development this may be a welcome opportunity; at more advanced stages, however, it can turn into yet another unwanted external dependency.
Given the continuous supply of new language models on the AI market, selecting the right model for a specific downstream task and staying in sync with the state of the art can be tricky.
Research papers typically benchmark each model on specific downstream tasks and datasets. Standardized task suites, such as SuperGLUE and BIG-bench, allow unified benchmarking across numerous NLP tasks and provide a basis for comparison. However, we should keep in mind that these tests are prepared in a highly controlled setting. As of today, the generalization capabilities of language models are rather limited, so transferring to real-life datasets can significantly affect model performance. Evaluating and selecting an appropriate model should therefore include experiments on data that is as close as possible to the production data.
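A minimal sketch of such an experiment: run each shortlisted checkpoint on a small labelled sample of your own data and compare. The checkpoint name and the two-example "sample" below are placeholders, to be replaced with real candidates and real production-like data.

```python
from transformers import pipeline

# Hypothetical shortlist: compare candidate checkpoints on a small sample of
# production-like labelled data rather than on benchmark scores alone.
candidates = ["distilbert-base-uncased-finetuned-sst-2-english"]  # extend as needed
sample = [("The onboarding flow is confusing.", "NEGATIVE"),
          ("Support resolved my issue in minutes.", "POSITIVE")]

for name in candidates:
    clf = pipeline("text-classification", model=name)
    hits = sum(clf(text)[0]["label"] == gold for text, gold in sample)
    print(f"{name}: {hits}/{len(sample)} correct on the sample")
```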
As a rule of thumb, the pre-training objective provides an important hint: autoregressive models perform well on text generation tasks such as conversational AI, question answering, and text summarization, while autoencoders excel at "understanding" and structuring language, for example for sentiment analysis and various information extraction tasks. In theory, models used for zero-shot learning can perform all kinds of tasks as long as they receive appropriate prompts; however, their accuracy is usually lower than that of fine-tuned models.
To make things more concrete, the figure below shows how popular NLP tasks relate to language models prominent in the NLP literature. These associations are calculated based on a variety of similarity and aggregation measures, including embedding similarity and distance-weighted co-occurrence. Higher-scoring model-task pairs, such as BART/Text Summarization and LaMDA/Conversational AI, indicate good matches based on historical data.
Figure 5: Association strength between language model and downstream tasks
In this article, we have covered the basic concepts of LLMs and the main dimensions along which innovation is happening. The table below summarizes the key features of the most popular LLMs.
Table 1: Summary of features of the most popular large-scale language models
Let us summarize the general guidelines for selecting and deploying an LLM.
1. When evaluating potential models, be clear about where you are in the AI journey.
2. To align with your downstream tasks, the AI team should create a shortlist of models based on the following criteria.
Benchmark results for your downstream tasks, as reported in the academic literature.
Alignment between the pre-training objective and the downstream task: consider autoencoding for NLU and autoregression for NLG.
Previously reported experience with this model-task combination.
3. Test the shortlisted models against real-world tasks and datasets that reflect your use case to get an initial feel for their performance.
4. In most cases, it is possible to achieve better quality through specialized fine-tuning. However, if you don’t have the in-house technical capabilities or budget for fine-tuning, or you need to cover a large number of tasks, consider few/zero-shot learning.
5. LLM innovations and trends are short-lived. When working with language models, keep an eye on their life cycle and on the overall activity in the LLM field, and watch for opportunities to step up your game.
Finally, be aware of the limitations of LLMs. While they have an amazing, human-like ability to produce language, their overall cognitive abilities fall short of ours. The world knowledge and reasoning capabilities of these models are strictly limited to the information they find at the surface of language. They also cannot keep facts up to date and may serve you outdated information without blinking an eye. If you are building an application that relies on generating up-to-date or even original knowledge, consider combining your LLM with additional multimodal, structured, or dynamic knowledge sources.
Original link: https://www.topbots.com/choosing-the-right-language-model/
Cui Hao, 51CTO community editor, is a senior architect with 18 years of experience in software development and architecture and 10 years of experience in distributed architecture.