


A preliminary exploration into the evolution of natural language pre-training technology
Three levels of artificial intelligence:
- Computational ability: for data storage and computation, machines far surpass humans.
- Perceptual ability: vision, hearing, and so on. In speech recognition and image recognition, machines are already comparable to humans.
- Cognitive intelligence: for tasks such as natural language processing, common-sense modeling, and reasoning, machines still have a long way to go.
Natural language processing belongs to the realm of cognitive intelligence. Natural language is abstract, compositional, ambiguous, knowledge-dependent, and constantly evolving, which makes it very challenging for machines to process; for this reason natural language processing is often called the crown jewel of artificial intelligence. In recent years, pre-trained language models represented by BERT have emerged and brought natural language processing into a new era: pre-training a language model and then fine-tuning it for specific tasks. This article attempts to trace the evolution of natural language pre-training technology in the hope of exchanging ideas with readers; criticism and corrections of any shortcomings or errors are welcome.
1. Ancient - Word Representation
1.1 One-hot Encoding
A word is represented by a vector whose length equals the vocabulary size: the position corresponding to the word is set to 1 and all other positions are 0. Disadvantages (a tiny illustration follows the list):
- High-dimensional and sparse.
- Cannot express semantic similarity: the one-hot vectors of two synonyms have a similarity of 0.
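A tiny illustration of the similarity problem, using a toy three-word vocabulary (the words and the NumPy-based construction are illustrative, not from the original text):

```python
import numpy as np

vocab = ["cat", "kitten", "dog"]                      # toy vocabulary
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# "cat" and "kitten" are near-synonyms, yet the dot product (and hence the
# cosine similarity) of their one-hot vectors is 0.
print(one_hot["cat"] @ one_hot["kitten"])             # 0.0
```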
1.2 Distributed Representation
Distributional hypothesis: words with similar contexts have similar meanings, so the semantics of a word can be represented by its context. Based on this idea, each word can be represented by the distribution of its contexts.
1.2.1 Word frequency representation
Based on a corpus, a co-occurrence frequency table is built from each word's contexts; each row of the table is the vector representation of one word. Different choices of context capture different linguistic information. For example, using the words in a fixed window around the target word as its context captures more local information (lexical and syntactic), while using the whole document as the context captures more of the word's topical information (a minimal construction is sketched after the list below). Disadvantages:
- High-frequency words dominate the counts.
- Cannot reflect higher-order relationships: co-occurrence of (A, B), (B, C), and (C, D) does not imply a relationship between A and D.
- The representation is still sparse.
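A minimal sketch of building a window-based co-occurrence table; the toy corpus and window size are illustrative assumptions:

```python
from collections import defaultdict

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
window = 2                                            # symmetric context window

cooc = defaultdict(lambda: defaultdict(int))          # cooc[word][context] = count
for sent in corpus:
    for i, word in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[word][sent[j]] += 1

# The context-count dictionary for "sat" is its (sparse) row vector.
print(dict(cooc["sat"]))
```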
1.2.2 TF-IDF representation
The raw counts in the word-frequency representation are replaced with TF-IDF values, which mainly alleviates the high-frequency-word problem.
1.2.3 Pointwise Mutual Information Representation
This also alleviates the high-frequency-word problem of the word-frequency representation: the raw count is replaced by the pointwise mutual information (PMI) between the word and its context:
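The standard definition, for a word w and a context c, is:

$$\mathrm{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)}$$

Negative values are often clipped to zero, giving the positive PMI: $\mathrm{PPMI}(w, c) = \max(\mathrm{PMI}(w, c),\ 0)$.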
1.2.4 LSA
Applying singular value decomposition (SVD) to the word-frequency matrix yields a low-dimensional, continuous, dense vector representation for each word, which can be regarded as capturing the word's latent semantics; this method is called Latent Semantic Analysis (LSA).
LSA alleviates the problems of high-frequency words, higher-order relationships, and sparsity, and it works quite well with traditional machine learning algorithms, but it still has shortcomings (a minimal sketch follows the list):
- When the vocabulary is large, SVD is slow.
- It is hard to update incrementally: when the corpus changes or new text is added, the decomposition must be recomputed.
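A minimal NumPy sketch of LSA on a toy matrix (the matrix values and the target dimension k are illustrative):

```python
import numpy as np

# X: toy word-by-context count (or TF-IDF / PMI) matrix; rows correspond to words.
X = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 1.0]])

k = 2                                         # target dimensionality
U, S, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :k] * S[:k]               # dense, low-dimensional word vectors
print(word_vectors)
```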
2. Modern Times - Static Word Vectors
The sequential order of text and the co-occurrence relationships between words provide natural self-supervised learning signals for natural language processing, enabling models to learn from raw text without additional manual annotation.
2.1 Word2Vec
2.1.1 CBOW
CBOW (Continuous Bag-of-Words) uses the context (a window around the target word) to predict the target word: the vectors of the context words are arithmetically averaged, and the result is used to predict the probability of the target word.
2.1.2 Skip-gram
Skip-gram works in the opposite direction: it uses the target word to predict each word in its context.
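A minimal sketch of both variants using gensim (assumes gensim >= 4.0, where `vector_size` sets the dimension and the `sg` flag selects skip-gram; the toy sentences are illustrative):

```python
from gensim.models import Word2Vec

sentences = [["natural", "language", "processing", "is", "fun"],
             ["language", "models", "learn", "word", "vectors"]]

# sg=0 -> CBOW (context predicts the target word); sg=1 -> skip-gram (word predicts context)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv["language"][:5])            # a dense static vector for "language"
```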
2.2 GloVe
GloVe (Global Vectors for Word Representation) uses word vectors to predict word co-occurrence counts, performing an implicit matrix factorization. First a distance-weighted co-occurrence matrix X is built from each word's context window; then the word and context vectors are fit to (the logarithm of) the co-occurrence matrix X.
The loss function is a weighted least-squares objective:
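In the GloVe paper's formulation, the word vector $w_i$ and context vector $\tilde{w}_j$ (with biases) are fit to the logarithm of the co-occurrence count, weighted by a capped function $f$:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2,
\qquad
f(x) = \begin{cases} (x/x_{\max})^{\alpha} & x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$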
3. Contemporary - Pre-trained Language Models
3.1 Transformer
3.1.1 Attention Model
The attention model can be understood as a mechanism that computes a weight for each element of a vector sequence and then aggregates the sequence by a weighted sum.
3.1.2 Multi-Head Self-Attention
The attention model used in Transformer can be expressed as:
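This is the standard scaled dot-product attention from the Transformer paper, where $d_k$ is the dimension of the key vectors:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$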
When Q, K, and V come from the same vector sequence, it becomes a self-attention model.
Multi-head self-attention: several groups of self-attention models are run in parallel, their output vectors are concatenated, and a linear map projects the result back to the Transformer hidden size. Multi-head self-attention can be understood as an ensemble of several self-attention models.
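In formula form (from the Transformer paper), each head applies attention to linearly projected Q, K, V, and the concatenated heads are projected by $W^O$:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V),\qquad
\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W^O$$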
3.1.3 Position Encoding
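Self-attention is order-invariant, so the Transformer adds a position encoding to the input embeddings. The original Transformer uses fixed sinusoidal encodings, where pos is the position and i indexes the dimension:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$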
3.1.4 Others
Advantages:
- Compared with an RNN, it can model longer-range dependencies: the attention mechanism reduces the path length between any two words to 1, giving it a stronger ability to model long sequences.
- Compared with an RNN, it makes better use of the parallel computing power of GPUs.
- Strong expressive power.
Disadvantages:
- Compared with an RNN, it has more parameters, which makes training harder and requires more training data.
3.2 Autoregressive Language Model
3.2.1 ELMo
Model structure
ELMo models the forward and backward language models independently with two LSTMs: the forward model factorizes the sequence probability left to right, and the backward model factorizes it right to left.
Optimization goal
Maximize the joint log-likelihood of the forward and backward language models:
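In the ELMo paper's notation, for a sequence $(t_1, \dots, t_N)$, the token-representation parameters $\Theta_x$ and the softmax parameters $\Theta_s$ are shared between the two directions:

$$\sum_{k=1}^{N}\Big(\log p\big(t_k \mid t_1,\dots,t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s\big)
+ \log p\big(t_k \mid t_{k+1},\dots,t_N;\, \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s\big)\Big)$$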
Downstream Application
After ELMo is trained, the following vectors can be used in downstream tasks: h_{k,0} is the word embedding from the input layer, and h_{k,j} (j >= 1) is the concatenation of the forward and backward LSTM outputs at layer j.
For a downstream task, the layer vectors are combined with learned weights to obtain the ELMo representation, and an additional scalar weight scales the resulting ELMo vector:
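In the ELMo paper's notation, $s_j^{task}$ are softmax-normalized layer weights and $\gamma^{task}$ is a task-specific scale:

$$\mathrm{ELMo}_k^{task} = \gamma^{task}\sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}$$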
Different levels of hidden layer vectors contain text information at different levels or granularities:
- The top layer encodes more semantic information
- The bottom layer encodes more lexical and syntactic information
3.2.2 GPT series
GPT-1
Model structure
GPT-1 (Generative Pre-Training) is a unidirectional (left-to-right) language model that stacks 12 Transformer blocks as a decoder. Each block uses a multi-head self-attention mechanism, and the output probability distribution over the vocabulary is then obtained through a fully connected layer with softmax (the forward computation is shown after the notation list below).
- U: One-hot vector of word
- We: Word vector matrix
- Wp: Position vector matrix
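With this notation, the forward computation described in the GPT-1 paper is:

$$h_0 = U W_e + W_p,\qquad h_l = \mathrm{transformer\_block}(h_{l-1}),\ \ l = 1,\dots,n,\qquad
P(u) = \mathrm{softmax}\big(h_n W_e^\top\big)$$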
Optimization goal
Maximize the log-likelihood of the corpus under a left-to-right factorization with context window size k:
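In the GPT-1 paper's notation, with corpus $\mathcal{U} = \{u_1, \dots, u_n\}$ and model parameters $\Theta$:

$$L_1(\mathcal{U}) = \sum_i \log P\big(u_i \mid u_{i-k},\dots,u_{i-1};\, \Theta\big)$$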
Downstream application
For a downstream task with a labeled dataset, each instance consists of an input token sequence x^1, ..., x^m and a label y. The tokens are fed into the pre-trained model to obtain the final-layer feature vector, and the prediction is produced by a fully connected layer with softmax.
The goal of the downstream supervised task is to maximize the likelihood of the labels.
To prevent catastrophic forgetting, the pre-training (language model) loss can be added to the fine-tuning loss with a certain weight, as sketched below:
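A sketch in the GPT-1 paper's notation, where $h_l^m$ is the final-layer representation of the last token, $W_y$ is the task head, and $\lambda$ is the weighting coefficient:

$$P(y \mid x^1,\dots,x^m) = \mathrm{softmax}\big(h_l^m W_y\big),\qquad
L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1,\dots,x^m),\qquad
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda\, L_1(\mathcal{C})$$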
GPT-2
The core idea of GPT-2 can be summarized as: any supervised task is a subset of language modeling. When the model has enough capacity and the data is rich enough, training the language model alone can cover other supervised tasks. GPT-2 therefore made few structural changes to the GPT-1 network; it simply used more parameters and a larger dataset, aiming to train a model with stronger generalization ability.
On 7 of 8 language modeling tasks, GPT-2 surpassed the state-of-the-art methods of the time through zero-shot learning alone (though on some tasks it still fell short of supervised models). GPT-2's biggest contribution was verifying that a model trained on massive data with a large number of parameters can transfer to other categories of tasks without additional training.
GPT-2 also showed that as model capacity and the amount (and quality) of training data grow, the model's potential still has room to develop further. GPT-3 was born from this idea.
GPT-3
The model structure is again essentially unchanged; GPT-3 simply scales up model capacity and the volume and quality of the training data. It is known for its enormous size, and its results are very strong.
Summary
From GPT-1 to GPT-3, as model capacity and the amount of training data increase, the language knowledge learned by the model becomes richer, and the natural language processing paradigm has gradually shifted from "pre-train, then fine-tune" to "pre-train, then zero-shot/few-shot learning". GPT's drawback is its unidirectional language model; BERT has shown that a bidirectional language model improves results.
3.2.3 XLNet
XLNet introduces bidirectional contextual information through a permutation language model. Because it does not introduce special tokens, it avoids the mismatch in token distribution between the pre-training and fine-tuning phases. It also uses Transformer-XL as its backbone, which works better on long texts.
Permutation language model
The goal of the permutation language model is to maximize the expected log-likelihood over all factorization orders, where Z_T denotes the set of all possible permutations of a length-T text sequence:
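In the XLNet paper's notation, with z a factorization order sampled from $\mathcal{Z}_T$:

$$\max_{\theta}\ \ \mathbb{E}_{z \sim \mathcal{Z}_T}\left[\sum_{t=1}^{T} \log p_{\theta}\big(x_{z_t} \mid x_{z_{<t}}\big)\right]$$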
Two-stream self-attention mechanism
The purpose of the two-stream self-attention mechanism (Two-Stream Self-Attention) is to implement the permutation language model while still feeding the text sequence in its normal order, by modifying the Transformer's attention:
- Content representation: encodes both the context and the content of the current token itself.
- Query representation: encodes only the context and the position of the current token, not its content.
In this way the model can use the position information of the word to be predicted without seeing its content.
Downstream application
In downstream tasks, the query representation is not needed, and no permutation mask is required.
3.3 Autoencoding Language Model
3.3.1 BERT
Masked language model
The masked language model (MLM) randomly masks some of the input words and then predicts them from the context. MLM suffers from a mismatch between pre-training and fine-tuning, because the [MASK] token never appears during fine-tuning. To mitigate this, BERT does not always replace a selected word-piece token with the actual [MASK] token. The training data generator randomly selects 15% of the tokens and then (a code sketch of the procedure follows the list):
- 80% probability: replaces them with the [MASK] token.
- 10% probability: replace with a random token from the vocabulary list.
- 10% probability: token remains unchanged.
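A minimal sketch of this masking procedure on whitespace-split tokens (the toy vocabulary and helper names are illustrative, not BERT's actual data pipeline):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Apply BERT-style masking: select ~15% of tokens, then 80/10/10 replacement."""
    tokens = list(tokens)
    labels = [None] * len(tokens)            # prediction targets (None = not selected)
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:     # ~85% of positions are left untouched
            continue
        labels[i] = tok                      # the model must predict the original token
        r = random.random()
        if r < 0.8:                          # 80%: replace with [MASK]
            tokens[i] = MASK
        elif r < 0.9:                        # 10%: replace with a random vocabulary token
            tokens[i] = random.choice(VOCAB)
        # remaining 10%: keep the token unchanged
    return tokens, labels

print(mask_tokens("the cat sat on the mat".split()))
```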
In native BERT, individual word-piece tokens are masked; whole words or phrases (N-grams) can also be masked instead.
Next sentence prediction
Next sentence prediction (NSP): when sentences A and B are selected as a pre-training sample, B is the actual next sentence of A 50% of the time and a random sentence from the corpus 50% of the time.
Input layer
The input representation of each token is the sum of its token embedding, segment embedding, and position embedding.
Model structure
The classic "pre-training model fine-tuning" Paradigm,theme structure is stacked multi-layer Transformers.
3.3.2 RoBERTa
RoBERTa (Robustly Optimized BERT Pretraining Approach) does not change BERT drastically; it carefully experiments with each of BERT's design choices to find room for improvement:
- Dynamic masking: the original approach fixes the mask when the dataset is built; the improved approach re-samples the mask each time a sequence is fed to the model during training, which increases data diversity.
- Dropping the NSP task: experiments showed that removing NSP improves performance on most tasks.
- More training data, larger batches, and more pre-training steps.
- Larger vocabulary: a byte-level BPE vocabulary replaces the character-level BPE (WordPiece) vocabulary, leaving almost no out-of-vocabulary words.
3.3.3 ALBERT
BERT has a large number of parameters; the main goal of ALBERT (A Lite BERT) is to reduce them:
- Factorized embedding parameterization: in BERT the word-embedding dimension is tied to the hidden dimension, yet word embeddings are context-independent while the Transformer layers must learn rich contextual information, so the hidden dimension should be allowed to be much larger than the embedding dimension. When the hidden size is increased for better performance, the embedding size does not need to grow with it, because the embedding space may already be sufficient for the information it needs to encode. Solution: keep a small embedding dimension and project the word vectors up to the hidden dimension H through a fully connected layer (see the worked parameter count after this list).
- Cross-layer parameter sharing: Transformer blocks in different layers share parameters.
- Sentence-order prediction (SOP): replaces NSP, forcing the model to learn subtle semantic differences and discourse coherence.
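A worked example of the factorization's effect on the embedding parameter count, assuming a vocabulary of V = 30,000, hidden size H = 768, and embedding size E = 128 (illustrative values):

$$V \times H = 30000 \times 768 \approx 23.0\text{M}
\qquad\longrightarrow\qquad
V \times E + E \times H = 30000 \times 128 + 128 \times 768 \approx 3.9\text{M}$$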
3.4 Generative Adversarial - ELECTRA
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) introduces a generator-discriminator setup: the generative masked language model (MLM) pre-training task is replaced by a discriminative replaced token detection (RTD) task, in which the model judges whether each token in the input has been replaced by the generator; the idea is similar to that of GANs.
The generator predicts the token at the mask position in the input text:
The input of the discriminator is the output of the generator, and the discriminator predicts whether the words at each position have been replaced:
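In the ELECTRA paper's formulation, the generator is trained with the MLM loss and the discriminator with a per-position binary classification (replaced token detection) loss; the combined objective over the corpus, with weighting coefficient $\lambda$, is:

$$\min_{\theta_G,\,\theta_D}\ \sum_{x \in \mathcal{X}} \Big( \mathcal{L}_{\mathrm{MLM}}(x, \theta_G) + \lambda\, \mathcal{L}_{\mathrm{Disc}}(x, \theta_D) \Big)$$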
In addition, some optimizations have been made:
- The generator and the discriminator are each a BERT-style model; the generator's size is scaled down relative to the discriminator.
- Factorized word-embedding parameters (as in ALBERT's factorized embedding parameterization).
- Parameter sharing between the generator and the discriminator: the input-layer parameters are shared, including the word embedding matrix and the position embedding matrix.
In downstream tasks, only the discriminator is used; the generator is discarded.
3.5 Long Text Processing - Transformer-XL
A common strategy for the Transformer to process long text is to split the text into fixed-length segments and encode each segment independently, with no information flowing between segments.
In order to optimize the modeling of long text, Transformer-XL uses two technologies: Segment-Level Recurrence with State Reuse and Relative Positional Encodings.
3.5.1 Segment-Level Recurrence with State Reuse
During training, Transformer-XL also takes fixed-length segments as input. The difference is that the hidden states of the previous segment are cached and reused when the current segment is computed, which gives Transformer-XL the ability to model longer-term dependencies.
Let s_τ and s_{τ+1} be two consecutive segments of length L, and let h_τ^n ∈ R^{L×d} denote the hidden states of segment τ at layer n, where d is the hidden dimension. The hidden states for segment τ+1 are computed as follows:
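Following the Transformer-XL paper, SG(·) denotes stop-gradient and [· ∘ ·] concatenation along the length dimension; note that only the keys and values attend over the cached states, while the query comes from the current segment only:

$$\tilde{h}_{\tau+1}^{\,n-1} = \left[\mathrm{SG}\big(h_{\tau}^{\,n-1}\big) \circ h_{\tau+1}^{\,n-1}\right]$$

$$q_{\tau+1}^{\,n} = h_{\tau+1}^{\,n-1} W_q^\top,\qquad
k_{\tau+1}^{\,n} = \tilde{h}_{\tau+1}^{\,n-1} W_k^\top,\qquad
v_{\tau+1}^{\,n} = \tilde{h}_{\tau+1}^{\,n-1} W_v^\top$$

$$h_{\tau+1}^{\,n} = \mathrm{Transformer\text{-}Layer}\big(q_{\tau+1}^{\,n},\, k_{\tau+1}^{\,n},\, v_{\tau+1}^{\,n}\big)$$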
Another benefit of segment-level recurrence is faster inference. A vanilla Transformer decoding autoregressively can only advance one time step at a time, whereas Transformer-XL directly reuses the representations of previous segments instead of recomputing them from scratch, allowing inference to proceed segment by segment.
3.5.2 Relative position encoding
In the vanilla Transformer with absolute position encodings, the attention score between positions i and j expands into four terms involving the word embeddings and the position encodings. Transformer-XL makes three changes to this expansion (the resulting formula is shown after the lists below):
- Change 1: the key projection matrix is split into two, W_{k,E} for content (word embeddings) and W_{k,R} for positions, so that the input sequence and the position encodings no longer share projection weights.
- Change 2: the absolute position encoding of the key is replaced by a relative position encoding R_{i-j}.
- Change 3: two new learnable vectors u and v are introduced to replace the query-side position terms; the query position vector is thus the same for all query positions, i.e. the attention bias toward different words and positions does not depend on where the query is.
After the improvement, the meaning of each term (labeled (a)-(d) in the formula after this list):
- (a) Content-based addressing: the correlation between the content of the query and the content of the key.
- (b) Content-dependent positional bias: the correlation between the content of the query and the relative position encoding of the key.
- (c) Global content bias: the correlation between the (position-independent) query bias u and the content of the key.
- (d) Global positional bias: the correlation between the (position-independent) query bias v and the relative position encoding of the key.
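The relative-position attention score from the Transformer-XL paper, with the four terms labeled to match the list above; $E_{x_i}$ is the embedding of token i, $R_{i-j}$ the relative position encoding, and u, v the two learnable vectors from Change 3:

$$A_{i,j}^{\mathrm{rel}} =
\underbrace{E_{x_i}^\top W_q^\top W_{k,E}\, E_{x_j}}_{(a)}
+ \underbrace{E_{x_i}^\top W_q^\top W_{k,R}\, R_{i-j}}_{(b)}
+ \underbrace{u^\top W_{k,E}\, E_{x_j}}_{(c)}
+ \underbrace{v^\top W_{k,R}\, R_{i-j}}_{(d)}$$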
3.6 Knowledge Distillation - DistilBERT
DistilBERT's student model:
- A six-layer BERT, with the token-type embeddings (i.e., segment embeddings) removed.
- Initialized with the first six layers of the teacher model.
- Trained only with the masked language model; the NSP task is not used.
Teacher model: BERT-base.
Loss function:
- Supervised MLM loss: the cross-entropy loss from masked language model training, where y_i denotes the label for the i-th class and s_i the probability the student model assigns to the i-th class.
- Distillation MLM loss: the output probabilities of the teacher model are used as soft targets, and a cross-entropy loss is computed against the student's probabilities, where t_i denotes the teacher model's probability for the i-th class.
- Hidden-vector cosine loss: aligns the directions of the teacher's and the student's hidden vectors, shrinking the teacher-student distance in the hidden-state space, where h^T and h^S denote the last-layer hidden outputs of the teacher and the student respectively.
- Final loss: a weighted combination of the three losses above, sketched below.
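A sketch in the notation above; the weighting coefficients $\alpha$, $\beta$, $\gamma$ are hyperparameters whose values are not specified here:

$$\mathcal{L}_{mlm} = -\sum_i y_i \log s_i,\qquad
\mathcal{L}_{ce} = -\sum_i t_i \log s_i,\qquad
\mathcal{L}_{cos} = 1 - \cos\big(h^T, h^S\big)$$

$$\mathcal{L} = \alpha\,\mathcal{L}_{ce} + \beta\,\mathcal{L}_{mlm} + \gamma\,\mathcal{L}_{cos}$$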
