
Breaking the Impossible Triangle and competing with 540-billion-parameter models: the IDEA Fengshenbang team achieves zero-shot learning SOTA with only 200 million parameters

Apr 09, 2023, 01:31 PM

Since GPT-3 demonstrated the power of hundred-billion-parameter models, NLP has faced an impossible triangle of model scale, sample efficiency, and fine-tuning performance. Can a language model with fewer than 1 billion parameters achieve SOTA few-shot (or even zero-shot) performance as well as SOTA fine-tuning performance? Do we really need hundreds of billions of parameters, and unstable prompts, to handle zero-shot scenarios? In this article, the IDEA Research Institute Fengshenbang team introduces a new model "phenotype", UniMC, which achieves zero-shot SOTA with only 200 million parameters. The related work has been accepted at EMNLP 2022.

An article [1] published this year pointed out that since pre-training was proposed, the NLP field has faced an impossible triangle (Figure 1 below): a single model cannot simultaneously satisfy all three of the following:

  1. Moderate model size (under 1 billion parameters);
  2. SOTA few-shot (or even zero-shot) performance;
  3. SOTA fine-tuning performance.

Figure 1

The impossible triangle exists because current pre-trained models only show strong few/zero-shot performance through prompt learning once their parameter count reaches a certain order of magnitude.

The paper our Fengshenbang team recently published at EMNLP 2022, "Zero-Shot Learners for Natural Language Understanding via a Unified Multiple Choice Perspective", breaks this "curse" and provides a flexible and efficient solution. The UniMC proposed in the paper has a very small number of parameters (only hundreds of millions) and SOTA fine-tuning ability, while also achieving few/zero-shot performance that is SOTA (comparable to the 540-billion-parameter PaLM).


  • Paper address: https://arxiv.org/abs/2210.08590
  • Model open source address: https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/unimc/
Technical Background

The introduction of BERT in 2018 marked the NLP field's entry into the pre-training era and pushed NLP a big step forward. Existing pre-trained masked language models (PMLMs) such as DeBERTa can already achieve fine-tuning SOTA with fewer than 1 billion parameters, but they are weak on NLU tasks in zero-shot scenarios.

The reason is that when using a PMLM, we need to add a task-specific MLP layer on top, as shown in Figure 2(c). This MLP layer introduces additional parameters, which in a zero-shot scenario can only be randomly initialized, so there is no way to obtain reasonable outputs. Moreover, in the fine-tuning scenario, the MLP layer also prevents transfer between different tasks (for example, weights cannot be transferred between a 2-class and a 3-class task).
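This limitation can be made concrete with a minimal PyTorch sketch (not the paper's code; the model name and class are illustrative): the classification head's output size is tied to the label count, so it cannot be pre-trained and cannot be shared across tasks with different numbers of labels.

```python
import torch.nn as nn
from transformers import AutoModel

class PMLMClassifier(nn.Module):
    def __init__(self, backbone_name: str, num_labels: int):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        # Task-specific head: randomly initialized, hence useless in zero-shot.
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] representation
        return self.head(cls)               # logits over num_labels

# A 2-class head and a 3-class head have incompatible shapes, so
# fine-tuned weights cannot be transferred between the two tasks.
binary_model = PMLMClassifier("albert-base-v2", num_labels=2)
ternary_model = PMLMClassifier("albert-base-v2", num_labels=3)
```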

For zero-shot scenarios, the mainstream approach in recent years has been to use pre-trained language models (PLMs) with tens or even hundreds of billions of parameters and to uniformly convert NLU tasks into text-generation tasks, so that large models can handle zero-shot tasks through manually constructed prompts or manually designed verbalizers, as shown in Figure 2(a). Going further, the FLAN paper uses a large number of manually constructed templates to unify different tasks, so that knowledge from other tasks can be transferred to a specific task, as shown in Figure 2(b). However, such generative models have the following shortcomings:

  • The generative model needs a verbalizer (label description), which is usually written by hand; different verbalizers can lead to large performance differences;
  • Prompts also need to be designed manually, and different prompts greatly affect downstream performance;
  • At inference time, a generative model produces answers autoregressively, which is slow; it is also generally unidirectional and cannot use bidirectional information the way BERT does;
  • To guarantee few/zero-shot performance, generative models tend to be very large, reaching the 175 billion parameters of GPT-3 or the 540 billion of PaLM;
  • Although FLAN's instruction tuning can transfer knowledge from other tasks to a specific task, it requires retraining for each new task: to evaluate on task A you train on B, C, D, and E; to evaluate on B you train on A, C, D, and E.

Our UniMC method, shown in Figure 2(d), avoids the above problems and achieves SOTA, or performance comparable to state-of-the-art models, on several Chinese and English tasks.


Figure 2

UniMC (a new model phenotype)

Model Ideas

Most NLU tasks are label-based, yet a generative model has to generate the labels, which undoubtedly increases the task's difficulty and the model's learning cost. For many label-based tasks, it is usually enough to take the input text and output the probability that it belongs to each label. Based on this idea, we transform NLU tasks into multiple-choice (MC) tasks: given the text, a question, and the options, output the probability of each option without generating the options.

Based on this, we propose a new concept: the phenotype of a model. Existing model phenotypes always attach an extra layer afterwards, such as a classification layer; alternatively, the phenotype of generative models like GPT is to mine the model's knowledge through prompts. The UniMC scheme we propose introduces no additional layers into the PMLM and explores another phenotype of the PMLM.

In this paper, we choose ALBERT as our backbone PMLM network.

Uniform multiple choice format

As shown in Figure 3, we hope to convert all label-based NLU tasks into a unified MC (Multiple-Choice) format. Our philosophy is to add as little human information as possible.

Figure 3

Specifically, we take the following two steps:

  • Change labels into options;
  • Choose whether to add a question prompt (the question usually comes from the dataset's description).

Advantage: only one option prompt is designed, plus at most one question prompt (a sketch of this conversion is given below).
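A minimal sketch of this conversion, with illustrative option texts and question prompts (the released preprocessing code may differ):

```python
def to_multiple_choice(passage: str, labels: list, question: str = "") -> dict:
    """Turn one label-based NLU example into the unified MC format."""
    return {
        "options": labels,     # step 1: labels become options
        "question": question,  # step 2: optional question prompt
        "passage": passage,
    }

# Sentiment classification: no question prompt needed.
mc_sample = to_multiple_choice(
    passage="The film was a delight from start to finish.",
    labels=["It was bad.", "It was good."],
)

# NLI: the question prompt comes from the task description.
nli_sample = to_multiple_choice(
    passage="Premise: A man is playing guitar. Hypothesis: A man is making music.",
    labels=["entailment", "neutral", "contradiction"],
    question="What is the relation between the premise and the hypothesis?",
)
```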

Model structure

The structure of UniMC is shown in Figure 4 below. It uses an auto-encoding structure similar to BERT. The main flow is: we first unify the inputs of different tasks and restrict how information flows within the input; after the PMLM, we use O-MLM, OP, and MLM for MC training, and finally use O-MLM and OP for zero-shot prediction. Below I break down our solution step by step.


Figure 4

Input

As shown in the red solid-line box in Figure 5, the input needs to be processed into UniMC's own token format before being fed to the model. To improve computational efficiency, we directly concatenate all options with the question and the passage, i.e. [Options, Question, Passage]. We also insert a special token, [O-MASK], in front of each option to indicate yes or no (whether the option is selected or not). (Note: to improve reusability, we reuse the [MASK] token.)
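A minimal sketch of this input layout, assuming a Hugging Face tokenizer and reusing [MASK] as [O-MASK] (not the released code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
O_MASK = tokenizer.mask_token  # [O-MASK] reuses the existing [MASK] token

def build_unimc_input(options, question, passage):
    # Layout: [O-MASK] option_1 [O-MASK] option_2 ... question passage
    option_part = " ".join(f"{O_MASK} {opt}" for opt in options)
    text = f"{option_part} {question} {passage}".strip()
    return tokenizer(text, return_tensors="pt")

enc = build_unimc_input(
    options=["It was bad.", "It was good."],
    question="What is the sentiment?",
    passage="The film was a delight from start to finish.",
)
```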

As shown in the green dotted box in Figure 5, we need to consider that the input contains several sources of information: option information, question information, and passage information. These sources influence one another, so we want to isolate them. For example, if the model can see the other options while encoding one option, the question becomes easier and the model becomes lazy.

So we make the following three adjustments (a code sketch follows the list):

  • Use segment IDs to tell the model that option information and context (question, passage) information are different;
  • Modify the position IDs so that the model treats the positional information of different options equally;
  • Modify the attention-mask matrix to prevent the model from seeing information from other options, which would make it lazy.
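A minimal sketch of these three adjustments; the actual implementation in the Fengshenbang-LM repository may differ in details:

```python
import torch

def build_isolation_inputs(option_lens, context_len):
    """option_lens: token count of each option; context_len: question + passage."""
    num_option_tokens = sum(option_lens)
    total = num_option_tokens + context_len

    # 1) Segment IDs: options get segment 0, question + passage get segment 1.
    segment_ids = torch.cat([
        torch.zeros(num_option_tokens, dtype=torch.long),
        torch.ones(context_len, dtype=torch.long),
    ])

    # 2) Position IDs restart for every option, so all options share the
    #    same positional treatment.
    positions = []
    for n in option_lens:
        positions.extend(range(n))
    offset = max(option_lens)
    positions.extend(range(offset, offset + context_len))
    position_ids = torch.tensor(positions, dtype=torch.long)

    # 3) Attention mask: block attention between different options; options
    #    and context can still attend to each other.
    attention_mask = torch.ones(total, total)
    spans, start = [], 0
    for n in option_lens:
        spans.append((start, start + n))
        start += n
    for i, (si, ei) in enumerate(spans):
        for j, (sj, ej) in enumerate(spans):
            if i != j:
                attention_mask[si:ei, sj:ej] = 0
    return segment_ids, position_ids, attention_mask

seg, pos, attn = build_isolation_inputs(option_lens=[4, 4], context_len=12)
```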


Figure 5

How does the model do multiple choice questions? (O-MLM and OP)

As shown in Figure 6, we use the O-MLM and OP tasks to let the model "select" the answer. [O-MASK] is fully inherited from the [MASK] token (specifically, to avoid adding parameters and to make full use of the knowledge the model learned during unsupervised pre-training, we reuse the parameters of the MaskLM head). The only difference is that it is masked 100% of the time. The goal of the O-MLM task is to decode [O-MASK] into 'yes' or 'no', which predicts whether the option is selected.

The role of the OP task is to predict the answer from the 'yes' of each option. Specifically, we take the logit of 'yes' at each [O-MASK] output, apply a softmax over them to obtain the probability of each option, and choose the option with the highest probability as the predicted answer.
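A minimal sketch of this OP step, assuming the MLM-head logits are already computed (names and shapes are illustrative, not the released code):

```python
import torch

def option_prediction(mlm_logits, o_mask_positions, yes_token_id):
    """mlm_logits: (seq_len, vocab_size) MLM-head output for one example."""
    # O-MLM decodes each [O-MASK] into 'yes' / 'no'; OP compares the 'yes'
    # logits across options.
    yes_logits = torch.stack([mlm_logits[p, yes_token_id] for p in o_mask_positions])
    probs = torch.softmax(yes_logits, dim=-1)   # one probability per option
    return probs.argmax().item(), probs

# Toy example: 2 options, a 30-token vocabulary, and 'yes' at id 7.
logits = torch.randn(20, 30)
answer, probs = option_prediction(logits, o_mask_positions=[0, 5], yes_token_id=7)
```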


Figure 6

Processing multiple MC tasks in one Batch

As shown in Figure 7, we want to put multiple MC datasets into a single batch, which strengthens the model and makes the approach more unified. When building such batches, we ran into a problem: what if the samples in a batch have different numbers of options?

So we designed a logit mask applied before the output: by directly adding a negative-infinity value to the logits of irrelevant tokens, we eliminate their influence on the [O-MASK] positions when computing the softmax. In this way, multiple-choice questions with different numbers of options can be handled uniformly in one batch.
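A minimal sketch of the logit mask, assuming samples are padded to the same number of options (not the released code):

```python
import torch

def masked_option_softmax(yes_logits, num_options):
    """
    yes_logits: (batch, max_options) 'yes' logits at each [O-MASK]
    num_options: (batch,) true number of options per sample
    """
    max_opts = yes_logits.size(1)
    slot = torch.arange(max_opts).unsqueeze(0)     # (1, max_options)
    padded = slot >= num_options.unsqueeze(1)      # True where the slot is padding
    masked = yes_logits.masked_fill(padded, float("-inf"))
    return torch.softmax(masked, dim=-1)

# A 2-option sample and a 3-option sample batched together.
logits = torch.tensor([[1.2, 0.3, 0.0], [0.1, 0.7, 0.4]])
probs = masked_option_softmax(logits, num_options=torch.tensor([2, 3]))
# probs[0, 2] == 0: the padded slot of the first sample gets zero probability.
```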

Figure 7

Model training and prediction

MC Training

Unlike FLAN's instruction tuning, we train only on MC datasets. This is mainly to let the model learn how to answer multiple-choice questions, and MC datasets have a certain generality: for example, different datasets may contain different numbers of labels.

Figure 8

Zero-shot Inference

Interestingly, the O-MLM and OP tasks are used consistently in both the training stage and the zero-shot inference stage. Since we abandon the classification layer, all parameters can be reused, which activates the zero-shot capability of the PMLM.

Figure 9

UniMC Performance

English scenario

We collected 14 multiple-choice tasks for pre-training and then tested zero-shot performance on other NLU tasks. On 4 NLI tasks, UniMC achieves SOTA and surpasses the 540-billion-parameter PaLM model.


Figure 10

On classification tasks, we also beat networks with GPT-2 and GPT-3 backbones. On the very difficult DBpedia task, with as many as 13 categories, UniMC even reaches an accuracy of 88.9%.


Figure 11

To explore UniMC's generalization, we compared it with FLAN. As can be seen, UniMC surpasses or comes close to FLAN on almost all tasks.

Figure 12

Chinese scenario

In the Chinese scenario, we collected 40 supervised datasets, unified them into the MC task format to pre-train the UniMC model, and then tested on 9 tasks from FewCLUE and ZeroCLUE. As of August 30, 2022, UniMC ranked first on both the FewCLUE and ZeroCLUE leaderboards (Erlangshen-UnifiedMC in the figures is UniMC).

Figure 13

Figure 14

Summary

We proposed a novel solution to NLU tasks in the zero-shot scenario: using only hundreds of millions of parameters, it beats complex large models with a thousand times as many parameters.

In addition, we introduce almost no human-crafted information, and we overcome the inconsistency between pre-training and fine-tuning in BERT-style models: our training and prediction procedures are consistent. We can even train once and run zero-shot prediction many times, which greatly saves compute. Currently, the IDEA Fengshenbang team has released more than 70 pre-trained large models.

  • Model: https://huggingface.co/IDEA-CCNL
  • Fengshenbang overview paper (bilingual, Chinese and English): https://arxiv.org/abs/2209.02970
  • Fengshenbang homepage: https://github.com/IDEA-CCNL/Fengshenbang-LM

References

[1] Impossible Triangle: What's Next for Pre-trained Language Models? https://readpaper.com/paper/4612531641570566145
