ICML 2024 high score! Reworking attention lets small models rival models twice their size


WBOY

Jun 10, 2024 pm 08:18 PM

AI small model DCMHA

Improving attention, the Transformer's core mechanism, lets small models match models twice their size!

In a high-scoring ICML 2024 paper, the Caiyun Technology team built the DCFormer framework, which replaces the multi-head attention module (MHA), a core Transformer component, with the proposed dynamically composable multi-head attention (DCMHA).

DCMHA removes the fixed binding between the search-selection circuit and the transformation circuit of each MHA attention head, allowing them to be combined dynamically based on the input, which fundamentally improves the model's expressive power.

Where each layer previously had a fixed set of H attention heads, DCMHA can dynamically combine up to HxH attention heads on demand, using almost the same parameter count and compute.

DCMHA is plug-and-play: it can replace MHA in any Transformer architecture, yielding DCFormer, a new general, efficient, and scalable architecture.


This work was jointly completed by researchers from Beijing University of Posts and Telecommunications and AI startup Caiyun Technology.

DCPythia-6.9B, a model the researchers built on DCFormer, outperforms the open-source Pythia-12B in both pre-training perplexity and downstream task evaluation.

DCFormer models match the performance of Transformer models that require 1.7-2 times as much compute.


What are the limitations of the multi-head attention module?

The scaling laws of large models tell us that as compute grows, models become larger and are trained on more data, and their performance keeps improving. Although no one can say for sure how high the ceiling of this path is or whether it leads to AGI, it is indeed the most common approach at present.

Beyond this, another question is also worth thinking about: most of today's large models are based on the Transformer and are stacked up from Transformer blocks like building bricks. How much room for improvement is there in the Transformer block itself?

This is the basic question to be answered in model structure research, and it is also the starting point of the DCFormer work jointly completed by Caiyun Technology and Beijing University of Posts and Telecommunications.

In the Transformer's multi-head attention module (MHA), each attention head works completely independently of the others.

This design has been very successful in practice because it is simple and easy to implement. However, it also has drawbacks: the attention score matrices become low-rank, which weakens expressive power, and attention heads learn repetitive, redundant functions, which wastes parameters and compute. For this reason, several research works in recent years have tried to introduce some form of interaction between attention heads.

According to the Transformer circuits theory, in MHA the behavior of each attention head is described by four weight matrices, W^Q, W^K, W^V, and W^O (where W^O is obtained by slicing MHA's output projection matrix).

Among them, W^QW^K is called the QK circuit (or search-selection circuit); it determines which token(s) in the context the current token attends to. For example:

[Image: example of a QK circuit]

W^OW^V is called the OV circuit (or projection-transformation circuit); it determines what information is retrieved from the attended token (or which attributes are projected) and written into the residual stream at the current position, which is then used to predict the next token. For example:

[Image: example of an OV circuit]
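To make the two circuits concrete, here is a minimal sketch of a single causal attention head written so that the QK and OV circuits appear as explicit matrices. It is an illustration, not the paper's code: the helper names (`matmul`, `transpose`, `head_output`), the row-vector convention (under which the products read W^Q(W^K)^T and W^V W^O), and the toy shapes are all assumptions.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def head_output(X, W_q, W_k, W_v, W_o):
    """One causal attention head as a QK circuit followed by an OV circuit.

    X: (T, d_model) token representations.
    W_q, W_k, W_v: (d_model, d_head); W_o: (d_head, d_model).
    """
    d_head = len(W_q[0])
    # QK circuit: decides WHERE to look (which earlier tokens to attend to).
    qk = matmul(W_q, transpose(W_k))               # (d_model, d_model)
    scores = matmul(matmul(X, qk), transpose(X))   # (T, T)
    # OV circuit: decides WHAT is written into the residual stream.
    ov = matmul(W_v, W_o)                          # (d_model, d_model)
    values = matmul(X, ov)                         # (T, d_model)
    out = []
    for i, row in enumerate(scores):
        # Causal mask: position i only sees positions j <= i.
        weights = softmax([s / math.sqrt(d_head) for s in row[: i + 1]])
        out.append([sum(w * values[j][c] for j, w in enumerate(weights))
                    for c in range(len(values[0]))])
    return out
```

In this form, the head is fully determined by the pair (`qk`, `ov`), and standard MHA always uses the two matrices that belong to the same head together.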

The researchers noticed that search (where to get information from) and transformation (what to get) are two independent things, and that it should be possible to specify them separately and combine them freely on demand (just as, in a SQL query, the selection conditions after WHERE and the attribute projection after SELECT are written separately). MHA, however, forces them to be "bundled" into the QKOV of a single attention head, which limits flexibility and expressiveness.

For example, suppose a model has attention heads A, B, and C whose QK and OV circuits can handle the examples above. If the input is changed to:

[Image: modified example]

the model must cross-combine the QK and OV circuits of its existing attention heads, and it may "fail to turn the corner" (as verified on a synthetic test set that the researchers systematically constructed).

What does dynamically composable multi-head attention look like?

Starting from this observation, the research team introduced a compose operation into MHA:

[Image: definition of the compose operation]

This yields DCMHA, as shown in the figure below:

△Figure 1. Overall structure of DCMHA

The attention score matrix A^S and the attention weight matrix A^W, computed from QW^Q and KW^K, are each linearly mapped along the num_heads dimension to obtain a new matrix A' before being multiplied with VW^V. Different linear mapping matrices (composition maps) realize different combinations of attention heads.

For example, in Figure 2(c), the QK circuits of heads 3 and 7 are combined with the OV circuit of head 1 to form a "new" attention head.
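The mapping along the head dimension can be sketched with a fixed (static) composition map. The snippet below is only an illustration under stated assumptions: the function name `compose_static`, the 1-based reading of the head numbers, and the 0.5 mixing weights are hypothetical and are not taken from Figure 2.

```python
def compose_static(A, W):
    """Mix attention matrices across the head dimension.

    A: list of H attention matrices, each of shape (T, S).
    W: (H, H) composition map; W[h][g] is how much of head h's attention
       pattern flows into the pattern later used with head g's OV circuit.
    """
    H, T, S = len(A), len(A[0]), len(A[0][0])
    return [[[sum(A[h][i][j] * W[h][g] for h in range(H)) for j in range(S)]
             for i in range(T)] for g in range(H)]

# Toy setup: 8 heads, one query position, two key positions.
H = 8
A = [[[0.25, 0.75]] for _ in range(H)]
A[2] = [[1.0, 0.0]]   # "head 3" (1-based) attends to the first key
A[6] = [[0.0, 1.0]]   # "head 7" (1-based) attends to the second key

# Start from the identity map, which leaves every head unchanged ...
W = [[1.0 if h == g else 0.0 for g in range(H)] for h in range(H)]
# ... then route the patterns of heads 3 and 7 into head 1 (hypothetical weights).
W[0][0] = 0.0
W[2][0] = 0.5
W[6][0] = 0.5

mixed = compose_static(A, W)
print(mixed[0])  # [[0.5, 0.5]]   blend of heads 3 and 7, feeding head 1's OV circuit
print(mixed[3])  # [[0.25, 0.75]] untouched head
```

Because only the attention pattern is remixed, each head's value and output projections stay as they are; the "new" head arises purely from which patterns are routed to which OV circuit.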

△Figure 2. Simplified illustration of typical composition maps for 8 attention heads; lighter colors represent larger values

To maximize expressive power, the researchers want the mapping matrix to be generated dynamically from the input, that is, to determine dynamically how the attention heads are combined.

However, they need to generate not one mapping matrix but one for every pair of a query Q_i at a source position and a key K_j at a destination position in the sequence. Generating a full matrix for every such pair would incur unacceptable computational overhead and memory usage.

To this end, they further decompose the mapping matrix into an input-independent static matrix W_b, a low-rank matrix w_1w_2, and a diagonal matrix Diag(w_g), which are respectively responsible for the basic combination, a restricted form of dynamic combination between attention heads (i.e., of rank R), and dynamic gating of the heads themselves (see Figure 2(d) and Figure 3(b)). The latter two matrices are dynamically generated from the Q matrix and the K matrix.

This reduces the computation and parameter complexity to an almost negligible level without sacrificing effectiveness (see the complexity analysis in the paper for details). Combined with implementation-level optimizations in JAX and PyTorch, DCFormer can be trained and run for inference efficiently.
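The saving from this decomposition can be illustrated for a single query-key pair. The sketch below assumes the three parts are summed into one mapping matrix, M = W_b + w_1 w_2^T + Diag(w_g), and applied to the length-H vector of per-head attention values for that pair. The function names are hypothetical, and the dynamic factors `w1`, `w2`, and `w_g` are simply passed in; how they are produced from Q and K is not shown here and should be taken from the paper.

```python
def compose_pair(a, W_b, w1, w2, w_g):
    """Apply M = W_b + w1 @ w2^T + Diag(w_g) to a, without building M.

    a:   length-H vector of attention values for one (query, key) pair.
    W_b: (H, H) static matrix shared by all pairs.
    w1, w2: (H, R) dynamic low-rank factors; w_g: length-H dynamic gate.
    """
    H, R = len(a), len(w1[0])
    base = [sum(a[h] * W_b[h][g] for h in range(H)) for g in range(H)]
    proj = [sum(a[h] * w1[h][r] for h in range(H)) for r in range(R)]        # H -> R
    low_rank = [sum(proj[r] * w2[g][r] for r in range(R)) for g in range(H)]  # R -> H
    gated = [a[g] * w_g[g] for g in range(H)]                                # per-head gate
    return [base[g] + low_rank[g] + gated[g] for g in range(H)]

def compose_pair_dense(a, W_b, w1, w2, w_g):
    """Reference version that materializes the full H x H matrix M."""
    H, R = len(a), len(w1[0])
    M = [[W_b[h][g]
          + sum(w1[h][r] * w2[g][r] for r in range(R))
          + (w_g[h] if h == g else 0.0)
          for g in range(H)] for h in range(H)]
    return [sum(a[h] * M[h][g] for h in range(H)) for g in range(H)]

# Toy values (H = 3 heads, rank R = 1), chosen only for illustration.
a = [1.0, 2.0, 3.0]
W_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w1 = [[1.0], [0.0], [1.0]]
w2 = [[0.0], [1.0], [0.0]]
w_g = [0.5, 0.0, -1.0]

print(compose_pair(a, W_b, w1, w2, w_g))        # [1.5, 6.0, 0.0]
print(compose_pair_dense(a, W_b, w1, w2, w_g))  # [1.5, 6.0, 0.0]
```

The dynamic part of the factored version needs only the 2HR + H numbers in `w1`, `w2`, and `w_g` per pair and on the order of HR multiply-adds, whereas a fully dynamic map would need H x H numbers per pair.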

△Figure 3. How Compose is computed

Scaling

To evaluate the quality of an architecture, the core metric researchers focus on is the efficiency of converting compute into intelligence (or the performance-to-compute ratio), that is, the model performance improvement gained per unit of compute invested: spending less compute to get a better model.

From the scaling-law curves in Figure 4 and Figure 5 (in logarithmic coordinates, the loss of each model architecture can be drawn as an approximately straight line as compute varies; the lower the loss, the better the model), it can be seen that DCFormer matches the performance of Transformer models trained with 1.7~2 times the compute, i.e., the compute-to-intelligence conversion rate is increased by 1.7~2 times.
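One way to read a "compute multiplier" off such curves is sketched below. It assumes both architectures follow a power law loss = a * C^(-b) with the same exponent b (parallel straight lines in log-log coordinates); the equal-slope assumption, the function names, and every number in the example are hypothetical and are not values from the paper.

```python
def loss(a, b, compute):
    """Power-law fit: a straight line in log-log coordinates."""
    return a * compute ** (-b)

def compute_multiplier(a_base, a_new, b):
    """Return k such that loss(a_base, b, k * C) == loss(a_new, b, C) for any C.

    Solving a_base * (k * C)**(-b) = a_new * C**(-b) gives
    k = (a_base / a_new) ** (1 / b).
    """
    return (a_base / a_new) ** (1.0 / b)

# Hypothetical fit coefficients, chosen so that the multiplier comes out as 2.
b = 0.05
a_transformer = 3.0
a_dcformer = a_transformer * 2.0 ** (-b)

k = compute_multiplier(a_transformer, a_dcformer, b)
C = 1e20  # arbitrary compute budget in FLOPs
print(round(k, 6))
print(loss(a_transformer, b, k * C), loss(a_dcformer, b, C))
```

Under these assumptions, a small vertical gap between the two lines corresponds to a large horizontal gap, which is why a modest loss improvement can be worth 1.7~2 times the compute.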

△Figure 4. Scaling curves of Transformer and DCFormer

△Figure 5. Scaling curves of Pythia and DCPythia

How do you understand this improvement?

Since the Transformer was introduced in 2017, GLU MLP and rotary position embedding (RoPE) have been two of the few architectural improvements that, from the perspective of the performance-to-compute ratio, have proven universally effective and been widely adopted in practice.

The architecture that adds these two improvements to the original Transformer is also called Transformer++; the strongest open-source models, such as Llama and Mistral, all use it. Whether applied to the Transformer or the Transformer++ architecture, DCMHA brings significant improvements.

At the 1.4B model scale, the improvement from DCMHA is greater than the sum of the two Transformer++ improvements, and it scales better (compare the blue-green line with the black line in Figure 4: the improvement from DCMHA decays more slowly as compute increases; also compare Figures 4 and 5).

It can be said that DCFormer takes Transformer's capabilities to a new level.

Downstream task evaluation

The research team trained two models, DCPythia-2.8B and DCPythia-6.9B, evaluated them on mainstream NLP downstream tasks, and compared them with the open-source Pythia models of the same scale (training used the same hyperparameter settings as Pythia).

△Table 1. Performance of DCFormer and Pythia in downstream tasks

As can be seen from Table 1, DCPythia-2.8B and DCPythia-6.9B not only have lower perplexity (ppl) on the Pile validation set, but also significantly outperform Pythia on most downstream tasks. DCPythia-6.9B even surpasses Pythia-12B in both ppl and average downstream-task accuracy.

DCFormer++2.8B improves further on DCPythia-2.8B, verifying the effectiveness of combining DCMHA with the Llama architecture.

Training and inference speed

Although introducing DCMHA brings additional training and inference overhead, Table 2 shows that DCFormer++ trains at 74.5%-89.2% of the speed of Transformer++ and runs inference at 81.1%-89.7% of its speed, and that the extra computational overhead gradually shrinks as the number of model parameters grows.

△Table 2. Comparison of training and inference speeds between Transformer++ and DCFormer++

Training speed was compared on a TPU v3 pod with a sequence length of 2048 and a batch_size of 1k; inference speed was evaluated on an A100 80G GPU with an input length of 1024 and a generation length of 128.

Ablation experiment

The results are as follows:

△Table 3. Ablation experiment of DCMHA

From Table 3, the following points can be seen:

Although adding static combination weights can reduce ppl, introducing dynamic combination weights can further reduce ppl, which illustrates the necessity of dynamic combination.
Low-rank dynamic combination performs better than dynamic gating.
The ppl obtained by using only query-wise or key-wise dynamic combination is very similar, and the gap with DCFormer++ is very small.
Doing attention head combination after softmax is more effective than doing it before softmax, probably because the probability after softmax can more directly affect the output.
The rank of the dynamic combination weight does not need to be set too large, which also illustrates the low rank of the combination weight.

In addition, the researchers also further reduced training and inference overhead by increasing the proportion of local attention layers and only using query-wise dynamic combination. See Table 10 of the paper for details.

In general, the research team has two conclusions.

On dynamic weights: recent SSM and linear attention/RNN work, such as Mamba, GLA, RWKV6, and HGRN, has caught up with Transformer++ by introducing dynamic (input-dependent) weights. DCFormer, by dynamically combining attention heads, shows that with softmax attention, introducing dynamic weights can also substantially improve on Transformer++.

On model architecture innovation: this work suggests that if there is an "ideal model architecture" with the ultimate efficiency in converting compute into intelligence, then the current Transformer architecture, powerful as it is, is probably still far from that ideal, and there remains vast room for improvement. Therefore, beyond the brute-force approach of stacking compute and data, innovation in model architecture also has great potential.

The research team also stated that Caiyun Technology will be the first to apply DCFormer in its products Caiyun Weather, Caiyun Xiaoyi, and Caiyun Xiaomeng.

For more research details, please refer to the original paper.

ICML2024 paper link: https://icml.cc/virtual/2024/poster/34047.
Arxiv paper link: https://arxiv.org/abs/2405.08553.
Code link: https://github.com/Caiyun-AI/DCFormer.
