Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity

Table of Contents

3.1 Attention mechanism from the perspective of network flow

3.2 Flow-Attention" >3.2 Flow-Attention

5. Analysis

6. Summary

Home

Technology peripherals

Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022

王林

Apr 16, 2023 pm 07:25 PM

network Model Tsinghua University

Task universality is one of the core goals of basic model research, and it is also the only way for deep learning research to lead to advanced intelligence. In recent years, thanks to the universal key modeling capabilities of the attention mechanism, Transformer has performed well in many fields and has gradually shown a trend of universal architecture. However, as the length of the sequence increases, the calculation of the standard attention mechanism exhibits quadratic complexity, which seriously hinders its application in long sequence modeling and large models.

To this end, a team from the School of Software, Tsinghua University deeply explored this key issue and proposed a task-universal linear complexity backbone network Flowformer, while maintaining the versatility of the standard Transformer. At the same time, its complexity was reduced to linear, and the paper was accepted by ICML 2022.

Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022

## Author list: Wu Haixu, Wu Jialong, Xu Jiehui, Wang Jianmin, Long Mingsheng

Link: https://arxiv.org/pdf/2202.06258.pdf

Code: https://github.com/thuml/ Flowformer

Compared with the standard Transformer, the Flowformer model proposed in this article has the following characteristics:

Linear complexity, can handle input sequences of thousands of lengths;
does not introduce new inductive preferences, maintaining the universality of the original attention mechanism Modeling ability;
Universal tasks, and achieved excellence in the five major tasks of long sequences, vision, natural language, time series, and reinforcement learning Effect.

1. Problem analysis

The standard attention mechanism input contains three parts: queries(), keys() and values(), and its calculation method As follows: where is the attention weight matrix, and the final calculation result will be obtained by weighted fusion. The computational complexity of the above process is. It is noted that there have been many studies on the problem of continuous multiplication of multinomial matrices in classical algorithms. In particular, for the attention mechanism, we can use the associative law of matrix multiplication to achieve optimization, for example, the original quadratic complexity can be reduced to linear. But the function in the attention mechanism makes it impossible to apply the associative law directly. Therefore, how to remove functions in the attention mechanism is the key to achieving linear complexity. However, much recent work has demonstrated that functions play a key role in avoiding trivial attentional learning. In summary, we look forward to a model design solution that achieves the following goals: (1) remove functions; (2) avoid trivial attention; (3) maintain the versatility of the model.

2. Motivation

In view of goal (1), in previous work, the kernel method is often used to replace the function, that is, through approximate attention calculation (for non- linear function), but removing it directly would cause trivial attention. To this end, for goal (2), previous work had to introduce some inductive preferences, which limited the versatility of the model , and therefore did not meet goal (3), such as the locality assumption in cosFormer.

Competition mechanism in Softmax

In order to meet the above goals, we analyze it based on the basic properties of . We note that it was originally proposed to extend the "winner-take-all" maximum operation into a differentiable form. Therefore, thanks to its inherent "competition" mechanism, it can differentiate the attention weights between various tokens, thereby avoiding ordinary attention problems. Based on the above considerations, we try to introduce the competition mechanism into the attention mechanism design, so as to avoid the trivial attention problems caused by kernel method decomposition.

Competition mechanism in network flow

We pay attention to the "Conservation"## in the classic network flow (Flow network) model in graph theory. #(Conservation) is an important phenomenon, that is, the inflow of each node is equal to the outflow. Inspired by "Fixed resources will inevitably cause competition", in this article, we try to re-analyze the information flow in the classic attention mechanism from the perspective of network flow, and convert competition through conservation properties Introduce attention mechanism design to avoid ordinary attention problems. 3. Flowformer

3.1 Attention mechanism from the perspective of network flow

Inside the attention mechanism: the flow of information can be expressed as: from

Source (source, corresponding) is gathered to sink (sink, corresponding) based on the learned flow capacity (flow capacity, corresponding attention weight).

Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022

Outside the attention mechanism, the information of the source (v) comes from the upper layer of the network, and the information of the sink (R) will also be provided to the feed-forward layer below.

Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022

3.2 Flow-Attention

Based on the above observations, we can from the inflow From the two perspectives of flow and outflow, we control the interaction between the attention mechanism and the external network to achieve "fixed resources", thereby causing competition within the source and sink respectively to avoid ordinary attention. Without loss of generality, we set the amount of interaction information between the attention mechanism and the external network to the default value 1.

Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022

(1) The inflow conservation of the sink (R):

is not difficult to obtain. Before conservation, for the th sink, the amount of information flowing in is: Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022 . In order to fix the amount of information flowing into each sink to unit 1, we introduce as a normalization in the calculation of the information flow (attention weight). After normalization, the inflow information amount of the th sink is: Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022

#At this time, due to the conservation of the inflow of the sink, there is natural competition between the various sources (V) Relationship, we calculate the amount of information provided by each source (V) at this time, and we can get: the amount of information provided by each source under competition, which also represents the importance of each source.

Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022

(2) Conservation of outflow from source (V): Similar to the aforementioned process, before conservation, for the source, the amount of information flowing out of it is: Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022 . In order to fix the amount of information flowing out of each source to unit 1, we will introduce the calculation of the information flow (attention weight) as a normalization. After normalization, the amount of outflow information from the jth source is: Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022 . At this time, due to the conservation of outflow from the source, there is a natural competition relationship between the sinks (). We calculate the amount of information received by each sink () at this time, and we can get: In the case of competition, the final required for each result is The amount of information received.

(3) Overall design

Based on the above results, we design the following Flow-Attention mechanism, specifically including competition (Competition), aggregation (Aggregation), and allocation (Allocation) three parts: Competition introduces the competition mechanism to highlight important information; Aggregation realizes linear complexity based on the matrix associative law; Allocation introduces the competition mechanism and transfers control to the next step. One layer of information. All operations in the above process have linear complexity. At the same time, the design of Flow-Attention only relies on the conservation principle in network flow and reintegrates information flow. Therefore, it does not introduce new inductive preferences, ensuring the versatility of the model. Flowformer is obtained by replacing the quadratic complexity Attention in the standard Transformer with Flow-Attention.

4. Experiments

This paper conducts extensive experiments on standard data sets:

covers Five major tasks: long sequence, vision, natural language, time series, and reinforcement learning;
examines two types of attention mechanisms: normal (Normal) and autoregressive tasks (Causal).
Covers input situations of various sequence lengths (20-4000).
Compares various baseline methods such as classic models in various fields, mainstream deep models, Transformer and its variants.

Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022

As shown in the table below, Flowformer performed well on all five tasks, verifying the versatility of the model. Please see the paper for detailed experimental results.

Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022

5. Analysis

In order to further explain the working principle of Flowformer, we conducted a visual experiment on the attention in the ImageNet classification task (corresponding to Flow-Attention), from which we can find:

If you only use the kernel method for decomposition, such as Linear Transformer, the model will be distracted and unable to effectively capture key areas;
Both classic Transformer and Flowformer can accurately capture the key positions of the image, but the latter has an advantage in computational complexity;
cosFormer introduces one-dimensional locality in the attention mechanism Hypothetically, the effect is outstanding on language tasks. But in images (unfolding 2D data into 1D sequences), it cannot be adapted to vision tasks without extending the locality assumption to two dimensions. This also confirms the advantage of the design method in Flowformer that "does not introduce new inductive preferences".

Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022

The above visualization shows that introducing competition into the attention mechanism design through Flow-Attention can effectively avoid trivial attention. More visualization experiments can be found in the paper.

6. Summary

The Flowformer proposed in this article introduces the conservation principle in network flow into the design, and naturally introduces the competition mechanism into the attention calculation, effectively avoiding It solves the trivial attention problem and maintains the versatility of the standard Transformer while achieving linear complexity. Flowformer has achieved excellent results in five major tasks: long sequence, vision, natural language, time series, and reinforcement learning. In addition, the design concept of "no special induction preference" in Flowformer is also inspiring to the research of general infrastructure. In future work, we will further explore the potential of Flowformer for large-scale pre-training.

The above is the detailed content of Common tasks! Tsinghua proposes backbone network Flowformer to achieve linear complexity | ICML2022. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

R.E.P.O. Best Graphic Settings

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows: Seashell Riddle Solution

2 weeks ago By DDD

R.E.P.O. How to Fix Audio if You Can't Hear Anyone

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

WWE 2K25: How To Unlock Everything In MyRise

1 months ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Where is the login entrance for gmail email?

7495

CakePHP Tutorial

1377

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers

Related knowledge

The world's most powerful open source MoE model is here, with Chinese capabilities comparable to GPT-4, and the price is only nearly one percent of GPT-4-Turbo May 07, 2024 pm 04:13 PM

Imagine an artificial intelligence model that not only has the ability to surpass traditional computing, but also achieves more efficient performance at a lower cost. This is not science fiction, DeepSeek-V2[1], the world’s most powerful open source MoE model is here. DeepSeek-V2 is a powerful mixture of experts (MoE) language model with the characteristics of economical training and efficient inference. It consists of 236B parameters, 21B of which are used to activate each marker. Compared with DeepSeek67B, DeepSeek-V2 has stronger performance, while saving 42.5% of training costs, reducing KV cache by 93.3%, and increasing the maximum generation throughput to 5.76 times. DeepSeek is a company exploring general artificial intelligence

AI subverts mathematical research! Fields Medal winner and Chinese-American mathematician led 11 top-ranked papers | Liked by Terence Tao Apr 09, 2024 am 11:52 AM

AI is indeed changing mathematics. Recently, Tao Zhexuan, who has been paying close attention to this issue, forwarded the latest issue of "Bulletin of the American Mathematical Society" (Bulletin of the American Mathematical Society). Focusing on the topic "Will machines change mathematics?", many mathematicians expressed their opinions. The whole process was full of sparks, hardcore and exciting. The author has a strong lineup, including Fields Medal winner Akshay Venkatesh, Chinese mathematician Zheng Lejun, NYU computer scientist Ernest Davis and many other well-known scholars in the industry. The world of AI has changed dramatically. You know, many of these articles were submitted a year ago.

What's going on when the network can't connect to the wifi? Apr 03, 2024 pm 12:11 PM

1. Check the wifi password: Make sure the wifi password you entered is correct and pay attention to case sensitivity. 2. Confirm whether the wifi is working properly: Check whether the wifi router is running normally. You can connect other devices to the same router to determine whether the problem lies with the device. 3. Restart the device and router: Sometimes, there is a malfunction or network problem with the device or router, and restarting the device and router may solve the problem. 4. Check the device settings: Make sure the wireless function of the device is turned on and the wifi function is not disabled.

Hello, electric Atlas! Boston Dynamics robot comes back to life, 180-degree weird moves scare Musk Apr 18, 2024 pm 07:58 PM

Boston Dynamics Atlas officially enters the era of electric robots! Yesterday, the hydraulic Atlas just "tearfully" withdrew from the stage of history. Today, Boston Dynamics announced that the electric Atlas is on the job. It seems that in the field of commercial humanoid robots, Boston Dynamics is determined to compete with Tesla. After the new video was released, it had already been viewed by more than one million people in just ten hours. The old people leave and new roles appear. This is a historical necessity. There is no doubt that this year is the explosive year of humanoid robots. Netizens commented: The advancement of robots has made this year's opening ceremony look like a human, and the degree of freedom is far greater than that of humans. But is this really not a horror movie? At the beginning of the video, Atlas is lying calmly on the ground, seemingly on his back. What follows is jaw-dropping

KAN, which replaces MLP, has been extended to convolution by open source projects Jun 01, 2024 pm 10:03 PM

Earlier this month, researchers from MIT and other institutions proposed a very promising alternative to MLP - KAN. KAN outperforms MLP in terms of accuracy and interpretability. And it can outperform MLP running with a larger number of parameters with a very small number of parameters. For example, the authors stated that they used KAN to reproduce DeepMind's results with a smaller network and a higher degree of automation. Specifically, DeepMind's MLP has about 300,000 parameters, while KAN only has about 200 parameters. KAN has a strong mathematical foundation like MLP. MLP is based on the universal approximation theorem, while KAN is based on the Kolmogorov-Arnold representation theorem. As shown in the figure below, KAN has

Google is ecstatic: JAX performance surpasses Pytorch and TensorFlow! It may become the fastest choice for GPU inference training Apr 01, 2024 pm 07:46 PM

The performance of JAX, promoted by Google, has surpassed that of Pytorch and TensorFlow in recent benchmark tests, ranking first in 7 indicators. And the test was not done on the TPU with the best JAX performance. Although among developers, Pytorch is still more popular than Tensorflow. But in the future, perhaps more large models will be trained and run based on the JAX platform. Models Recently, the Keras team benchmarked three backends (TensorFlow, JAX, PyTorch) with the native PyTorch implementation and Keras2 with TensorFlow. First, they select a set of mainstream

Tesla robots work in factories, Musk: The degree of freedom of hands will reach 22 this year! May 06, 2024 pm 04:13 PM

The latest video of Tesla's robot Optimus is released, and it can already work in the factory. At normal speed, it sorts batteries (Tesla's 4680 batteries) like this: The official also released what it looks like at 20x speed - on a small "workstation", picking and picking and picking: This time it is released One of the highlights of the video is that Optimus completes this work in the factory, completely autonomously, without human intervention throughout the process. And from the perspective of Optimus, it can also pick up and place the crooked battery, focusing on automatic error correction: Regarding Optimus's hand, NVIDIA scientist Jim Fan gave a high evaluation: Optimus's hand is the world's five-fingered robot. One of the most dexterous. Its hands are not only tactile

FisheyeDetNet: the first target detection algorithm based on fisheye camera Apr 26, 2024 am 11:37 AM

Target detection is a relatively mature problem in autonomous driving systems, among which pedestrian detection is one of the earliest algorithms to be deployed. Very comprehensive research has been carried out in most papers. However, distance perception using fisheye cameras for surround view is relatively less studied. Due to large radial distortion, standard bounding box representation is difficult to implement in fisheye cameras. To alleviate the above description, we explore extended bounding box, ellipse, and general polygon designs into polar/angular representations and define an instance segmentation mIOU metric to analyze these representations. The proposed model fisheyeDetNet with polygonal shape outperforms other models and simultaneously achieves 49.5% mAP on the Valeo fisheye camera dataset for autonomous driving

See all articles