Task universality is one of the core goals of foundation model research, and it is a necessary step on the path from deep learning research to advanced intelligence. In recent years, thanks to the general modeling capability of the attention mechanism, the Transformer has performed well in many fields and is gradually becoming a universal architecture. However, the computation of the standard attention mechanism grows quadratically with sequence length, which seriously hinders its application to long-sequence modeling and large models.
To address this, a team from the School of Software at Tsinghua University studied the problem in depth and proposed Flowformer, a task-universal backbone network that reduces the complexity to linear while maintaining the versatility of the standard Transformer. The paper was accepted at ICML 2022.
Authors: Wu Haixu, Wu Jialong, Xu Jiehui, Wang Jianmin, Long Mingsheng
Link: https://arxiv.org/pdf/2202.06258.pdf
Code: https://github.com/thuml/Flowformer
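Compared with the standard Transformer, the Flowformer model proposed in this article has linear complexity, introduces no new inductive bias, and performs well on five major tasks: long sequence, vision, natural language, time series, and reinforcement learning.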
The input to the standard attention mechanism consists of three parts: queries ($Q \in \mathbb{R}^{n \times d}$), keys ($K \in \mathbb{R}^{m \times d}$) and values ($V \in \mathbb{R}^{m \times d}$). It is computed as follows:

$$R = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

where $\mathrm{softmax}(QK^\top/\sqrt{d})$ is the attention weight matrix, and the final result $R$ is obtained by a weighted fusion of $V$. The computational complexity of this process is $O(nmd)$, which is quadratic in the sequence length when $n = m$.

Matrix chain multiplication has been studied extensively in classical algorithms. For the attention mechanism in particular, the associative law of matrix multiplication offers an optimization: since $(QK^\top)V = Q(K^\top V)$, the original quadratic complexity can be reduced to linear in the sequence length. However, the softmax function in the attention mechanism makes it impossible to apply the associative law directly. Removing softmax is therefore the key to achieving linear complexity. At the same time, much recent work has shown that softmax plays a key role in avoiding trivial attention. In summary, we look for a model design that achieves the following goals: (1) remove softmax; (2) avoid trivial attention; (3) maintain the versatility of the model.
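The role of the associative law can be checked numerically. The following is a minimal NumPy sketch, not taken from the paper; the sizes `n`, `m`, `d` and the random inputs are hypothetical. It shows that the two groupings of the product agree when no softmax is present, and that the linear-time grouping no longer reproduces the result once softmax is applied to the score matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, m, d = 6, 5, 4                 # hypothetical sizes: n queries, m keys/values, width d
Q = rng.normal(size=(n, d))
K = rng.normal(size=(m, d))
V = rng.normal(size=(m, d))

# Without softmax, both groupings give the same result.
quadratic = (Q @ K.T) @ V         # forms an n x m matrix first: O(n * m * d)
linear = Q @ (K.T @ V)            # forms a d x d matrix first: O((n + m) * d^2)
same_without_softmax = np.allclose(quadratic, linear)

# With softmax, the n x m weight matrix has to be formed explicitly,
# so the regrouped product is no longer equivalent.
standard = softmax(Q @ K.T / np.sqrt(d)) @ V
same_with_softmax = np.allclose(standard, linear)

print(same_without_softmax, same_with_softmax)
```

The second grouping never materializes the $n \times m$ matrix, which is where the linear cost in sequence length comes from.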
2. Motivation

For goal (1), previous work often replaces the softmax function with a kernel method, that is, it approximates the attention weights by $\phi(Q)\phi(K)^\top$, where $\phi(\cdot)$ is a non-linear function. However, removing softmax directly leads to trivial attention. To meet goal (2), previous work therefore had to introduce additional inductive biases, such as the locality assumption in cosFormer. These biases limit the versatility of the model, so goal (3) is not met.
Competition mechanism in Softmax

To meet the above goals, we start from the basic properties of softmax. It was originally proposed to extend the "winner-take-all" maximum operation into a differentiable form. Thanks to this inherent "competition" mechanism, softmax differentiates the attention weights across tokens and thereby avoids trivial attention. Based on these considerations, we try to introduce a competition mechanism into the attention design, so as to avoid the trivial attention caused by kernel-method decomposition.
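The competition inside softmax can be illustrated with a small NumPy sketch. The scores and the `temperature` parameter are hypothetical and only serve the illustration: a low temperature pushes softmax towards the hard "winner-take-all" maximum, and because the weights always sum to 1, raising one score takes weight away from the others.

```python
import numpy as np

def softmax(x, temperature=1.0):
    """Softmax over a 1-D score vector with an illustrative temperature."""
    z = np.asarray(x, dtype=float) / temperature
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.0, 2.0, 3.0])            # hypothetical scores for one query

soft = softmax(scores)                         # smooth weights, largest score leads
sharp = softmax(scores, temperature=0.01)      # nearly one-hot: winner takes all

# Competition: the total weight is fixed at 1, so boosting the first
# score lowers the weights of the other two tokens.
boosted = softmax(scores + np.array([2.0, 0.0, 0.0]))

print(soft.round(3), sharp.round(3), boosted.round(3))
```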
Competition mechanism in network flow

In the classic flow network model from graph theory, conservation is an important phenomenon: the inflow of each node equals its outflow. Inspired by the idea that "fixed resources will inevitably cause competition", in this article we re-analyze the information flow in the classic attention mechanism from the perspective of network flow, and use the conservation property to introduce competition into the attention design and avoid trivial attention.

3. Flowformer
Inside the attention mechanism, information flows from the sources (corresponding to $V$) to the sinks (corresponding to $R$) according to the learned flow capacities (corresponding to the attention weights).
Outside the attention mechanism, the information of the sources ($V$) comes from the previous layer of the network, and the information of the sinks ($R$) is passed on to the feed-forward layer that follows.

3.2 Flow-Attention

Based on the above observations, we can control the interaction between the attention mechanism and the external network from two perspectives, inflow and outflow, to achieve "fixed resources". This causes competition within the sources and within the sinks respectively, which avoids trivial attention. Without loss of generality, we set the amount of information exchanged between the attention mechanism and the external network to the default value 1. Below, $Q_i$ denotes the query of the $i$-th sink ($i = 1, \dots, n$), $K_j$ the key of the $j$-th source ($j = 1, \dots, m$), and $\phi(\cdot)$ the non-negative non-linear function of the kernel method.

(1) Conservation of inflow to the sinks ($R$): it is not difficult to see that, before conservation, the amount of information flowing into the $i$-th sink is

$$I_i = \phi(Q_i)\sum_{j=1}^{m}\phi(K_j)^\top.$$

To fix the amount of information flowing into each sink to 1, we introduce $I_i$ as a normalizer in the computation of the information flow (attention weights). After normalization, the inflow of the $i$-th sink is

$$\frac{\phi(Q_i)}{I_i}\sum_{j=1}^{m}\phi(K_j)^\top = 1.$$

Because the inflow of each sink is now conserved, the sources ($V$) naturally compete with each other. Computing the amount of information provided by each source at this point gives

$$\widehat{O}_j = \phi(K_j)\sum_{i=1}^{n}\frac{\phi(Q_i)^\top}{I_i},$$

the amount of information provided by the $j$-th source under competition, which also represents the importance of that source.

(2) Conservation of outflow from the sources ($V$): similarly, before conservation, the amount of information flowing out of the $j$-th source is

$$O_j = \phi(K_j)\sum_{i=1}^{n}\phi(Q_i)^\top.$$

To fix the amount of information flowing out of each source to 1, we introduce $O_j$ as a normalizer in the computation of the information flow (attention weights). After normalization, the outflow of the $j$-th source is

$$\frac{\phi(K_j)}{O_j}\sum_{i=1}^{n}\phi(Q_i)^\top = 1.$$

Because the outflow of each source is now conserved, the sinks ($R$) naturally compete with each other.
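The two conservation steps can be checked with a short NumPy sketch. It assumes the flow definitions of the Flowformer paper (inflow of a sink $I_i = \phi(Q_i)\sum_j \phi(K_j)^\top$, outflow of a source $O_j = \phi(K_j)\sum_i \phi(Q_i)^\top$) and takes sigmoid as the non-negative map $\phi$; the sizes and random inputs are hypothetical.

```python
import numpy as np

def phi(x):
    """Non-negative feature map; sigmoid is an assumed choice for this sketch."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, m, d = 6, 5, 4                 # hypothetical sizes: n sinks, m sources, width d
Q = rng.normal(size=(n, d))       # one row per sink
K = rng.normal(size=(m, d))       # one row per source
pq, pk = phi(Q), phi(K)

# Flows before conservation.
I = pq @ pk.sum(axis=0)           # I_i = phi(Q_i) . sum_j phi(K_j), shape (n,)
O = pk @ pq.sum(axis=0)           # O_j = phi(K_j) . sum_i phi(Q_i), shape (m,)

# After normalization every sink receives one unit and every source emits one unit.
sink_inflow = (pq / I[:, None]) @ pk.sum(axis=0)
source_outflow = (pk / O[:, None]) @ pq.sum(axis=0)

# Conserved flows: what each source provides, and what each sink receives,
# once the other side has been fixed to one unit per node.
O_hat = pk @ (pq / I[:, None]).sum(axis=0)    # shape (m,)
I_hat = pq @ (pk / O[:, None]).sum(axis=0)    # shape (n,)

print(sink_inflow, source_outflow)
print(O_hat.sum(), I_hat.sum())
```

The conserved outflows of all sources add up to the number of sinks, and the conserved inflows of all sinks add up to the number of sources. The total is fixed, so a source or sink can only gain importance at the expense of the others, which is the competition the text describes.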
We then compute the amount of information received by each sink ($R$), which gives

$$\widehat{I}_i = \phi(Q_i)\sum_{j=1}^{m}\frac{\phi(K_j)^\top}{O_j},$$

where $O_j$ is the outflow of the $j$-th source before conservation. This is the amount of information that each final result receives under competition.

(3) Overall design: based on the above results, we design the following Flow-Attention mechanism, which consists of three parts, competition, aggregation and allocation:

$$\text{Competition: } \widehat{V} = \mathrm{Softmax}(\widehat{O}) \odot V$$

$$\text{Aggregation: } A = \frac{\phi(Q)}{I}\left(\phi(K)^\top \widehat{V}\right)$$

$$\text{Allocation: } R = \mathrm{Sigmoid}(\widehat{I}) \odot A$$

Here $I$ is the inflow of the sinks before conservation, $\widehat{O}$ and $\widehat{I}$ are the conserved outflow of the sources and the conserved inflow of the sinks, and $\odot$ is element-wise multiplication. Competition introduces the competition mechanism to highlight important information; Aggregation achieves linear complexity through the associative law of matrix multiplication; Allocation introduces the competition mechanism to control the information passed on to the next layer. All operations in this process have linear complexity. At the same time, the design of Flow-Attention relies only on the conservation principle of network flow and re-organizes the information flow. It therefore introduces no new inductive bias, which preserves the versatility of the model. Flowformer is obtained by replacing the quadratic-complexity attention in the standard Transformer with Flow-Attention.

4. Experiments

The paper reports extensive experiments on standard datasets. Across the five tasks in its summary table, Flowformer performs well, which supports the versatility of the model. Please see the paper for the detailed experimental results.

5. Analysis

To further explain how Flowformer works, we visualized the attention learned in the ImageNet classification task (corresponding to Flow-Attention). The visualization shows that introducing competition into the attention design through Flow-Attention can effectively avoid trivial attention. More visualization experiments can be found in the paper.
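Returning to the overall design of Flow-Attention, its three steps (competition, aggregation, allocation) can be sketched in NumPy as follows. This is a simplified single-head, non-causal illustration under stated assumptions, not the authors' implementation: it takes sigmoid as the non-negative map $\phi$, follows only the three equations of the overall design, and leaves out multi-head handling and any numerical details of the official code. The function name `flow_attention` and the input sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def flow_attention(Q, K, V, phi=sigmoid):
    """Single-head, non-causal Flow-Attention sketch.

    Q: (n, d) queries of the sinks; K: (m, d) keys and V: (m, d_v)
    values of the sources. Returns R with shape (n, d_v).
    """
    pq, pk = phi(Q), phi(K)
    I = pq @ pk.sum(axis=0)                       # inflow per sink, (n,)
    O = pk @ pq.sum(axis=0)                       # outflow per source, (m,)
    I_hat = pq @ (pk / O[:, None]).sum(axis=0)    # conserved inflow, (n,)
    O_hat = pk @ (pq / I[:, None]).sum(axis=0)    # conserved outflow, (m,)

    V_hat = softmax(O_hat)[:, None] * V           # competition among sources
    A = (pq / I[:, None]) @ (pk.T @ V_hat)        # aggregation via a (d, d_v) matrix
    return sigmoid(I_hat)[:, None] * A            # allocation to each sink

rng = np.random.default_rng(0)
Q = rng.normal(size=(7, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 3))
R = flow_attention(Q, K, V)
print(R.shape)
```

The aggregation step multiplies `pk.T @ V_hat` first, so no $n \times m$ attention matrix is ever formed and every operation is linear in the number of sources and sinks.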
6. Summary

Flowformer, proposed in this article, brings the conservation principle of network flow into the design and thereby introduces the competition mechanism into the attention computation in a natural way. It effectively avoids the trivial attention problem and maintains the versatility of the standard Transformer while achieving linear complexity. Flowformer achieves excellent results on five major tasks: long sequence, vision, natural language, time series, and reinforcement learning. In addition, its design concept of "no special inductive bias" is instructive for research on general-purpose foundation architectures. In future work, we will further explore the potential of Flowformer for large-scale pre-training.