The Transformer architecture is now ubiquitous in modern machine learning. Attention, one of its core components, contains a softmax that produces a probability distribution over tokens. Softmax is comparatively costly because it computes exponentials and sums over the sequence length, which makes it hard to parallelize.
Google DeepMind proposes replacing the softmax operation with an alternative that does not necessarily output a probability distribution. They observe that attention using ReLU divided by the sequence length can approach or match traditional softmax attention when used in Vision Transformers.
Paper link: https://arxiv.org/abs/2309.08586
This result opens up new options for parallelization, because ReLU attention can be parallelized over the sequence length dimension and requires fewer gather operations than traditional attention.
Attention
Attention transforms d-dimensional queries, keys and values {q_i, k_i, v_i} through a two-step process.
In the first step, the attention weights α_ij are computed as:
α_ij = ϕ((1/√d) [q_i^T k_1, ..., q_i^T k_L])_j    (1)
where ϕ is usually softmax.
In the next step, the attention weights are used to compute the outputs o_i = Σ_j α_ij v_j. This paper explores point-wise alternatives to ϕ.
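The two-step process can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's implementation; the function names `softmax` and `attention`, the shapes, and the random inputs are assumptions made for the example.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis (the key dimension).
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(q, k, v, phi=softmax):
    # q, k, v: arrays of shape (L, d) for a single head (assumed layout).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)   # scaled dot products, shape (L, L)
    weights = phi(scores)           # step 1: attention weights alpha_ij
    return weights @ v, weights     # step 2: o_i = sum_j alpha_ij v_j

rng = np.random.default_rng(0)
L, d = 6, 4
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
out, weights = attention(q, k, v)
```

With the default `phi=softmax`, every row of `weights` sums to 1. Passing a different `phi` is the only change needed to try a point-wise alternative.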
ReLU attention
DeepMind observes that ϕ = L^{-1} relu is a promising alternative to ϕ = softmax in Eq. 1. They call attention with this choice of ϕ ReLU attention.
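As a rough illustration of ReLU divided by the sequence length, the sketch below computes the weights for one head. The function name `relu_attention_weights` and the test shapes are hypothetical; only the formula comes from the text.

```python
import numpy as np

def relu_attention_weights(q, k):
    # phi = L^{-1} relu, applied point-wise to the scaled dot products.
    L, d = k.shape                      # L = number of keys (sequence length)
    scores = q @ k.T / np.sqrt(d)
    return np.maximum(scores, 0.0) / L

rng = np.random.default_rng(0)
L, d = 8, 16
q = rng.standard_normal((L, d))
k = rng.standard_normal((L, d))
w = relu_attention_weights(q, k)
row_sums = w.sum(axis=-1)               # generally not equal to 1
```

Unlike softmax, no row-wise normalization is needed: each weight depends only on its own score and on L, so the rows of `w` do not form probability distributions.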
Scaled point-wise attention
The researchers also experimentally explored a wider range of choices ϕ = L^{-α} h, where α ∈ [0, 1] and h ∈ {relu, relu², gelu, softplus, identity, relu6, sigmoid}.
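The family of point-wise alternatives, an activation h scaled by the sequence length L raised to the power -α, could be written as follows. This is a sketch under assumptions: the dictionary `H`, the function `scaled_pointwise`, and the tanh approximation used for gelu are choices made here, not details given in the article.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # tanh approximation of GELU (an assumption; the exact variant is not stated).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

H = {
    "relu": relu,
    "relu2": lambda x: relu(x) ** 2,
    "gelu": gelu,
    "softplus": lambda x: np.logaddexp(0.0, x),   # log(1 + exp(x)), stable form
    "identity": lambda x: x,
    "relu6": lambda x: np.clip(x, 0.0, 6.0),
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
}

def scaled_pointwise(scores, h="relu", alpha=1.0):
    # phi = L^{-alpha} h, where L is the number of keys (last axis of scores).
    L = scores.shape[-1]
    return H[h](scores) * L ** (-alpha)

scores = np.array([[-1.0, 0.0, 2.0, 8.0]])
w_relu = scaled_pointwise(scores, "relu", alpha=1.0)
w_relu6 = scaled_pointwise(scores, "relu6", alpha=0.0)
w_id = scaled_pointwise(scores, "identity", alpha=0.5)
```

Setting `alpha=0.0` removes the sequence length scaling entirely, and `alpha=1.0` with `h="relu"` recovers ReLU divided by the sequence length.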
Sequence length scaling
They also found that scaling by a term involving the sequence length L improves accuracy. Previous work that removed softmax did not use this scaling.
Transformers are currently designed around softmax attention, for which Σ_j α_ij = 1, which means E_j[α_ij] = L^{-1}. Although this is unlikely to be a necessary condition, ϕ = L^{-1} relu does ensure that E_j[α_ij] is O(L^{-1}) at initialization. Preserving this condition may reduce the need to change other hyperparameters when replacing softmax.
At initialization, the elements of q and k are O(1), so ⟨q_i, k_j⟩/√d will also be O(1). Activation functions like ReLU preserve O(1), so a factor of L^{-1} is needed for E_j[α_ij] to be O(L^{-1}).
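The scaling argument can be checked numerically. The sketch below assumes queries and keys drawn from a standard normal distribution as a stand-in for initialization; the helper `mean_weights` and the chosen sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

def mean_weights(L):
    # Random q, k with O(1) entries stand in for a freshly initialized model.
    q = rng.standard_normal((L, d))
    k = rng.standard_normal((L, d))
    scores = q @ k.T / np.sqrt(d)          # entries are O(1)
    unscaled = np.maximum(scores, 0.0)     # relu keeps them O(1)
    scaled = unscaled / L                  # L^{-1} relu
    return unscaled.mean(), scaled.mean()

results = {L: mean_weights(L) for L in (64, 256, 1024)}
for L, (unscaled, scaled) in results.items():
    # scaled * L should stay roughly constant as L grows.
    print(L, unscaled, scaled * L)
```

Without the factor, the mean weight stays roughly constant as L grows, so the sum over keys grows with L. With the factor, the mean weight shrinks in proportion to 1/L, as it does for softmax.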
Main results
Figure 1 shows the scaling trends of ReLU attention and softmax attention for ImageNet-21k training. The x-axis shows the total core time required for the experiment, in hours. A major advantage of ReLU attention is that it can be parallelized over the sequence length dimension and requires fewer gather operations than softmax attention.
Effect of sequence length scaling
Figure 2 compares sequence length scaling across various point-wise alternatives to softmax. Specifically, softmax is replaced with relu, relu², gelu, softplus, identity and other functions. The x-axis is α. The y-axis is accuracy for the S/32, S/16, and S/8 Vision Transformer models. The best results are usually obtained when α is close to 1. Since there is no clear best nonlinearity, they used ReLU in their main experiments because it is faster.
Effect of qk-layernorm
The main experiments use qk-layernorm, in which queries and keys are passed through LayerNorm before the attention weights are computed. DeepMind states that qk-layernorm is used by default because it is necessary to prevent instability when scaling up model size. Figure 3 shows the impact of removing qk-layernorm. The result indicates that qk-layernorm has little impact on these models, but this may change at larger model sizes.
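A minimal sketch of qk-layernorm combined with ReLU attention is shown below. It omits the learned scale and bias that a LayerNorm layer normally carries, and the names `layer_norm` and `relu_attention_qk_ln` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each vector over the feature axis; learned scale and bias omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu_attention_qk_ln(q, k, v):
    # qk-layernorm: normalize queries and keys before computing the weights.
    q, k = layer_norm(q), layer_norm(k)
    L, d = k.shape
    scores = q @ k.T / np.sqrt(d)
    weights = np.maximum(scores, 0.0) / L
    return weights @ v

rng = np.random.default_rng(0)
L, d = 5, 8
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
out = relu_attention_qk_ln(q, k, v)
out_big_q = relu_attention_qk_ln(100.0 * q, k, v)   # queries blown up 100x
```

Because the queries and keys are normalized first, `out_big_q` is nearly identical to `out`. This gives some intuition for why the normalization can guard against instability when activations grow.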
Effect of adding a gate
Previous work that removed softmax added a gating unit but did not scale by sequence length. Specifically, in the gated attention unit, an additional projection produces an output that is combined by element-wise multiplication before the output projection. Figure 4 explores whether the presence of a gate removes the need for sequence length scaling. Overall, DeepMind observes that the best accuracy, with or without gating, is achieved by using sequence length scaling. Also note that for the S/8 model using ReLU, gating increases the core time required for the experiment by approximately 9.3%.
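The gated variant described above might look like the following sketch. The weight names (`w_q`, `w_k`, `w_v`, `w_gate`, `w_out`), the single-head layout, and the absence of an activation on the gate are all assumptions; the article only states that an extra projection is multiplied element-wise into the result before the output projection.

```python
import numpy as np

def gated_relu_attention(x, w_q, w_k, w_v, w_gate, w_out, alpha=1.0):
    # x: (L, d_model). All weight matrices are hypothetical parameters.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    L, d = k.shape
    # alpha=1.0 applies sequence length scaling; alpha=0.0 omits it.
    weights = np.maximum(q @ k.T / np.sqrt(d), 0.0) * L ** (-alpha)
    attended = weights @ v
    gate = x @ w_gate                   # additional projection (no activation assumed)
    return (attended * gate) @ w_out    # element-wise gate, then output projection

rng = np.random.default_rng(0)
L, d_model, d = 6, 12, 8
x = rng.standard_normal((L, d_model))
w_q, w_k, w_v, w_gate = (rng.standard_normal((d_model, d)) for _ in range(4))
w_out = rng.standard_normal((d, d_model))
y = gated_relu_attention(x, w_q, w_k, w_v, w_gate, w_out)
y_closed = gated_relu_attention(x, w_q, w_k, w_v, np.zeros((d_model, d)), w_out)
```

The extra `x @ w_gate` projection and the element-wise product are additional work on top of plain ReLU attention, which is consistent with the extra core time reported for gating.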