DeepSeek-V3 Explained 1: Multi-head Latent Attention (MLA)
This article is the first in the "DeepSeek-V3 Explained" series, in which we take an in-depth look at DeepSeek's latest open-source model, DeepSeek-V3 [1, 2].
This series of articles will cover two main topics. This first article focuses on Multi-head Latent Attention (MLA), a technique that was originally proposed during the development of DeepSeek-V2 and is also used in DeepSeek-V3.
Contents:
- MHA in Decoder-only Transformers
- Key-Value Cache
- Multi-Query Attention (MQA) vs. Grouped-Query Attention (GQA)
- Rotary Position Embedding (RoPE)
- Multi-head Latent Attention (MLA): the high-level idea, why decoupled RoPE is needed, and performance
### MHA in Decoder-only Transformers
The figure below compares three Transformer architectures used for decoding. Panel (a) shows the encoder-decoder architecture proposed in the original "Attention is All You Need" paper. Its decoder part was later simplified in [6], yielding the decoder-only Transformer shown in (b), which was subsequently adopted by many generative models such as GPT [8].
Today's large language models more commonly use the structure shown in (c) for more stable training: normalization is applied to the input of each sub-layer rather than its output, and LayerNorm is upgraded to RMSNorm. This will serve as the baseline architecture we discuss in this article.
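To make the pre-norm + RMSNorm idea concrete, here is a minimal PyTorch sketch (not DeepSeek's actual implementation; the class names, default eps, and dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by 1/RMS(x); no mean subtraction, unlike LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms_inv = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms_inv

class PreNormBlock(nn.Module):
    """Pre-norm residual block: normalize the sub-layer's *input*, then add the residual."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))  # normalization applied before the sub-layer, not after
```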
In this context, MHA calculations largely follow the process in [6], as shown in the figure below:
Suppose we have n_h attention heads, and the dimension of each head is denoted d_h, so the concatenated dimension is (n_h · d_h). For a given layer, if we denote the input of the t-th token in that layer as h_t, with dimension d, we then need linear projection matrices to map h_t from dimension d to (n_h · d_h).
More formally, we have (Eqn. from [3]):

q_t = W^Q h_t,  k_t = W^K h_t,  v_t = W^V h_t

where W^Q, W^K and W^V are the linear projection matrices, each mapping from dimension d to (n_h · d_h).
Then q_t, k_t and v_t are each split into n_h heads to compute scaled dot-product attention:

o_{t,i} = Σ_{j=1}^{t} Softmax_j( (q_{t,i})^T k_{j,i} / √(d_h) ) · v_{j,i}

where o_{t,i} is the attention output of the i-th head for token t.
Finally, W^O is another projection matrix that maps the concatenated head outputs back from dimension (n_h · d_h) to d:

u_t = W^O [o_{t,1}; o_{t,2}; …; o_{t,n_h}]
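Putting the equations above together, a minimal, unoptimized PyTorch sketch of MHA might look as follows (illustrative names and shapes, not DeepSeek's code; a causal mask is assumed since we are in the decoder-only setting):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d: int, n_h: int, d_h: int):
        super().__init__()
        self.n_h, self.d_h = n_h, d_h
        self.W_q = nn.Linear(d, n_h * d_h, bias=False)  # W^Q
        self.W_k = nn.Linear(d, n_h * d_h, bias=False)  # W^K
        self.W_v = nn.Linear(d, n_h * d_h, bias=False)  # W^V
        self.W_o = nn.Linear(n_h * d_h, d, bias=False)  # W^O

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        B, T, _ = h.shape
        # Project and split into n_h heads of dimension d_h
        q = self.W_q(h).view(B, T, self.n_h, self.d_h).transpose(1, 2)
        k = self.W_k(h).view(B, T, self.n_h, self.d_h).transpose(1, 2)
        v = self.W_v(h).view(B, T, self.n_h, self.d_h).transpose(1, 2)
        # Scaled dot-product attention with a causal mask
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_h)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        o = torch.softmax(scores, dim=-1) @ v
        # Concatenate heads and project back to dimension d with W^O
        return self.W_o(o.transpose(1, 2).reshape(B, T, self.n_h * self.d_h))
```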
### Key-Value Cache
In autoregressive decoding, the keys and values of previously generated tokens are stored in a key-value (KV) cache so that they do not have to be recomputed at every step. Note that the KV cache is typically used only during inference, because during training we still need to process the entire input sequence in parallel.
The KV cache is commonly implemented as a rolling buffer. At each decoding step, only the new query Q is computed, while the K and V stored in the cache are reused, so attention is computed from the new Q together with the cached K and V. At the same time, the new token's K and V are appended to the cache for later use.
However, the speedup provided by the KV cache comes at a memory cost, since the cache grows with

batch size × sequence length × hidden size × number of heads,

which can become a memory bottleneck when the batch size is large or the sequence is long. This limitation motivates two techniques designed to address it: Multi-Query Attention (MQA) and Grouped-Query Attention (GQA).
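To illustrate the caching logic described above, here is a simplified, append-only cache sketch in PyTorch (a production rolling buffer would pre-allocate and evict entries; all names are illustrative):

```python
import torch

class KVCache:
    """Toy per-layer KV cache: stores K and V for all past tokens of one sequence."""
    def __init__(self):
        self.k = None  # shape: (B, n_h, T_cached, d_h)
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # Append the new token's K and V so later steps can reuse them.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

# At decoding step t we only compute q_t, k_t, v_t for the new token,
# then attend over the cached keys/values:
#   k_all, v_all = cache.update(k_t, v_t)
#   scores = q_t @ k_all.transpose(-2, -1) / math.sqrt(d_h)
```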
### Multi-Query Attention (MQA) vs. Grouped-Query Attention (GQA)
In MQA, a single key head and a single value head are shared across all query heads, which dramatically shrinks the KV cache but also degrades modeling quality. GQA can be viewed as an interpolation between MHA and MQA: each pair of key and value heads is shared by only a group of query heads rather than by all of them. Even so, GQA still yields worse results than MHA.
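To get a feel for the savings, here is a back-of-the-envelope calculation of the per-layer KV cache size under illustrative, assumed settings (the head counts and sequence length are not taken from any particular model); only the number of key/value heads changes across the three variants:

```python
# Assumed, illustrative settings (fp16 elements; not from any particular model)
batch, seq_len, d_h, bytes_per_el = 8, 4096, 128, 2
n_q_heads = 32

def kv_cache_bytes(n_kv_heads: int) -> int:
    # 2x for keys and values; per-layer figure (multiply by the number of layers for the full model)
    return 2 * batch * seq_len * n_kv_heads * d_h * bytes_per_el

print("MHA:", kv_cache_bytes(n_q_heads) / 2**30, "GiB per layer")  # 32 KV heads -> 0.5 GiB
print("GQA:", kv_cache_bytes(8) / 2**30, "GiB per layer")          # 8 KV groups -> 0.125 GiB
print("MQA:", kv_cache_bytes(1) / 2**30, "GiB per layer")          # 1 KV head  -> ~0.016 GiB
```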
In the later sections, we will see how MLA manages to balance memory efficiency and modeling accuracy.
### Rotary Position Embedding (RoPE)

The last piece of background we need is RoPE [11], which encodes position information directly into the attention mechanism by rotating the query and key vectors in multi-head attention with sinusoidal functions.
More specifically, RoPE applies a position-dependent rotation matrix to each token's query and key vectors, using sine and cosine functions as its basis but applying them in a distinctive way to achieve the rotation.
To understand what makes it position-dependent, consider a toy embedding vector with only 4 elements, i.e. (x_1, x_2, x_3, x_4). To apply RoPE, we first group consecutive dimensions into pairs: (x_1, x_2) forms the first pair and (x_3, x_4) the second. Then we apply a 2-D rotation matrix to each pair:

[ cos θ   −sin θ ]
[ sin θ    cos θ ]
where θ = θ(p) = p ⋅ θ_0, and θ_0 is a base frequency. In our 4-D toy example, this means (x_1, x_2) will be rotated by θ_0, while (x_3, x_4) will be rotated by 2 ⋅ θ_0.
This is why we call it a position-dependent rotation matrix: at each position (i.e., for each pair), a different rotation matrix is applied, with the rotation angle determined by the position. RoPE is widely used in modern large language models because of its efficiency in encoding long sequences, but, as the formula above makes clear, it is position-sensitive in both Q and K, which makes it incompatible with MLA in certain respects.
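A minimal sketch of the pairwise rotation described above (the pair grouping and the frequency schedule follow the common RoPE convention; function and variable names are illustrative):

```python
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, theta_base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to x of shape (..., seq_len, dim) given positions of shape (seq_len,)."""
    dim = x.shape[-1]
    # One frequency per pair of dimensions: theta_i = theta_base^(-2i/dim)
    freqs = theta_base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos[:, None].float() * freqs[None, :]       # (seq_len, dim/2), angle = position * frequency
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                   # split the vector into consecutive pairs
    # 2-D rotation of each pair by its position-dependent angle
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Usage: q = rope_rotate(q, torch.arange(seq_len)); k = rope_rotate(k, torch.arange(seq_len))
```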
## Multi-head Latent Attention (MLA)
### MLA: The High-Level Idea
The core idea of MLA is to compress the attention input h_t into a low-dimensional latent vector of dimension d_c, where d_c is much smaller than the original (n_h · d_h). Later, when we need to compute attention, we can map this latent vector back to the high-dimensional space to recover the keys and values. As a result, only the latent vector needs to be cached, which significantly reduces memory usage. This process can be described more formally by the following equations:

c^{KV}_t = W^{DKV} h_t
k^C_t = W^{UK} c^{KV}_t
v^C_t = W^{UV} c^{KV}_t

where c^{KV}_t is the latent vector, W^{DKV} is the compression matrix that maps h_t from its original dimension down to d_c (the D in the superscript stands for "down-projection", i.e., compression), and W^{UK} and W^{UV} are up-projection matrices that map the shared latent vector back to the high-dimensional space. Similarly, we can map the query into a low-dimensional latent vector and then map it back to the original high-dimensional space:

c^{Q}_t = W^{DQ} h_t
q^C_t = W^{UQ} c^{Q}_t
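To illustrate the compression idea (ignoring RoPE for now), a minimal PyTorch sketch could look like the following; dimensions and names are assumptions for illustration, not DeepSeek's implementation:

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Compress h_t into a small latent c_kv, then recover keys/values on the fly."""
    def __init__(self, d: int, d_c: int, n_h: int, d_h: int):
        super().__init__()
        self.W_dkv = nn.Linear(d, d_c, bias=False)          # W^{DKV}: down-projection (compression)
        self.W_uk = nn.Linear(d_c, n_h * d_h, bias=False)   # W^{UK}: up-projection for keys
        self.W_uv = nn.Linear(d_c, n_h * d_h, bias=False)   # W^{UV}: up-projection for values

    def forward(self, h: torch.Tensor):
        c_kv = self.W_dkv(h)   # only this latent vector needs to be cached
        k = self.W_uk(c_kv)    # keys recovered from the latent
        v = self.W_uv(c_kv)    # values recovered from the latent
        return c_kv, k, v
```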
### Why Decoupled RoPE Is Needed
As mentioned earlier, RoPE is a common choice for generative models that need to handle long sequences. However, if we apply the MLA strategy above directly, it becomes incompatible with RoPE.
To see this more clearly, consider what happens when we compute attention with the compressed queries and keys defined above: when we multiply the transposed q with k, the matrices W^Q and W^{UK} appear in the middle, and their product is equivalent to a single mapping from d_c to d. In the original paper [3], the authors describe this as W^{UK} being "absorbed" into W^Q, so we do not need to store W^{UK} explicitly, thereby further reducing memory usage.
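This "absorption" argument can be sanity-checked numerically: with no rotation matrix in between, q_t^T k_t = h_t^T (W^Q)^T W^{UK} c^{KV}_t, so the product (W^Q)^T W^{UK} can be precomputed once. A toy check with arbitrary (assumed) shapes:

```python
import torch

d, d_c, n = 64, 16, 64      # illustrative dimensions; n stands in for n_h * d_h
W_q  = torch.randn(n, d)    # W^Q: maps h (dim d) to queries (dim n)
W_uk = torch.randn(n, d_c)  # W^{UK}: maps the latent (dim d_c) to keys (dim n)
h    = torch.randn(d)
c_kv = torch.randn(d_c)

score_direct   = (W_q @ h) @ (W_uk @ c_kv)   # q_t^T k_t, computing q and k explicitly
W_absorbed     = W_q.T @ W_uk                # (W^Q)^T W^{UK}, shape (d, d_c), precomputable
score_absorbed = h @ (W_absorbed @ c_kv)     # same score without ever forming k_t

print(torch.allclose(score_direct, score_absorbed, atol=1e-3))  # True, up to floating-point error
```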
However, this is no longer the case once we take the RoPE rotation matrix into account, because RoPE applies a rotation matrix to the left of W^{UK}, and this rotation matrix ends up sitting between the transposed W^Q and W^{UK}.
As we explained in the background section, this rotation matrix is position-dependent, meaning it differs for every position. As a result, W^{UK} can no longer be absorbed into W^Q. To resolve this conflict, the authors propose what they call "decoupled RoPE": additional query vectors and a shared key vector are introduced, and these extra vectors are used only in the RoPE step, keeping the original keys isolated from the rotation matrix.
The entire MLA process can be summarized as follows (the equations correspond to Appendix C of [3]):

c^Q_t = W^{DQ} h_t
q^C_t = W^{UQ} c^Q_t = [q^C_{t,1}; …; q^C_{t,n_h}]
q^R_t = RoPE(W^{QR} c^Q_t) = [q^R_{t,1}; …; q^R_{t,n_h}]
q_{t,i} = [q^C_{t,i}; q^R_{t,i}]

c^{KV}_t = W^{DKV} h_t
k^C_t = W^{UK} c^{KV}_t = [k^C_{t,1}; …; k^C_{t,n_h}]
k^R_t = RoPE(W^{KR} h_t)
k_{t,i} = [k^C_{t,i}; k^R_t]
v^C_t = W^{UV} c^{KV}_t = [v^C_{t,1}; …; v^C_{t,n_h}]

o_{t,i} = Σ_{j=1}^{t} Softmax_j( (q_{t,i})^T k_{j,i} / √(d_h + d^R_h) ) · v^C_{j,i}
u_t = W^O [o_{t,1}; …; o_{t,n_h}]

where q^R_t and the shared k^R_t are the decoupled components that carry the RoPE rotation, and only the latent vector c^{KV}_t and the shared rotary key k^R_t need to be cached during inference.
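Finally, a condensed sketch of how the decoupled pieces fit together (heavily simplified; it reuses the rope_rotate helper from the RoPE sketch above, and all shapes and names are illustrative assumptions rather than DeepSeek's implementation):

```python
import torch
import torch.nn as nn

class MLADecoupledSketch(nn.Module):
    """Per token, only c_kv (dim d_c) and the shared rotary key k_r (dim d_r) need caching."""
    def __init__(self, d, d_c, d_cq, n_h, d_h, d_r):
        super().__init__()
        self.n_h, self.d_h, self.d_r = n_h, d_h, d_r
        self.W_dq = nn.Linear(d, d_cq, bias=False)           # W^{DQ}
        self.W_uq = nn.Linear(d_cq, n_h * d_h, bias=False)   # W^{UQ}
        self.W_qr = nn.Linear(d_cq, n_h * d_r, bias=False)   # W^{QR}: decoupled (RoPE) queries
        self.W_dkv = nn.Linear(d, d_c, bias=False)           # W^{DKV}
        self.W_uk = nn.Linear(d_c, n_h * d_h, bias=False)    # W^{UK}
        self.W_uv = nn.Linear(d_c, n_h * d_h, bias=False)    # W^{UV}
        self.W_kr = nn.Linear(d, d_r, bias=False)            # W^{KR}: shared decoupled (RoPE) key

    def forward(self, h, pos):
        B, T, _ = h.shape
        split = lambda x, dh: x.view(B, T, self.n_h, dh).transpose(1, 2)  # -> (B, n_h, T, dh)
        c_q = self.W_dq(h)
        q_c = split(self.W_uq(c_q), self.d_h)
        q_r = rope_rotate(split(self.W_qr(c_q), self.d_r), pos)           # RoPE only on the decoupled part
        c_kv = self.W_dkv(h)                                              # cached at inference time
        k_c = split(self.W_uk(c_kv), self.d_h)
        v = split(self.W_uv(c_kv), self.d_h)
        k_r = rope_rotate(self.W_kr(h), pos).unsqueeze(1).expand_as(q_r)  # cached; shared across heads
        q = torch.cat([q_c, q_r], dim=-1)   # q_{t,i} = [q^C_{t,i}; q^R_{t,i}]
        k = torch.cat([k_c, k_r], dim=-1)   # k_{t,i} = [k^C_{t,i}; k^R_t]
        return q, k, v, c_kv                # q, k, v feed into standard scaled dot-product attention
```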
### Performance of MLA
Interestingly, MLA's modeling capabilities even surpass those of the original MHA.
More specifically, the table below compares the performance of MHA, GQA and MQA on 7B-scale models, where MHA significantly outperforms both MQA and GQA.
The authors of [3] also compare MHA with MLA, and the results, summarized in the table below, show that MLA achieves better results overall.