DeepSeek-V3 Explained 1: Multi-head Latent Attention (MLA)
This article is the first in the "DeepSeek-V3 Explained" series, in which we take an in-depth look at DeepSeek's latest open-source model, DeepSeek-V3 [1, 2].
This series of articles will cover two main topics. This first article focuses on Multi-head Latent Attention (MLA), a technique that was originally proposed during the development of DeepSeek-V2 and is also used in DeepSeek-V3.
Contents:
- MHA in Decoder-only Transformers
- Key-Value Cache
- Multi-Query Attention (MQA) vs. Grouped-Query Attention (GQA)
- Rotary Position Embedding (RoPE)
- Multi-head Latent Attention (MLA): the high-level idea, why decoupled RoPE is needed, and performance
### MHA in Decoder-only Transformers
The figure below compares three Transformer architectures used for decoding. Panel (a) shows the encoder-decoder architecture proposed in the original "Attention is All You Need" paper. Its decoder part was later simplified in [6], yielding the decoder-only Transformer shown in (b), which was subsequently adopted by many generative models such as GPT [8].
Today's large language models more commonly use the structure shown in (c) for more stable training: normalization is applied to the input of each sub-layer rather than its output, and LayerNorm is upgraded to RMSNorm. This will serve as the baseline architecture we discuss in this article.
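To make the pre-norm + RMSNorm idea concrete, here is a minimal PyTorch sketch (not DeepSeek's actual implementation; the class names, default eps, and dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by 1/RMS(x); no mean subtraction, unlike LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms_inv = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms_inv

class PreNormBlock(nn.Module):
    """Pre-norm residual block: normalize the sub-layer's *input*, then add the residual."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))  # normalization applied before the sub-layer, not after
```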
In this context, MHA calculations largely follow the process in [6], as shown in the figure below:
Suppose we have n_h attention heads, and the dimension of each head is denoted d_h, so the concatenated dimension is (n_h · d_h). For a given layer, if we denote the input of the t-th token in that layer as h_t, with dimension d, we then need linear projection matrices to map h_t from dimension d to (n_h · d_h).
More formally, we have (Eqn. from [3]):

q_t = W^Q h_t,  k_t = W^K h_t,  v_t = W^V h_t

where W^Q, W^K and W^V are the linear projection matrices, each mapping from dimension d to (n_h · d_h).
Then q_t, k_t and v_t are each split into n_h heads to compute scaled dot-product attention:

o_{t,i} = Σ_{j=1}^{t} Softmax_j( (q_{t,i})^T k_{j,i} / √(d_h) ) · v_{j,i}

where o_{t,i} is the attention output of the i-th head for token t.
Finally, W^O is another projection matrix that maps the concatenated head outputs back from dimension (n_h · d_h) to d:

u_t = W^O [o_{t,1}; o_{t,2}; …; o_{t,n_h}]
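Putting the equations above together, a minimal, unoptimized PyTorch sketch of MHA might look as follows (illustrative names and shapes, not DeepSeek's code; a causal mask is assumed since we are in the decoder-only setting):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d: int, n_h: int, d_h: int):
        super().__init__()
        self.n_h, self.d_h = n_h, d_h
        self.W_q = nn.Linear(d, n_h * d_h, bias=False)  # W^Q
        self.W_k = nn.Linear(d, n_h * d_h, bias=False)  # W^K
        self.W_v = nn.Linear(d, n_h * d_h, bias=False)  # W^V
        self.W_o = nn.Linear(n_h * d_h, d, bias=False)  # W^O

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        B, T, _ = h.shape
        # Project and split into n_h heads of dimension d_h
        q = self.W_q(h).view(B, T, self.n_h, self.d_h).transpose(1, 2)
        k = self.W_k(h).view(B, T, self.n_h, self.d_h).transpose(1, 2)
        v = self.W_v(h).view(B, T, self.n_h, self.d_h).transpose(1, 2)
        # Scaled dot-product attention with a causal mask
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_h)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        o = torch.softmax(scores, dim=-1) @ v
        # Concatenate heads and project back to dimension d with W^O
        return self.W_o(o.transpose(1, 2).reshape(B, T, self.n_h * self.d_h))
```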
### Key-Value Cache
In autoregressive decoding, the keys and values of previously generated tokens are stored in a key-value (KV) cache so that they do not have to be recomputed at every step. Note that the KV cache is typically used only during inference, because during training we still need to process the entire input sequence in parallel.
The KV cache is commonly implemented as a rolling buffer. At each decoding step, only the new query Q is computed, while the K and V stored in the cache are reused, so attention is computed from the new Q together with the cached K and V. At the same time, the new token's K and V are appended to the cache for later use.
However, the speedup provided by the KV cache comes at a memory cost, since the cache grows with

batch size × sequence length × hidden size × number of heads,

which can become a memory bottleneck when the batch size is large or the sequence is long. This limitation motivates two techniques designed to address it: Multi-Query Attention (MQA) and Grouped-Query Attention (GQA).
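To illustrate the caching logic described above, here is a simplified, append-only cache sketch in PyTorch (a production rolling buffer would pre-allocate and evict entries; all names are illustrative):

```python
import torch

class KVCache:
    """Toy per-layer KV cache: stores K and V for all past tokens of one sequence."""
    def __init__(self):
        self.k = None  # shape: (B, n_h, T_cached, d_h)
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # Append the new token's K and V so later steps can reuse them.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

# At decoding step t we only compute q_t, k_t, v_t for the new token,
# then attend over the cached keys/values:
#   k_all, v_all = cache.update(k_t, v_t)
#   scores = q_t @ k_all.transpose(-2, -1) / math.sqrt(d_h)
```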
### Multi-Query Attention (MQA) vs. Grouped-Query Attention (GQA)
In MQA, a single key head and a single value head are shared across all query heads, which dramatically shrinks the KV cache but also degrades modeling quality. GQA can be viewed as an interpolation between MHA and MQA: each pair of key and value heads is shared by only a group of query heads rather than by all of them. Even so, GQA still yields worse results than MHA.
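To get a feel for the savings, here is a back-of-the-envelope calculation of the per-layer KV cache size under illustrative, assumed settings (the head counts and sequence length are not taken from any particular model); only the number of key/value heads changes across the three variants:

```python
# Assumed, illustrative settings (fp16 elements; not from any particular model)
batch, seq_len, d_h, bytes_per_el = 8, 4096, 128, 2
n_q_heads = 32

def kv_cache_bytes(n_kv_heads: int) -> int:
    # 2x for keys and values; per-layer figure (multiply by the number of layers for the full model)
    return 2 * batch * seq_len * n_kv_heads * d_h * bytes_per_el

print("MHA:", kv_cache_bytes(n_q_heads) / 2**30, "GiB per layer")  # 32 KV heads -> 0.5 GiB
print("GQA:", kv_cache_bytes(8) / 2**30, "GiB per layer")          # 8 KV groups -> 0.125 GiB
print("MQA:", kv_cache_bytes(1) / 2**30, "GiB per layer")          # 1 KV head  -> ~0.016 GiB
```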
In the later sections, we will see how MLA manages to balance memory efficiency and modeling accuracy.
### Rotary Position Embedding (RoPE)

The last piece of background we need is RoPE [11], which encodes position information directly into the attention mechanism by rotating the query and key vectors in multi-head attention with sinusoidal functions.
More specifically, RoPE applies a position-dependent rotation matrix to each token's query and key vectors, using sine and cosine functions as its basis but applying them in a distinctive way to achieve the rotation.
To understand what makes it position-dependent, consider a toy embedding vector with only 4 elements, i.e. (x_1, x_2, x_3, x_4). To apply RoPE, we first group consecutive dimensions into pairs: (x_1, x_2) forms the first pair and (x_3, x_4) the second. Then we apply a 2-D rotation matrix to each pair:

[ cos θ   −sin θ ]
[ sin θ    cos θ ]
where θ = θ(p) = p ⋅ θ_0, and θ_0 is a base frequency. In our 4-D toy example, this means (x_1, x_2) will be rotated by θ_0, while (x_3, x_4) will be rotated by 2 ⋅ θ_0.
This is why we call it a position-dependent rotation matrix: at each position (i.e., for each pair), a different rotation matrix is applied, with the rotation angle determined by the position. RoPE is widely used in modern large language models because of its efficiency in encoding long sequences, but, as the formula above makes clear, it is position-sensitive in both Q and K, which makes it incompatible with MLA in certain respects.
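A minimal sketch of the pairwise rotation described above (the pair grouping and the frequency schedule follow the common RoPE convention; function and variable names are illustrative):

```python
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, theta_base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to x of shape (..., seq_len, dim) given positions of shape (seq_len,)."""
    dim = x.shape[-1]
    # One frequency per pair of dimensions: theta_i = theta_base^(-2i/dim)
    freqs = theta_base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos[:, None].float() * freqs[None, :]       # (seq_len, dim/2), angle = position * frequency
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                   # split the vector into consecutive pairs
    # 2-D rotation of each pair by its position-dependent angle
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Usage: q = rope_rotate(q, torch.arange(seq_len)); k = rope_rotate(k, torch.arange(seq_len))
```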
## Multi-head Latent Attention (MLA)
### MLA: The High-Level Idea
The core idea of MLA is to compress the attention input h_t into a low-dimensional latent vector of dimension d_c, where d_c is much smaller than the original (n_h · d_h). Later, when we need to compute attention, we can map this latent vector back to the high-dimensional space to recover the keys and values. As a result, only the latent vector needs to be cached, which significantly reduces memory usage. This process can be described more formally by the following equations:

c^{KV}_t = W^{DKV} h_t
k^C_t = W^{UK} c^{KV}_t
v^C_t = W^{UV} c^{KV}_t

where c^{KV}_t is the latent vector, W^{DKV} is the compression matrix that maps h_t from its original dimension down to d_c (the D in the superscript stands for "down-projection", i.e., compression), and W^{UK} and W^{UV} are up-projection matrices that map the shared latent vector back to the high-dimensional space. Similarly, we can map the query into a low-dimensional latent vector and then map it back to the original high-dimensional space:

c^{Q}_t = W^{DQ} h_t
q^C_t = W^{UQ} c^{Q}_t
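To illustrate the compression idea (ignoring RoPE for now), a minimal PyTorch sketch could look like the following; dimensions and names are assumptions for illustration, not DeepSeek's implementation:

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Compress h_t into a small latent c_kv, then recover keys/values on the fly."""
    def __init__(self, d: int, d_c: int, n_h: int, d_h: int):
        super().__init__()
        self.W_dkv = nn.Linear(d, d_c, bias=False)          # W^{DKV}: down-projection (compression)
        self.W_uk = nn.Linear(d_c, n_h * d_h, bias=False)   # W^{UK}: up-projection for keys
        self.W_uv = nn.Linear(d_c, n_h * d_h, bias=False)   # W^{UV}: up-projection for values

    def forward(self, h: torch.Tensor):
        c_kv = self.W_dkv(h)   # only this latent vector needs to be cached
        k = self.W_uk(c_kv)    # keys recovered from the latent
        v = self.W_uv(c_kv)    # values recovered from the latent
        return c_kv, k, v
```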
### Why Decoupled RoPE Is Needed
As mentioned earlier, RoPE is a common choice for generative models that need to handle long sequences. However, if we apply the MLA strategy above directly, it becomes incompatible with RoPE.
To see this more clearly, consider what happens when we compute attention with the compressed queries and keys defined above: when we multiply the transposed q with k, the matrices W^Q and W^{UK} appear in the middle, and their product is equivalent to a single mapping from d_c to d. In the original paper [3], the authors describe this as W^{UK} being "absorbed" into W^Q, so we do not need to store W^{UK} explicitly, thereby further reducing memory usage.
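This "absorption" argument can be sanity-checked numerically: with no rotation matrix in between, q_t^T k_t = h_t^T (W^Q)^T W^{UK} c^{KV}_t, so the product (W^Q)^T W^{UK} can be precomputed once. A toy check with arbitrary (assumed) shapes:

```python
import torch

d, d_c, n = 64, 16, 64      # illustrative dimensions; n stands in for n_h * d_h
W_q  = torch.randn(n, d)    # W^Q: maps h (dim d) to queries (dim n)
W_uk = torch.randn(n, d_c)  # W^{UK}: maps the latent (dim d_c) to keys (dim n)
h    = torch.randn(d)
c_kv = torch.randn(d_c)

score_direct   = (W_q @ h) @ (W_uk @ c_kv)   # q_t^T k_t, computing q and k explicitly
W_absorbed     = W_q.T @ W_uk                # (W^Q)^T W^{UK}, shape (d, d_c), precomputable
score_absorbed = h @ (W_absorbed @ c_kv)     # same score without ever forming k_t

print(torch.allclose(score_direct, score_absorbed, atol=1e-3))  # True, up to floating-point error
```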
However, this is no longer the case once we take the RoPE rotation matrix into account, because RoPE applies a rotation matrix to the left of W^{UK}, and this rotation matrix ends up sitting between the transposed W^Q and W^{UK}.
As we explained in the background section, this rotation matrix is position-dependent, meaning it differs for every position. As a result, W^{UK} can no longer be absorbed into W^Q. To resolve this conflict, the authors propose what they call "decoupled RoPE": additional query vectors and a shared key vector are introduced, and these extra vectors are used only in the RoPE step, keeping the original keys isolated from the rotation matrix.
The entire MLA process can be summarized as follows (the equations correspond to Appendix C of [3]):

c^Q_t = W^{DQ} h_t
q^C_t = W^{UQ} c^Q_t = [q^C_{t,1}; …; q^C_{t,n_h}]
q^R_t = RoPE(W^{QR} c^Q_t) = [q^R_{t,1}; …; q^R_{t,n_h}]
q_{t,i} = [q^C_{t,i}; q^R_{t,i}]

c^{KV}_t = W^{DKV} h_t
k^C_t = W^{UK} c^{KV}_t = [k^C_{t,1}; …; k^C_{t,n_h}]
k^R_t = RoPE(W^{KR} h_t)
k_{t,i} = [k^C_{t,i}; k^R_t]
v^C_t = W^{UV} c^{KV}_t = [v^C_{t,1}; …; v^C_{t,n_h}]

o_{t,i} = Σ_{j=1}^{t} Softmax_j( (q_{t,i})^T k_{j,i} / √(d_h + d^R_h) ) · v^C_{j,i}
u_t = W^O [o_{t,1}; …; o_{t,n_h}]

where q^R_t and the shared k^R_t are the decoupled components that carry the RoPE rotation, and only the latent vector c^{KV}_t and the shared rotary key k^R_t need to be cached during inference.
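Finally, a condensed sketch of how the decoupled pieces fit together (heavily simplified; it reuses the rope_rotate helper from the RoPE sketch above, and all shapes and names are illustrative assumptions rather than DeepSeek's implementation):

```python
import torch
import torch.nn as nn

class MLADecoupledSketch(nn.Module):
    """Per token, only c_kv (dim d_c) and the shared rotary key k_r (dim d_r) need caching."""
    def __init__(self, d, d_c, d_cq, n_h, d_h, d_r):
        super().__init__()
        self.n_h, self.d_h, self.d_r = n_h, d_h, d_r
        self.W_dq = nn.Linear(d, d_cq, bias=False)           # W^{DQ}
        self.W_uq = nn.Linear(d_cq, n_h * d_h, bias=False)   # W^{UQ}
        self.W_qr = nn.Linear(d_cq, n_h * d_r, bias=False)   # W^{QR}: decoupled (RoPE) queries
        self.W_dkv = nn.Linear(d, d_c, bias=False)           # W^{DKV}
        self.W_uk = nn.Linear(d_c, n_h * d_h, bias=False)    # W^{UK}
        self.W_uv = nn.Linear(d_c, n_h * d_h, bias=False)    # W^{UV}
        self.W_kr = nn.Linear(d, d_r, bias=False)            # W^{KR}: shared decoupled (RoPE) key

    def forward(self, h, pos):
        B, T, _ = h.shape
        split = lambda x, dh: x.view(B, T, self.n_h, dh).transpose(1, 2)  # -> (B, n_h, T, dh)
        c_q = self.W_dq(h)
        q_c = split(self.W_uq(c_q), self.d_h)
        q_r = rope_rotate(split(self.W_qr(c_q), self.d_r), pos)           # RoPE only on the decoupled part
        c_kv = self.W_dkv(h)                                              # cached at inference time
        k_c = split(self.W_uk(c_kv), self.d_h)
        v = split(self.W_uv(c_kv), self.d_h)
        k_r = rope_rotate(self.W_kr(h), pos).unsqueeze(1).expand_as(q_r)  # cached; shared across heads
        q = torch.cat([q_c, q_r], dim=-1)   # q_{t,i} = [q^C_{t,i}; q^R_{t,i}]
        k = torch.cat([k_c, k_r], dim=-1)   # k_{t,i} = [k^C_{t,i}; k^R_t]
        return q, k, v, c_kv                # q, k, v feed into standard scaled dot-product attention
```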
### Performance of MLA
Interestingly, MLA's modeling capabilities even surpass those of the original MHA.
More specifically, the table below compares the performance of MHA, GQA and MQA on 7B-scale models, where MHA significantly outperforms both MQA and GQA.
The authors of [3] also compare MHA with MLA, and the results, summarized in the table below, show that MLA achieves better results overall.