
Linearizing Llama


This article explores replacing the softmax self-attention in the Llama-3.2-1B language model with a hybrid mechanism that combines sliding-window softmax attention with linear attention. The goal is to improve inference speed without a significant loss of accuracy, reducing the cost of running large language models.

The project is based on the research in "LoLCATs: On Low-Rank Linearizing of Large Language Models," "An Empirical Study of Mamba-based Language Models," and "Linearizing Attention." It focuses on replacing 50% of the self-attention layers in a pre-trained Llama model.

The process is divided into four parts:

  • Hybrid Attention Block: This part builds a custom attention block that combines sliding-window and linear attention, using learnable factors to balance their contributions. Softmax attention is restricted to a fixed-size causal window over the most recent tokens, which keeps that computation efficient, while linear attention handles the earlier tokens outside the window (a minimal sketch of such a block is given after this list).

  • Attention Transfer: This stage follows the "LoLCATs" methodology. The weights of the original Llama attention blocks are used to initialize the hybrid blocks, which are then trained to mimic the originals: for each training input, the outputs of the original and hybrid blocks are compared with an MSE loss, and only the hybrid blocks are updated (see the training-loop sketch after this list).

  • LoRA Finetuning: Low-Rank Adaptation (LoRA) is used to fine-tune the hybrid attention blocks inside the full Llama model. Only the adapter parameters of the hybrid blocks are trained while the remaining parameters stay frozen, and the model is optimized for text generation on the Dolly-15K dataset (a minimal configuration sketch appears after this list).

  • Evaluation: The hybrid model is benchmarked against the original Llama-3.2-1B, measuring inference speed (tokens per second and seconds per token) and accuracy on the MMLU benchmark (a simple timing sketch follows this list).
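The following is a minimal, single-head sketch of a hybrid attention block in the spirit described above, not the project's actual implementation: `window_size`, the mixing factors, and the ELU-based feature map are illustrative choices, and Llama details such as multi-head/grouped-query attention and rotary embeddings are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridAttention(nn.Module):
    """Sketch: sliding-window softmax attention mixed with linear attention."""

    def __init__(self, dim: int, window_size: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        self.window_size = window_size
        # Learnable scalars that balance the two attention branches.
        self.window_factor = nn.Parameter(torch.tensor(0.5))
        self.linear_factor = nn.Parameter(torch.tensor(0.5))

    def feature_map(self, x):
        # Simple positive feature map so linear-attention weights stay non-negative.
        return F.elu(x) + 1

    def forward(self, x):
        # x: (batch, seq_len, dim)
        _, t, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Softmax attention restricted to a causal sliding window.
        scores = q @ k.transpose(-2, -1) / d ** 0.5            # (batch, t, t)
        idx = torch.arange(t, device=x.device)
        causal = idx[None, :] <= idx[:, None]                   # j <= i
        in_window = idx[:, None] - idx[None, :] < self.window_size
        scores = scores.masked_fill(~(causal & in_window), float("-inf"))
        window_out = torch.softmax(scores, dim=-1) @ v

        # Linear attention over the tokens that fall outside the window.
        phi_q, phi_k = self.feature_map(q), self.feature_map(k)
        outside = causal & ~in_window
        # O(t^2) masked form for clarity; the efficiency gain comes from the
        # recurrent/chunked form, which avoids materializing the t x t matrix.
        lin_scores = (phi_q @ phi_k.transpose(-2, -1)) * outside
        norm = lin_scores.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        linear_out = (lin_scores / norm) @ v

        out = self.window_factor * window_out + self.linear_factor * linear_out
        return self.o_proj(out)
```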
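A hedged sketch of the attention-transfer loop follows: each hybrid block is trained with an MSE loss to reproduce the output of the frozen original block it replaces. The names `teacher_layers`, `student_layers`, and `hidden_batches` are placeholders, and real Llama attention modules take additional arguments (attention masks, position embeddings) that are omitted here.

```python
import torch
import torch.nn.functional as F

def attention_transfer(teacher_layers, student_layers, hidden_batches,
                       lr: float = 1e-3, steps: int = 1000):
    """Train hybrid (student) blocks to mimic the frozen original (teacher) blocks."""
    params = [p for blk in student_layers for p in blk.parameters()]
    opt = torch.optim.AdamW(params, lr=lr)

    for _ in range(steps):
        hidden = next(hidden_batches)          # hidden states feeding the attention layers
        loss = 0.0
        for teacher, student in zip(teacher_layers, student_layers):
            with torch.no_grad():
                target = teacher(hidden)       # original softmax-attention output
            loss = loss + F.mse_loss(student(hidden), target)

        opt.zero_grad()
        loss.backward()
        opt.step()
```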
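For the LoRA stage, a minimal configuration along these lines could be built with Hugging Face's peft library; the rank, alpha, dropout, and target modules below are common defaults rather than the project's exact settings, and the plain base model is shown where the project would use the version with swapped-in hybrid blocks.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed starting point: Llama-3.2-1B after its attention layers have been
# replaced by the hybrid blocks (the plain base model is loaded here for brevity).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

lora_config = LoraConfig(
    r=8,                        # low-rank dimension
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the low-rank adapters are trainable
```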
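A rough tokens-per-second measurement for the speed comparison could look like the sketch below, assuming a Hugging Face-style model and tokenizer on a CUDA device; as the article notes, such numbers depend heavily on the GPU used.

```python
import time
import torch

def tokens_per_second(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> float:
    """Time greedy generation and return generated tokens per second."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    torch.cuda.synchronize()
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / elapsed
```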

The results show that the hybrid model delivers significant speed improvements, particularly for longer sequences, while maintaining comparable accuracy on the MMLU benchmark. However, the study also highlights that the choice of GPU hardware has a significant effect on both speed and accuracy measurements, and it suggests further research into how different hardware affects benchmark results.

The conclusion emphasizes the potential of hybrid attention mechanisms as a cost-effective way to improve LLM inference speed. The study also notes the need for further optimization of linear attention architectures and the importance of accounting for hardware limitations when evaluating model performance. The code for this project is available in the Linearizing-Llama-3.2-1B repository.

License References:

[1] fineweb-edu: ODC-By v1.0
[2] Dolly-15K: CC BY-SA 3.0
[3] MMLU: MIT license
