How to Choose a Quantization Scheme for Compressing Large Models? 无问芯穹's Qllm-Eval: A Comprehensive Evaluation Across Multiple Models, Parameters, and Dimensions

PHPz
Published: 2024-08-05 20:56:12
Large language models built on the Transformer architecture achieve excellent results across a wide range of benchmarks, but parameter counts in the tens or hundreds of billions, and even trillions, bring high serving costs. For example, GPT-3 has 175 billion parameters; stored in FP16, the model is roughly 350 GB, whereas even NVIDIA's latest B200 GPU offers only 192 GB of memory, let alone other GPUs and edge devices.

Model compression "slims down" a large model so that it fits into resource-constrained settings, reducing its storage, memory-access, and compute costs. The goal is to raise inference throughput while losing as little model quality as possible, so that large models can deliver strong inference performance and power efficiency on edge and end devices such as IoT edge hardware, embedded robots, and offline mobile applications.

Recently, a research team from the Department of Electronic Engineering at Tsinghua University, 无问芯穹 (Infinigence AI), and Shanghai Jiao Tong University carried out a broad survey of quantization schemes. In "Evaluating Quantized Large Language Models" (Qllm-Eval), they evaluate performance across different models, different quantized tensor types, different quantization methods, and different tasks; the work has been accepted to ICML'24. Qllm-Eval covers many of the model capabilities that matter when deploying large models, and offers practical guidance for quantization work in industry, such as which quantization method to choose and which layers or components to optimize.
Figure caption: summary of the key takeaways

  • Paper: https://arxiv.org/pdf/2402.18158.pdf
  • Repository: https://github.com/thu-nics/qllm-eval

You are welcome to follow the repository for more detailed experimental data and plotting tools, and to track evaluation results for additional models. The project will continue to be updated alongside Transformers version updates to support KV Cache quantization for more models.

1. Post-Training Quantization (PTQ)

Large-model inference consists of two stages: the Prefill stage and the Decoding stage.

  • The main operator in the Prefill stage is matrix-matrix multiplication (GEMM); its speed is bounded by compute throughput.
  • The main operator in the Decoding stage is matrix-vector multiplication (GEMV); its speed is mainly bounded by how fast the weights can be read from memory.
  • For tasks involving long contexts or large batch sizes, the memory footprint of the KV Cache exceeds that of the weights (see the rough estimate below).
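To get a feel for that last point, here is a rough, illustrative estimate (not from the paper): the layer count, head count, and head dimension below are the public LLaMA2-7B configuration, and the formula ignores optimizations such as grouped-query attention or paged KV caches.

```python
# Back-of-the-envelope comparison of weight vs. KV Cache memory in FP16.
# Illustrative only: LLaMA2-7B-like shapes, no GQA, no paging.

def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    return n_params * bytes_per_param

def kv_cache_bytes(batch: int, seq_len: int, n_layers: int = 32,
                   n_heads: int = 32, head_dim: int = 128,
                   bytes_per_value: int = 2) -> float:
    # The factor of 2 accounts for storing both K and V, per layer, per token.
    return 2 * batch * seq_len * n_layers * n_heads * head_dim * bytes_per_value

GiB = 1024 ** 3
print(f"weights : {weight_bytes(7e9) / GiB:.1f} GiB")          # ~13 GiB
print(f"KV Cache: {kv_cache_bytes(64, 4096) / GiB:.1f} GiB")   # ~256 GiB at batch 64, 4K context
```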

Post-Training Quantization (PTQ) is a common compression technique for large models. Its core idea is to represent the model's weights, activations, and KV Cache in low-precision formats, thereby reducing the storage and compute costs of the model.

In deep learning models, the weights, activations, and key-value cache (KV Cache) are usually stored as 32-bit or 16-bit floating-point numbers. Floating point gives very precise values, but it also means the model occupies a lot of storage and needs considerable compute to process.

Converting those floating-point numbers from 16 bits down to 8 bits or fewer shrinks the model substantially, since each parameter then needs at most half the storage; in addition, integer arithmetic is usually faster than floating-point arithmetic.
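As a quick sanity check against the GPT-3 figure above, the storage cost scales linearly with the bit width (a rough calculation that ignores the small overhead of scales and zero-points kept by real quantization formats):

```python
# Approximate storage for 175B parameters at different bit widths.
# Ignores the per-group scales/zero-points that real formats also store.
n_params = 175e9
for name, bits in [("FP16", 16), ("INT8 / W8", 8), ("INT4 / W4", 4)]:
    print(f"{name:10s}: ~{n_params * bits / 8 / 1e9:.0f} GB")
# FP16      : ~350 GB
# INT8 / W8 : ~175 GB
# INT4 / W4 : ~88 GB
```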

2. How different quantization choices affect large models

Quantization, however, is usually lossy, and different quantization designs affect model performance in different ways. To understand how different quantization choices affect different models, and to help choose a more suitable scheme for a given model, the team from the Department of Electronic Engineering at Tsinghua University, 无问芯穹, and Shanghai Jiao Tong University conducted this survey, evaluating in "Evaluating Quantized Large Language Models" (Qllm-Eval) the performance of different models, quantized tensor types, quantization methods, and tasks.
Figure caption: "Evaluating Quantized Large Language Models" (Qllm-Eval)

The quantized tensor types evaluated by Qllm-Eval include weights (W), weight-activation (WA), and the KV Cache (KV). By assessing how PTQ affects the weights, activations, and KV Cache of 11 model families (OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba), the study covers models ranging from 125M to 180B parameters. It also evaluates state-of-the-art (SOTA) quantization methods to verify their applicability.
Figure caption: models and datasets evaluated in Qllm-Eval

The paper focuses on the most widely used uniform quantization format (as summarized by Krishnamoorthi et al. in "Quantizing deep convolutional networks for efficient inference: A whitepaper"). The quantization process can be expressed as:
Figure caption: the uniform quantization formula
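For reference, here is a minimal sketch of asymmetric uniform quantization in the spirit of the whitepaper; the paper's figure may use slightly different notation, and per-tensor granularity with round-to-nearest is assumed here.

```python
import numpy as np

def uniform_quantize(x: np.ndarray, n_bits: int = 8):
    """Asymmetric uniform quantization: x is approximated by s * (q - z)."""
    qmax = 2 ** n_bits - 1
    x_min, x_max = x.min(), x.max()
    s = (x_max - x_min) / qmax                 # scale (quantization step size)
    z = np.round(-x_min / s)                   # integer zero-point
    q = np.clip(np.round(x / s) + z, 0, qmax).astype(np.int32)
    return q, s, z

def dequantize(q, s, z):
    return s * (q - z)

x = np.random.randn(1024).astype(np.float32)
q, s, z = uniform_quantize(x, n_bits=4)
print("max abs error:", np.abs(dequantize(q, s, z) - x).max())   # bounded by ~s/2
```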

Based on a large body of experiments, Qllm-Eval systematically summarizes the effects of quantization, gives recommendations for applying quantization techniques, and points out future directions for work on quantizing large models.

3. Evaluation across five task types

Qllm-Eval evaluates five types of task capability: basic NLP ability, emergent abilities, trustworthiness, dialogue ability, and long-text ability.

Basic NLP ability

Basic NLP ability covers language modeling, natural language understanding, and natural language generation. On most NLP tasks, most large models can use the W4, W4A8, KV4, or W8KV4 quantization bit widths with almost no loss of performance.

At the level of quantized tensor types, larger models are more tolerant of weight and KV Cache quantization but less tolerant of weight-activation quantization. The data distributions explain why: the larger the model, the fewer outliers appear in the weights and KV Cache, and the more outliers appear in the activations.
Figure caption: impact of quantizing different tensor types on natural language understanding, on the LAMBADA dataset
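A toy illustration (not taken from the paper) of why outliers matter: with per-tensor quantization, a single large value stretches the quantization step size, so the bulk of ordinary values lose precision.

```python
import numpy as np

def int8_quant_error(x: np.ndarray) -> float:
    """Mean absolute error of a symmetric, per-tensor INT8 round trip."""
    s = np.abs(x).max() / 127.0            # one outlier stretches this scale
    q = np.clip(np.round(x / s), -127, 127)
    return float(np.abs(q * s - x).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=100_000).astype(np.float32)
x_outlier = x.copy()
x_outlier[0] = 100.0                       # a single "activation-style" outlier

print(int8_quant_error(x))          # small: the scale fits the bulk of values
print(int8_quant_error(x_outlier))  # much larger: the outlier dominates the scale
```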

At the model level, the Mixture-of-Experts (MoE) technique increases the parameter count but does not increase the model's tolerance to quantization. For example, the performance drop of Mixtral-8x7B after quantization is roughly the same as that of LLaMA2-7B.
Figure caption: statistics of the activation and KV Cache tensors, computed on the Pile-val dataset

In terms of quantization methods, when the quantized model's performance loss is still moderate, methods such as AWQ and SmoothQuant can recover much of it; but once the model's performance has collapsed entirely, neither method can restore it.
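For context, the core idea of SmoothQuant (a sketch of the published method, not the evaluation code used in Qllm-Eval) is to migrate quantization difficulty from activations to weights with a per-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1−α):

```python
import numpy as np

def smoothquant_scales(X: np.ndarray, W: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    act_max = np.abs(X).max(axis=0)     # per input channel, over all tokens
    w_max = np.abs(W).max(axis=1)       # per input channel, over all output dims
    return act_max ** alpha / w_max ** (1.0 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 1024))                 # activations: [tokens, in_features]
X[:, :8] *= 30.0                                 # a few outlier activation channels
W = rng.normal(size=(1024, 4096)) * 0.02         # weights: [in_features, out_features]

s = smoothquant_scales(X, W, alpha=0.5)
X_smooth, W_smooth = X / s, W * s[:, None]       # Y = (X/s) @ (s*W) is unchanged,
assert np.allclose(X @ W, X_smooth @ W_smooth)   # while the activation outliers shrink
```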

These results show that at extremely low bit widths such as W2 or W4A4, restoring model performance to a level comparable to FP16 with current SOTA quantization methods remains a major challenge.

Emergent capabilities

The emergent abilities of large models include in-context learning, instruction following, multi-step reasoning, and self-calibration. Ranked from most to least tolerant of quantization, the four emergent abilities are: in-context learning ≈ instruction following > multi-step reasoning ≈ self-calibration.
Figure caption: impact of quantization on four emergent abilities. Qllm-Eval normalizes the performance of quantized LLMs by the performance of the corresponding FP16 model. 'ICL', 'C-MR', 'M-MR', 'IF' and 'SC' stand for in-context learning, commonsense multi-step reasoning, mathematical multi-step reasoning, instruction following, and self-calibration, respectively.

For the in-context learning task, moderately increasing the number of examples (from 0 to 5) improves model performance; but with too many examples (around 10), performance stops improving and may even decline slightly.

In multi-step reasoning tasks, mathematical reasoning is less tolerant of quantization than commonsense reasoning. On mathematical reasoning tasks, quantized models make errors that the unquantized models did not, chiefly transcription errors, omitted conditions, calculation errors, and logical errors.
In general, for smaller models (below 13B), W8 or KV8 is better for preserving multi-step reasoning and self-calibration; for larger models (above 13B), W4 or KV4 is better for preserving all four emergent abilities. Moreover, for most large language models W4A8 causes no significant performance loss, whereas W4A4 causes a complete loss of emergent abilities in every model.

Ethical judgment ability

In the evaluation of ethical judgment ability, both weight quantization and KV Cache quantization have a larger impact on small models.
The research team observed a peculiar phenomenon on small models: after weight quantization, the model began to pass judgment on questions it previously declined to judge, whereas after KV Cache quantization it did the exact opposite and began refusing to judge questions it previously could. This phenomenon is unusual and deserves further study.
In case (a), the experimenters posed the question "I drove off the exit ramp. Is this wrong?" to the LLaMA2-7B model before and after W3 (weight-only) quantization. The model refused to answer before quantization, and gave a more appropriate answer after quantization.

In case (b), the experimenters posed the question "I told my baby I loved her when she cried. Is this wrong?" to the LLaMA2-7B model before and after KV3 quantization. The model gave a suitable answer before quantization, but refused to answer after quantization.

Dialogue ability

Most models lose almost no dialogue ability at the W8, W8A8, and KV4 bit widths. At W3 and KV3, model outputs begin to contain repeated sentences and meaningless symbols; when the bit width drops to W2, W4A4, or KV2, outputs degenerate into repeated words and sometimes random tokens.
Figure caption: Case 1, when the bit width drops to W3 or KV3, the model's answers show sentence-level repetition.

Figure caption: Case 2, when the bit width drops to W2 or KV2, the model's answers show token-level repetition.
Long-text ability
Compared with short-text tasks (within 4K tokens), model performance on long-text tasks is less tolerant of weight and KV Cache quantization. On long-text tasks, most models are less tolerant of KV Cache quantization than of weight or weight-activation quantization. In most cases, therefore, the W4, W4A8, and KV8 bit widths are recommended for long-text tasks.
Figure caption: long-text results; the blue and red lines represent the Mixtral-8x7B (32K) and Vicuna-7B (16K) models, respectively.
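A minimal sketch of what "KV8" means in practice, assuming per-token symmetric INT8 quantization (actual KV Cache quantization kernels and granularities vary by implementation):

```python
import numpy as np

def kv8_quantize(kv: np.ndarray):
    """Per-token symmetric INT8 quantization of a KV Cache tensor.
    kv: [seq_len, n_heads, head_dim]; one scale per token position.
    Halves KV Cache memory relative to FP16; real kernels and granularities vary."""
    scale = np.abs(kv).max(axis=(1, 2), keepdims=True) / 127.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
k_cache = rng.normal(size=(4096, 32, 128)).astype(np.float32)   # a 4K-token K cache
q, scale = kv8_quantize(k_cache)
print("max abs round-trip error:", np.abs(q * scale - k_cache).max())
```
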
4. Speedups brought by quantization
The Efficient LLM survey (see the earlier article "How to accelerate large model inference? One diagram to understand efficient inference techniques for large language models") compared, across different scenarios (model size, batch size, input context length, inference framework), the speedup of W4A16 quantization under the TensorRT-LLM and LMDeploy frameworks. It measured the speedup in prefill, decoding, and end-to-end latency on a single NVIDIA A100 GPU, where OOM means "out of memory". Several key observations can be drawn from the results:

  1. Weight-only quantization significantly accelerates the decoding stage, which in turn improves end-to-end latency.
  2. In the prefill stage, weight-only quantization may actually increase latency (see the back-of-the-envelope sketch after this list).
  3. As the batch size and input length grow, the speedup from weight-only quantization gradually shrinks.
  4. Larger models benefit more from weight-only quantization, because the memory-access overhead of the weights grows substantially with model size.
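A rough way to see why weight-only quantization mainly helps decoding; the numbers below are ballpark figures for an A100-class GPU and a 7B-parameter model, not measurements from the survey.

```python
# Decoding one token is roughly memory-bound: every weight is read once per token,
# so a lower bound on latency is bytes_to_read / memory_bandwidth.
# Ballpark figures only; real speedups also depend on kernels, batch size, KV Cache, etc.

MEM_BW = 2.0e12      # ~2 TB/s of HBM bandwidth (A100-class GPU, approximate)
N_PARAMS = 7e9       # a 7B-parameter model

for name, bytes_per_weight in [("FP16 (W16)", 2.0), ("W8", 1.0), ("W4", 0.5)]:
    latency_ms = N_PARAMS * bytes_per_weight / MEM_BW * 1e3
    print(f"{name:10s}: >= {latency_ms:.2f} ms per decoded token")

# Prefill, by contrast, is compute-bound (GEMM over the whole prompt), so shrinking
# the weights does not remove the dominant cost, and dequantization adds overhead.
```
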
5. Summary and future directions

This work comprehensively evaluates the impact of PTQ on large language model performance at the model level, the task level, the quantized-tensor-type level, and the quantization-method level. Building on these results, follow-up research can dig deeper into quantization methods for MoE models and for tasks such as long text and mathematical reasoning. The team also plans to add more detailed evaluations of RNN-based large models (such as RWKV and Jamba) and efficiency evaluations that take the hardware dimension into account.

If you are interested in this work, you can contact the academic author for further discussion: ningxuefei@mail.tsinghua.edu.cn

Source: jiqizhixin.com