As deep-learning large language models grow more popular, they also grow larger, and their inference costs rise accordingly. Model quantization has therefore become a popular research topic.
Recently, ByteDance proposed a new approach to quantization that abandons the traditional quantization paradigm and instead models the quantization task as a mathematical optimization problem, finding the solution by minimizing a loss function. The paper is posted on arXiv, the code has been open sourced, and all results in the paper can be reproduced with one click. The approach achieves good results in experiments.
Paper link: https://arxiv.org/abs/2404.12759
Project link: https://github.com/bytedance/decoupleQ
W2 operator: https://github.com/NVIDIA/TensorRT-LLM/pull/1568
1. Background
The rapid development of large models has made inference increasingly expensive. Model quantization, as a technique for reducing inference costs, has received more and more attention. Under the traditional quantization paradigm, however, model accuracy drops rapidly at very low bit widths. For this reason, the authors propose a new approach: decouple the model parameters into an integer part and a floating-point part, and model the quantization task as a mathematical optimization problem, so that the model keeps a higher accuracy. The advantage is clear. We no longer need to focus on quantization-specific issues such as how to handle sensitive channels or outliers. Instead, we only need to model the quantization problem mathematically, find a suitable objective function, and then solve it.
2. Traditional quantization
Traditionally, a model is quantized as follows:

$$\hat{W} = \mathrm{clip}\left(\left\lfloor \frac{W_0 - z}{s} \right\rceil,\ \alpha,\ \beta\right) \tag{1}$$

where $W_0$ is the floating-point weight of the model before quantization; $s$ and $z$ are the coefficients of a linear transformation, namely the scale and the zero point; $\alpha$ and $\beta$ are the lower and upper bounds of the integer range (for int4 quantization, $\alpha = -8$ and $\beta = 7$); and $\lfloor \cdot \rceil$ is the rounding function, which generally rounds to the nearest integer.

For asymmetric quantization, $s$ and $z$ are generally chosen as:

$$s = \frac{\max(W_0) - \min(W_0)}{\beta - \alpha}, \qquad z = \min(W_0) - s\,\alpha \tag{2}$$

In this way, floating-point weights distributed in $[\min(W_0), \max(W_0)]$ are linearly mapped to the integer range $[\alpha, \beta]$.
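The min-max scheme just described can be sketched in a few lines of plain Python. This is only an illustration: the function names `quantize_minmax` and `dequantize` and the sample weights are made up here, and dequantization is assumed to be the usual affine map (integer times scale plus zero point).

```python
def quantize_minmax(weights, alpha=-8, beta=7):
    """Asymmetric min-max quantization of floats to integers in [alpha, beta]."""
    w_min, w_max = min(weights), max(weights)
    # The scale maps the float range onto the integer range.
    # Fall back to 1.0 when all weights are equal (range of zero).
    s = (w_max - w_min) / (beta - alpha) or 1.0
    # The zero point is chosen so that w_min maps to alpha.
    z = w_min - s * alpha
    # Python's round() rounds ties to the nearest even integer.
    q = [min(max(round((w - z) / s), alpha), beta) for w in weights]
    return q, s, z


def dequantize(q, s, z):
    """Affine dequantization: integer part times scale plus zero point."""
    return [qi * s + z for qi in q]


weights = [-0.62, -0.11, 0.0, 0.27, 0.93]   # made-up example weights
q, s, z = quantize_minmax(weights)           # int4 range by default
restored = dequantize(q, s, z)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(s, 4), round(z, 4), round(max_err, 4))
```

With min-max scaling nothing is clipped, so the reconstruction error of every weight is at most half a quantization step, `s / 2`.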
Dequantization generally uses the following formula:

$$\widetilde{W} = \hat{W} \cdot s + z \tag{3}$$

where $\hat{W}$ is the quantized integer weight and $(s, z)$ are the scale and zero point.

In this traditional scheme, we need to pay attention to many details that are unique to quantization: for sensitive channels there are sensitive-channel handling methods, and for outliers there are outlier handling methods. This piecemeal approach of treating each symptom separately struggles to cope with complex and ever-changing business scenarios. The ByteDance researchers try to abstract these issues away and look at quantization from a macro perspective: we only need to establish an abstract objective function and then solve it.

3. decoupleQ
Looking at the role of equations (1) to (3) in quantization from a different angle, we find that we do not actually need equations (1) and (2). After a large model has been quantized and handed over to the downstream inference-engine team, only $\hat{W}$ and $(s, z)$ in equation (3) are needed. In other words, $(s, z)$ in equation (3) can be regarded as the coefficients of an ordinary affine transformation, and there is no need to keep the meaning they have in equation (2). The affine coefficients can be obtained by mathematical optimization.
Exploring equation (3) further, we can decouple the parameters of a large model into an integer part $\hat{W}$ and a floating-point part $(s, z)$. After this decoupling, model quantization can be viewed as the process of solving for the integer part $\hat{W}$ and the floating-point part $(s, z)$, which can be optimized alternately. To do so, the objective function and its constraints must be determined.
For a linear layer, we can construct the following optimization objective:

$$\min_{\hat{W},\, s,\, z}\ \left\| X\widetilde{W} - XW_0 \right\|_F^2, \qquad \widetilde{W} = \hat{W} \cdot s + z \tag{4}$$

where $X$ is the input of the layer. Expanding the norm gives a quadratic form in $H = X^{T}X$, which is a symmetric positive semidefinite matrix (it is positive definite when the columns of $X$ are linearly independent).
To improve quantization accuracy, per-channel quantization is generally used for the model weights. With per-channel quantization, each column of $\widetilde{W}$ is optimized independently when solving equation (4), so we only need to focus on one column.
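The optimization problem can then be written as follows (for simplicity of notation, the symbols are redefined here):

$$\min_{w;\, s, z}\ g(w; s, z) \quad \text{s.t.} \quad w_i - \mathrm{round}(w_i) = 0, \quad \alpha \le w_i \le \beta, \quad i = 1, \dots, n \tag{5}$$

where the objective function is

$$g(w; s, z) = \frac{1}{2}\,(w \cdot s + z - b)^{T} H\, (w \cdot s + z - b) \tag{6}$$

Here $w$ is a column of $\hat{W}$, $b$ is the corresponding column of $W_0$, and $s$ and $z$ are applied elementwise. The other symbols are defined as before.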
In fact, the objective functions (6) and (4) are consistent (they differ only by a constant factor and share the same minimizers); $w \cdot s + z$ is the dequantization step.
Converting the quantization problem into a mathematical optimization problem of the form (5) is the key difference between decoupleQ and traditional quantization papers. With this transformation, we only need to focus on solving problem (5) and no longer need to deal with the minutiae of quantization itself, such as outliers.
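The consistency of the two objectives can be checked numerically. The sketch below assumes $H = X^{T}X$ and a quadratic form written with a factor of 1/2, so the layer-output error equals twice the quadratic form; the matrix `X`, the column `b`, and the values of `w`, `s`, `z` are made-up numbers.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gram(X):
    """H = X^T X for X given as a list of rows."""
    d = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(d)]
            for i in range(d)]


X = [[1.0, 2.0, 0.5],
     [0.0, -1.0, 3.0],
     [2.0, 0.5, 1.0],
     [1.5, 1.0, -2.0]]          # layer input: 4 samples, 3 features
b = [0.30, -0.70, 0.15]         # one column of the float weights W_0
w = [1, -2, 0]                  # integer part of the same column
s, z = 0.3, 0.05                # floating-point part

w_tilde = [wi * s + z for wi in w]               # dequantized column
diff = [a - c for a, c in zip(w_tilde, b)]

# Layer-output form: || X w_tilde - X b ||^2
out_err = sum(r * r for r in matvec(X, diff))

# Quadratic form: 1/2 * diff^T H diff
H = gram(X)
g = 0.5 * sum(d * h for d, h in zip(diff, matvec(H, diff)))

print(out_err, 2 * g)   # the two numbers agree up to rounding error
```

Working with `H` instead of `X` means the calibration inputs only need to be accumulated once into a small square matrix, whatever the number of samples.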
Problem (5) is not easy to solve because of the constraints on $w$, especially the non-convex constraint $w_i - \mathrm{round}(w_i) = 0$. The paper proposes an alternating scheme: after obtaining a good initialization of $(s, z)$ and $w$, solve for $(s, z)$ and $w$ alternately and iteratively. When solving for $(s, z)$, note that the objective in (5) is an unconstrained quadratic function of $(s, z)$, so the analytical solution is obtained directly by differentiating the objective and setting the derivative to zero. When solving for $w$, the authors use two levels of approximation. The first level converges better but is slow to solve; the second level borrows the idea of GPTQ [1], which converges slightly worse but is faster.
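The alternating scheme can be illustrated on a toy column. The sketch below is not the paper's solver. The $(s, z)$ step is the closed-form solution of the 2x2 linear system from setting the gradient of $\frac{1}{2}(w s + z - b)^{T} H (w s + z - b)$ to zero, assuming a single scalar $s$ and $z$ for the column. The $w$ step is a simple coordinate search over the allowed integers, used here as a stand-in for the paper's two approximations. The matrix `H`, the column `b`, and all function names are made up for illustration.

```python
def matvec(H, v):
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def loss(w, s, z, b, H):
    d = [wi * s + z - bi for wi, bi in zip(w, b)]
    return 0.5 * dot(d, matvec(H, d))


def solve_sz(w, b, H):
    """Closed-form (s, z) for fixed w; None if the 2x2 system is singular."""
    ones = [1.0] * len(w)
    Hw, H1, Hb = matvec(H, w), matvec(H, ones), matvec(H, b)
    a11, a12, a22 = dot(w, Hw), dot(w, H1), dot(ones, H1)
    r1, r2 = dot(w, Hb), dot(ones, Hb)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:          # e.g. all entries of w are equal
        return None
    return (r1 * a22 - r2 * a12) / det, (a11 * r2 - a12 * r1) / det


def update_w(w, s, z, b, H, alpha, beta):
    """Stand-in w step: per-coordinate search over the integers in [alpha, beta]."""
    w = list(w)
    for i in range(len(w)):
        w[i] = min(range(alpha, beta + 1),
                   key=lambda c: loss(w[:i] + [c] + w[i + 1:], s, z, b, H))
    return w


def alternate(b, H, alpha=-2, beta=1, iters=5):
    # Initialize with min-max quantization (2-bit range by default).
    lo, hi = min(b), max(b)
    s = (hi - lo) / (beta - alpha) or 1.0
    z = lo - s * alpha
    w = [min(max(round((bi - z) / s), alpha), beta) for bi in b]
    history = [loss(w, s, z, b, H)]
    for _ in range(iters):
        sz = solve_sz(w, b, H)
        if sz is not None:
            s, z = sz
        w = update_w(w, s, z, b, H, alpha, beta)
        history.append(loss(w, s, z, b, H))
    return w, s, z, history


# Made-up symmetric positive definite H (stands in for X^T X) and float column b.
H = [[4.0, 1.0, 0.5, 0.0],
     [1.0, 3.0, 0.2, 0.4],
     [0.5, 0.2, 2.0, 0.3],
     [0.0, 0.4, 0.3, 1.5]]
b = [0.41, -0.35, 0.12, -0.08]
w, s, z, history = alternate(b, H)
print(w, s, z, history)
```

Both steps can only keep or lower the objective, so the recorded loss is non-increasing. The coordinate search is exhaustive per coordinate, which is affordable at 2 bits but is not how the paper scales to real layers.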
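To further improve the accuracy of the quantized model, the authors point out that, in addition to minimizing the MSE at the layer level, the MSE can also be minimized at the block level, that is:

$$\min\ \left\| \widetilde{\mathrm{Block}}(X) - \mathrm{Block}(X) \right\|_2^2 \tag{7}$$

where $\mathrm{Block}$ is the floating-point transformer block and $\widetilde{\mathrm{Block}}$ is its quantized counterpart.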
In this step, after every linear layer in a transformer block has been quantized, the authors fix the integer parts and fine-tune the floating-point parts $(s, z)$ together with the layer-norm parameters. Experimental results show that this fine-tuning step further improves model accuracy.
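The idea of freezing the integer part and fine-tuning only $(s, z)$ against the floating-point output can be sketched with plain gradient descent. The toy "block" below is a single linear unit followed by ReLU, with one scalar scale and zero point; layer-norm parameters are omitted. All data, names, and hyperparameters are made up for illustration and are not taken from the paper.

```python
def relu(v):
    return v if v > 0.0 else 0.0


def block_out(x, weights):
    """Toy 'block': one linear unit followed by ReLU."""
    return relu(sum(xi * wi for xi, wi in zip(x, weights)))


def mse(X, targets, w_int, s, z):
    preds = [block_out(x, [wi * s + z for wi in w_int]) for x in X]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(X)


def finetune_sz(X, targets, w_int, s, z, lr=0.01, steps=200):
    """Gradient descent on (s, z) only; the integer weights stay frozen."""
    for _ in range(steps):
        gs = gz = 0.0
        for x, t in zip(X, targets):
            pre = sum(xi * (wi * s + z) for xi, wi in zip(x, w_int))
            if pre <= 0.0:
                continue                      # ReLU inactive: zero gradient
            err = 2.0 * (pre - t) / len(X)
            gs += err * sum(xi * wi for xi, wi in zip(x, w_int))  # d pre / d s
            gz += err * sum(x)                                    # d pre / d z
        s, z = s - lr * gs, z - lr * gz
    return s, z


X = [[1.0, 0.5, 0.2], [0.3, 1.2, 0.7], [0.8, 0.1, 1.5], [1.1, 0.9, 0.4]]
w_float = [0.42, 0.10, 0.27]                     # float weights of the toy block
targets = [block_out(x, w_float) for x in X]     # float block output

# 2-bit min-max initialization (integers in [-2, 1]).
alpha, beta = -2, 1
s0 = (max(w_float) - min(w_float)) / (beta - alpha)
z0 = min(w_float) - s0 * alpha
w_int = [min(max(round((w - z0) / s0), alpha), beta) for w in w_float]

before = mse(X, targets, w_int, s0, z0)
s1, z1 = finetune_sz(X, targets, w_int, s0, z0)
after = mse(X, targets, w_int, s1, z1)
print(w_int, before, after)
```

Only two floating-point numbers are trained here, which is why this kind of fine-tuning is cheap compared with re-quantizing the integer part.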
4. W2 operator implementation
Running inference on the quantized model requires quantized operators. Since no ready-made w2a16 operator is available in the industry, the authors developed a w2 GEMM CUDA kernel based on the w4 operator in TensorRT-LLM, enabling efficient inference of w2a16 models.
The quantized model is loaded and stored in GPU memory as 2-bit weights, so it occupies relatively little memory. At runtime the CUDA kernel loads the 2-bit weights into registers, uses hardware instructions to convert them efficiently to bf16, and then performs the GEMM with the activations. Because the target scenario is latency-bound, the batch size in the generation stage is relatively small, and matrix multiplication is then limited by weight memory access. This implementation greatly reduces the amount of memory access and improves model performance. The implementation also combines algorithm search with SplitK parallel reduction to improve performance further. In actual measurements with batch size 1, w2a16 GEMM on the L card is 1.4x to 1.7x faster than w4a16.
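To make the storage argument concrete, the sketch below packs signed 2-bit integers four to a byte and compares the weight bytes that one GEMM has to read at different bit widths. The bit layout and the function names are illustrative assumptions, not the layout used by the TensorRT-LLM kernel, and the byte counts ignore the scales and zero points.

```python
def pack_2bit(values, alpha=-2):
    """Pack integers in [alpha, alpha + 3] into bytes, four values per byte."""
    if len(values) % 4:
        raise ValueError("length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for j, v in enumerate(values[i:i + 4]):
            code = v - alpha                 # shift to an unsigned code 0..3
            if not 0 <= code <= 3:
                raise ValueError(f"value {v} does not fit in 2 bits")
            byte |= code << (2 * j)          # first value in the lowest bits
        out.append(byte)
    return bytes(out)


def unpack_2bit(packed, alpha=-2):
    """Inverse of pack_2bit."""
    return [((byte >> (2 * j)) & 0b11) + alpha
            for byte in packed for j in range(4)]


def weight_bytes(rows, cols, bits):
    """Bytes needed to store a rows x cols weight matrix at the given bit width."""
    return rows * cols * bits // 8


values = [-2, -1, 0, 1, 1, 1, -2, 0]
packed = pack_2bit(values)
roundtrip = unpack_2bit(packed)
print(packed.hex(), roundtrip)

for bits in (16, 4, 2):
    print(bits, "bit:", weight_bytes(4096, 4096, bits) // 2**20, "MiB")
```

For a memory-bound GEMM, halving the weight bytes relative to w4 bounds the possible speedup at about 2x, which is consistent in magnitude with the measured 1.4x to 1.7x.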
Operator link: https://github.com/NVIDIA/TensorRT-LLM/pull/1568
Figure: implementation principle of the w2 kernel.
5. Experiment
The paper reports ByteDance's internal ASR experiments as well as comparison experiments on open-source models. The internal experimental results are as follows.

In this table, word error rate (WER) is used to measure ASR accuracy. The authors quantized the model to W2A16g64 with different methods. The WER of the floating-point model before quantization is 6.68%. After quantization with GPTQ [1] it is 6.83%, and after quantization with decoupleQ plus block-level minimization it is 6.70%, very close to the WER of the floating-point model. The table also reports the time required for quantization: the price of high quantization accuracy is a longer quantization time. In actual business, after a model has been quantized with decoupleQ, the integer part is fixed and a labeled data set is used to fine-tune the scale and zero point, which improves model accuracy further.
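For readers unfamiliar with the metric, WER is commonly computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below shows that standard definition; it is not the evaluation script used in the paper, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)


wer_deletion = word_error_rate("the cat sat on the mat", "the cat sat on mat")
wer_substitution = word_error_rate("a b c", "a x c")
print(wer_deletion, wer_substitution)
```

On this scale, the gap between 6.68% and 6.70% corresponds to roughly two extra word errors per ten thousand reference words.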
The results of the open source comparison experiment are:
This table compares the quantization results of decoupleQ and other methods on Llama-1/2, with perplexity (PPL) as the evaluation metric. Under the same quantization configuration, the PPL of decoupleQ is lower than that of the other methods most of the time.
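As a reminder of what the metric measures, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below uses that standard definition with natural logarithms; the token probabilities are made up and do not come from any model in the paper.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the tokens (natural log)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform = [math.log(1 / 4)] * 10
ppl_uniform = perplexity(uniform)

# A more confident model gets a lower perplexity on the same number of tokens.
confident = [math.log(0.9)] * 10
ppl_confident = perplexity(confident)
print(ppl_uniform, ppl_confident)
```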
6. Business benefits
decoupleQ quantization is now widely used in ByteDance's speech business. It has been launched in speech generation models (text-to-speech), speech recognition models (automatic speech recognition), and others, and has been deployed in products such as Doubao, Feishu, and Douyin. A large number of online services show that, with decoupleQ quantization, the inference accuracy of W4A16 is fully on par with fp16/bf16 inference, while the accuracy of W2A16 is only slightly worse than fp16/bf16 (after supervised fine-tuning of the floating-point part, it is on the same level as fp16/bf16). Although the paper only covers weight-only quantization, in actual business, once the weights are well quantized, activation quantization becomes much simpler.
Compared with fp16, w8fp16, and w4fp16, w2fp16 also achieves good hardware speedups. At small batch sizes, w2 matrix multiplication is 5 to 6 times faster than fp16 and 1.5 to 1.7 times faster than w4. On internal business models, w2fp16 is 3 to 5 times faster than fp16 and 1.25 to 1.4 times faster than w4fp16. It also significantly reduces the memory occupied by the model weights, leaving much more memory available to the runtime.
7. Summary and Discussion
In the summary and discussion section, the authors also point out two current risks of decoupleQ:
1. decoupleQ aims to use mathematical optimization methods to minimize the L2 loss before and after quantization. However, minimizing L2 loss at the layer level or block level may not necessarily represent the optimal accuracy of the final model;
2. When solving for w and (s, z) in the optimization of equations (5) and (7), only a small amount of calibration data is used, which makes decoupleQ prone to overfitting the calibration data.
Nonetheless, the authors also point out that decoupling the model parameters into an integer part and a floating-point part is very meaningful. If a labeled data set exists, we can fix the integer part after quantization and use the labeled data to train (s, z) specifically, improving model accuracy further. This both preserves the generalization ability of the model (which comes from the fixed integer part) and lets it perform well on specific subtasks (which comes from the fine-tuned floating-point part). In ByteDance's actual business, once one version of a model has been quantized and put online, the next version update only needs to train the floating-point part of the model.
References:
[1] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate quantization for generative pretrained transformers. In The Eleventh International Conference on Learning Representations, 2022.
[2] Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137, 2023.
[3] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.