


A new approach to large-model quantization, open-sourced by ByteDance: 2-bit quantized models with accuracy on par with fp16

The AIxiv column is a column where this site publishes academic and technical content. In the past few years, the AIxiv column of this site has received more than 2,000 reports, covering top laboratories from major universities and companies around the world, effectively promoting academic exchanges and dissemination. If you have excellent work that you want to share, please feel free to contribute or contact us for reporting. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
As large language models grow in popularity they also grow in size, which pushes their inference cost up. Model quantization has therefore become an active research topic.
Recently, ByteDance released a new quantization approach that abandons the traditional quantization paradigm and instead models quantization as a mathematical optimization problem: it defines an objective function for the quantized model and searches for the solution that minimizes it. The paper is posted on arXiv and the code is open source; according to the authors, all results in the paper can be reproduced with one click. The approach achieved good results in the experiments reported below.
Paper link: https://arxiv.org/abs/2404.12759
Project link: https://github.com/bytedance/decoupleQ
W2 operator: https://github.com/NVIDIA/TensorRT-LLM/pull/1568
1. Background
The rapid development of large models has made inference increasingly expensive. Model quantization, as a way to reduce inference cost, has received growing attention. Under the traditional quantization paradigm, however, model accuracy drops rapidly at very low bit widths. For this reason, the authors propose a new approach: decouple the model parameters into an integer part and a floating-point part, and model the quantization task as a mathematical optimization problem, so that the model keeps high accuracy even at very low bit widths. The advantage is clear. We no longer need to deal with quantization-specific issues such as sensitive channels or outliers. Instead, we only need to model the quantization problem mathematically, find a suitable objective function, and then solve it.
2. Traditional quantization
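Traditionally, a model is quantized as follows:
$$\widehat{W} = \mathrm{clip}\left(\left\lfloor \frac{W_0 - z}{s} \right\rceil,\ \alpha,\ \beta\right) \tag{1}$$
where $W_0$ is the floating-point weights of the model before quantization; $s$ and $z$ are the coefficients of a linear transformation, the scale and the zero point; $\alpha$ and $\beta$ are the lower and upper bounds of the integer representation range (for int4 quantization, for example, $\alpha = -8$ and $\beta = 7$); and $\lfloor \cdot \rceil$ is the rounding function, which generally rounds to the nearest integer.
For $s$ and $z$, asymmetric quantization generally takes:
$$s = \frac{\max(W_0) - \min(W_0)}{\beta - \alpha}, \qquad z = \min(W_0) - s\,\alpha \tag{2}$$
In this way, floating-point weights distributed in $[\min(W_0), \max(W_0)]$ are linearly mapped to the interval $[\alpha, \beta]$. Dequantization then recovers approximate floating-point weights from the integers:
$$\widetilde{W} = \widehat{W} \cdot s + z \tag{3}$$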
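The round-to-nearest scheme described above can be sketched in a few lines of NumPy. This is only an illustration under simple assumptions: a single scale and zero point for the whole tensor, and a hypothetical helper name `quantize_asymmetric` that does not come from the decoupleQ code base.

```python
import numpy as np

def quantize_asymmetric(w0, bits=4):
    """Round-to-nearest asymmetric quantization of a float array (illustrative)."""
    alpha, beta = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # e.g. -8 and 7 for int4
    # Scale and zero point from the min/max of the weights.
    # Assumes w0 is not constant, otherwise s would be zero.
    s = (w0.max() - w0.min()) / (beta - alpha)
    z = w0.min() - s * alpha
    # Integer part: round to nearest, then clip to the representable range.
    w_hat = np.clip(np.rint((w0 - z) / s), alpha, beta).astype(np.int8)
    # Dequantization: the affine map back to floating point.
    w_tilde = w_hat * s + z
    return w_hat, s, z, w_tilde

rng = np.random.default_rng(0)
w0 = rng.normal(size=64)
w_hat, s, z, w_tilde = quantize_asymmetric(w0, bits=4)
print(w_hat.min(), w_hat.max(), np.abs(w_tilde - w0).max())
```

With this choice of `s` and `z`, the smallest weight maps to `alpha`, the largest to `beta`, and the reconstruction error of every weight is at most half a quantization step.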
3. decoupleQ
Looking at the roles that equations (1) to (3) play in quantization from a different angle, we find that we do not actually need equations (1) and (2). Once a large model has been quantized and handed to the downstream inference-engine team, all that is needed is the integer matrix $\widehat{W}$ and $(s, z)$ in equation (3). In other words, $(s, z)$ in equation (3) can be treated as the coefficients of an ordinary affine transformation, with no need to keep the meaning they have in equation (2), and these coefficients can be obtained by mathematical optimization.
Taking equation (3) further, the parameters of a large model can be decoupled into an integer part $\widehat{W}$ and a floating-point part $(s, z)$. After this decoupling, quantizing the model becomes a matter of solving for the integer part $\widehat{W}$ and the floating-point part $(s, z)$, which can be optimized alternately. To do so, the optimization objective and its constraints must be specified.
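For a linear layer, we can construct the following optimization objective:
$$\min_{\widetilde{W}} \left\| X\widetilde{W} - XW_0 \right\|_2^2 \tag{4}$$
where $X$ is the input of the layer, $W_0$ is the full-precision weight, $\widetilde{W}$ is the dequantized weight, and $\|\cdot\|_2^2$ denotes the sum of squared entries. Expanding the norm shows that the objective depends on $X$ only through $H = X^T X$, which is a symmetric positive semidefinite matrix (positive definite when the columns of $X$ are linearly independent).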
Generally speaking, to improve quantization accuracy we can use per-channel quantization for the model weights. In per-channel quantization, each column of $\widetilde{W}$ is optimized independently when optimizing equation (4), so we only need to focus on one column.
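The optimization problem can then be written as follows (for simplicity of notation, the symbols are redefined here):
$$\min_{w;\,s,z}\ g(w; s, z) \quad \text{s.t.}\quad w_i - \mathrm{round}(w_i) = 0,\ \ \alpha \le w_i \le \beta \quad \text{for all } i \tag{5}$$
where the objective function is
$$g(w; s, z) = \frac{1}{2}\,(w \cdot s + z - b)^T H\,(w \cdot s + z - b) \tag{6}$$
Here $w$ is one column of $\widehat{W}$ and $b$ is the corresponding column of $W_0$. The other symbols are defined as before.
In fact, the objective functions (6) and (4) are consistent: restricted to a single column they differ only by the constant factor $\frac{1}{2}$, and $w \cdot s + z$ is exactly the dequantization process.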
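The agreement between the per-column quadratic form and the layer-output error can be checked numerically. The sketch below uses random data and arbitrary values for the scale and zero point; all names (`X`, `b`, `w`, `s`, `z`) are illustrative and, as an assumption, a single scalar scale and zero point is used for the whole column.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))      # calibration inputs: 32 samples, 8 input features
b = rng.normal(size=8)            # one column of the full-precision weight
w = rng.integers(-2, 2, size=8)   # a candidate 2-bit integer column, values in [-2, 1]
s, z = 0.5, 0.1                   # arbitrary scale and zero point

H = X.T @ X                       # the matrix H = X^T X
r = w * s + z - b                 # dequantized column minus original column
g = 0.5 * r @ H @ r               # quadratic-form objective

# The same quantity written as half the squared error of the layer output.
layer_loss = 0.5 * np.sum((X @ (w * s + z) - X @ b) ** 2)
print(g, layer_loss)
```

Because only `H` enters the quadratic form, the calibration inputs can be reduced to this matrix once, and its size depends on the number of input features rather than on the number of calibration samples.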
Turning the quantization problem into a mathematical optimization problem of the form (5) is the key difference between decoupleQ and traditional quantization papers. With this transformation we only need to focus on solving (5) and no longer need to handle quantization-specific minutiae such as outliers.
Solving (5) is not easy because of the constraints on $w$, especially the non-convex constraint $w_i - \mathrm{round}(w_i) = 0$. The paper gives an alternating scheme: after obtaining a good initialization of $(s, z)$ and $w$, solve for $(s, z)$ and $w$ alternately and iteratively. When solving for $(s, z)$, note that (5) is an unconstrained quadratic in $(s, z)$, so the analytical solution can be obtained directly by differentiating the objective and setting the derivative to zero. When solving for $w$, the authors use two levels of approximation: the first level converges better but is slow to solve; the second level borrows the idea of GPTQ [1], converges slightly worse, but is faster.
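The closed-form step for $(s, z)$ can be illustrated as follows. This is a minimal sketch, not the authors' implementation: it assumes one scalar scale and one scalar zero point per column (the paper's group-wise setting would solve one such small system per group), and the helper name `solve_scale_zero` is hypothetical. The step that updates $w$ is not shown.

```python
import numpy as np

def solve_scale_zero(H, w, b):
    """Minimize 0.5 * (s*w + z - b)^T H (s*w + z - b) over scalars s and z."""
    w = w.astype(b.dtype)
    ones = np.ones_like(b)
    # Setting the derivatives with respect to s and z to zero gives a 2x2 linear system.
    A = np.array([[w @ H @ w,    w @ H @ ones],
                  [ones @ H @ w, ones @ H @ ones]])
    rhs = np.array([w @ H @ b, ones @ H @ b])
    # lstsq instead of solve, in case w is constant and the system is singular.
    s, z = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return s, z

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 16))
H = X.T @ X
w = np.arange(16) % 4 - 2          # fixed integer part with values in [-2, 1]
b = 0.3 * w + 0.7                  # weights that are exactly representable
s_hat, z_hat = solve_scale_zero(H, w, b)
print(s_hat, z_hat)
```

In this toy case the target weights were built from a scale of 0.3 and a zero point of 0.7, so the solve recovers those values up to floating-point error.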
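To further improve the accuracy of the quantized model, the authors point out that, in addition to MSE minimization at the layer level, MSE minimization can also be done at the block level, that is:
$$\min\ \left\| \widetilde{\mathrm{Block}}(X) - \mathrm{Block}(X) \right\|_2^2 \tag{7}$$
where $\mathrm{Block}$ is the original floating-point transformer block, $\widetilde{\mathrm{Block}}$ is the same block with its linear layers quantized, and $X$ is the input of the block.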
In this step, after every linear layer in a transformer block has been quantized, the authors fix the integer parts and fine-tune the floating-point part $(s, z)$ together with the layer norm parameters at the block level. Experimental results show that this fine-tuning step further improves model accuracy.
4. W2 operator implementation
Running inference on the quantized model requires quantized operators. Since no ready-made w2a16 operator was available in the industry, the authors developed a w2 GEMM CUDA kernel based on the w4 operator in TensorRT-LLM, enabling efficient inference for w2a16 models.
The quantized model is loaded and stored in GPU memory as 2-bit weights, so it occupies relatively little GPU memory. At runtime the CUDA kernel loads the 2-bit weights into registers, uses hardware instructions to convert them efficiently to bf16, and then performs the GEMM with the activations. Because the target scenario is latency-bound, the batch size in the generation stage is relatively small, and matrix multiplication is then limited by weight memory access. This implementation greatly reduces the amount of memory access and improves model performance. The implementation also combines algorithm search and SplitK parallel reduce to improve performance further. According to actual measurements, with batch size 1 the w2a16 GEMM on the L card is 1.4x-1.7x faster than w4a16.
Operator link: https://github.com/NVIDIA/TensorRT-LLM/pull/1568
[Figure: implementation principle of the kernel]
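To make the storage saving concrete, the sketch below packs four 2-bit weights into each byte and unpacks them again in NumPy. The bit layout and the helper names `pack_int2` and `unpack_int2` are assumptions chosen for illustration only; the actual kernel in the TensorRT-LLM pull request uses its own layout and performs the conversion on the GPU.

```python
import numpy as np

def pack_int2(w_hat):
    """Pack integers in [-2, 1] into bytes, four 2-bit codes per byte.

    Assumes the length of w_hat is a multiple of 4.
    """
    codes = (w_hat + 2).astype(np.uint8).reshape(-1, 4)  # unsigned codes 0..3
    packed = (codes[:, 0]
              | (codes[:, 1] << 2)
              | (codes[:, 2] << 4)
              | (codes[:, 3] << 6))
    return packed.astype(np.uint8)

def unpack_int2(packed):
    """Inverse of pack_int2: recover the signed 2-bit integers."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 3              # one row of 4 codes per byte
    return codes.astype(np.int8).reshape(-1) - 2

rng = np.random.default_rng(0)
w_hat = rng.integers(-2, 2, size=4096).astype(np.int8)
packed = pack_int2(w_hat)
print(packed.nbytes, w_hat.size * 2)   # bytes as 2-bit codes vs. bytes as fp16
```

For 4096 weights the packed form takes 1024 bytes, compared with 8192 bytes in fp16, which is the kind of reduction in weight traffic that a memory-bound small-batch GEMM benefits from.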
5. Experiment
The paper reports results from ByteDance's internal ASR experiments as well as comparison experiments on open-source models. For the internal experiments, the authors use word error rate (WER) to measure ASR accuracy and quantize the model to W2A16g64 with different methods. The WER of the floating-point model before quantization is 6.68%; after quantization with GPTQ [1] it is 6.83%; with decoupleQ plus block-level minimization it is 6.70%, very close to the WER of the floating-point model. The paper also reports the time required for quantization: the price of high quantization accuracy is a longer quantization time. In actual business use, after a model is quantized with decoupleQ, the integer part is fixed and a labeled dataset is used to fine-tune the scale and zero point, which improves model accuracy further.
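For readers unfamiliar with the metric, WER is the word-level edit distance between the recognized text and the reference transcript, divided by the number of reference words. A minimal sketch (the function name `word_error_rate` and the example sentences are illustrative and unrelated to the paper's evaluation code or data):

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Here one of the six reference words is missing from the hypothesis, so the WER is 1/6.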
The open-source comparison experiments compare the quantization results of decoupleQ and other methods on Llama-1/2, with perplexity (PPL) as the evaluation metric. Under the same quantization configuration, the PPL of decoupleQ is lower than that of the other methods in most cases.
6. Business benefits
decoupleQ quantization is now widely used in ByteDance's speech business. It has been deployed in speech generation models (text-to-speech), speech recognition models (automatic speech recognition), and others, and has shipped in products such as Doubao, Feishu, and Douyin. A large number of online services show that, with decoupleQ quantization, W4A16 inference accuracy is fully on par with fp16/bf16 inference, while W2A16 accuracy is only slightly worse than fp16/bf16 (after SFT of the floating-point part, it is on the same level as fp16/bf16). Although the paper covers only weight-only quantization, in actual business, once the weights are well quantized, activation quantization becomes much simpler.
In terms of hardware acceleration, good speedups were achieved relative to fp16, w8fp16, and w4fp16. At small batch sizes, w2 matrix multiplication is 5-6 times faster than fp16 and 1.5-1.7 times faster than w4. On internal business models, w2fp16 is 3-5 times faster than fp16 and 1.25-1.4 times faster than w4fp16. It also significantly reduces the GPU memory occupied by model weights, leaving much more room for runtime memory use.
7. Summary and Discussion
In the summary and discussion section, the authors also point out two risks that decoupleQ currently has:
1. decoupleQ uses mathematical optimization to minimize the L2 loss between the outputs before and after quantization. However, minimizing the L2 loss at the layer level or block level does not necessarily yield the best accuracy for the final model;
2. When optimizing equations (5) and (7), the integer part and $(s, z)$ are solved on only a small amount of calibration data, which makes decoupleQ prone to overfitting the calibration data.
Nonetheless, the authors also point out that decoupling the model parameters into an integer part and a floating-point part is a meaningful idea. If a labeled dataset exists, we can fix the integer part after quantization and use the labeled dataset to train $(s, z)$ specifically, further improving model accuracy. This preserves the generalization ability of the model (which comes from the fixed integer part) while letting it perform well on specific subtasks (thanks to the fine-tuned floating-point part). In ByteDance's actual business, once a version of the model has been quantized and deployed, the next version update only needs to train the floating-point part of the model.
References:
[1] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate quantization for generative pretrained transformers. In The Eleventh International Conference on Learning Representations, 2022.
[2] Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137, 2023.
[3] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.