


Can't a 10-billion-parameter language model run? A Chinese PhD from MIT proposes SmoothQuant quantization, which halves memory requirements and increases speed by 1.56 times!
Although large-scale language models (LLMs) deliver strong performance, their parameter counts easily reach tens or even hundreds of billions, and the resulting demand for compute and memory is so large that ordinary companies cannot afford it.
Quantization is a common compression operation: by reducing the precision of the model weights (for example, from 32-bit to 8-bit), some model performance is traded for faster inference and lower memory requirements.
But for LLMs with more than 100 billion parameters, existing compression methods cannot maintain the accuracy of the model, nor can they run efficiently on hardware.
Recently, researchers from MIT and NVIDIA jointly proposed SmoothQuant, a general-purpose post-training quantization solution for large language models that efficiently enables 8-bit weight, 8-bit activation (W8A8) quantization while maintaining model accuracy, without any training.
Paper link: https://arxiv.org/pdf/2211.10438.pdf
Code link: https://github.com/mit-han-lab/smoothquant
Since activations are harder to quantize than weights, SmoothQuant migrates the quantization difficulty of the activations to the weights through a mathematically equivalent transformation, smoothing out the activation outliers.
SmoothQuant can quantize the weights and activations of all layers of an LLM to INT8, including OPT-175B, BLOOM-176B and GLM-130B.
Compared with existing methods that quantize only the weights or quantize activations in mixed precision, SmoothQuant is more hardware-efficient: it achieves a 1.56x speedup, needs only half the memory of the original LLM, and suffers almost no loss in accuracy.
The advisor, Song Han, is an associate professor in MIT EECS. He received his PhD from Stanford University, and his main research direction is efficient deep learning. He previously proposed the deep compression technique, which can shrink a neural network by an order of magnitude without losing accuracy.
SmoothQuant
Quantization maps high-precision values to lower-precision discrete values. In this paper, the researchers focus on hardware-efficient uniform integer quantization, especially INT8.
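For intuition, here is a minimal sketch of the symmetric per-tensor INT8 quantization described above, i.e. Δ = max(|X|) / (2^(8-1) − 1) and X̄ = round(X/Δ); the function names are illustrative and not taken from the SmoothQuant codebase:

```python
import torch

def quantize_int8_per_tensor(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: x ≈ x_int8 * scale."""
    scale = x.abs().max() / 127.0                      # Δ = max(|X|) / (2^(8-1) - 1)
    x_int8 = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return x_int8, scale

def dequantize(x_int8: torch.Tensor, scale: torch.Tensor):
    return x_int8.float() * scale                      # recover an approximation of x
```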
Quantization can be performed at different granularities: per-tensor quantization applies a single scale to the entire matrix, per-token quantization uses a separate scale for each token of the activations, and per-channel quantization uses a separate scale for each output channel of the weight.
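The three granularities differ only in which axis the scale is computed over. A small sketch, assuming activations of shape [tokens, channels] and weights of shape [out_channels, in_channels]:

```python
import torch

x = torch.randn(16, 4096)    # activations: [tokens, channels]
w = torch.randn(4096, 4096)  # weights: [out_channels, in_channels]

scale_per_tensor  = x.abs().max() / 127.0                              # one scale for the whole tensor
scale_per_token   = x.abs().max(dim=1, keepdim=True).values / 127.0    # one scale per token (row of x)
scale_per_channel = w.abs().max(dim=1, keepdim=True).values / 127.0    # one scale per weight output channel
```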
By observing the quantization statistics of activations, the researchers summarized several patterns:
1. Activations are harder to quantize than weights.
The distribution of weights is relatively uniform and flat. Previous work has shown that quantizing the weights of a large language model to INT8, or even INT4, has little impact on accuracy.

2. Outliers are the main difficulty in activation quantization.

Outliers in activations are typically about 100 times larger than normal values, which leaves very few effective quantization bits/levels for the channels without outliers.

3. Outliers are fixed to certain channels.

Outliers appear only in a small number of channels, but once a channel contains an outlier, the outlier tends to appear in all tokens. The variance across the channels of a given token is large (a few channels are very large while most are small), whereas the variance of a given channel across all tokens is small (outlier channels stay consistently large). Because outliers occur persistently and have small variance within each channel, per-channel quantization of activations would produce a much smaller quantization error than per-tensor quantization.

A simple experiment confirmed this intuition: when quantizing to INT8, per-channel quantization yields much higher accuracy than per-tensor and per-token quantization, and is almost identical to the FP16 baseline.

The researchers therefore smooth the input activations with a per-channel smoothing factor s. To keep the linear layer mathematically equivalent, the weights are scaled inversely. Since the input X is usually produced by an earlier operation (a linear layer, LayerNorm, etc.), the smoothing factor can easily be fused offline into the parameters of that previous layer, adding no kernel-call overhead for the extra scaling. In other cases, for example when the input comes from a residual add, an additional scaling can be inserted on the residual branch.

The goal of smoothing is to choose a per-channel smoothing factor s such that the rescaled activation is easier to quantize. To reduce quantization error, the effective quantization bits of every channel should be increased, and the total number of effective bits is largest when all channels share the same maximum magnitude. The most direct choice of smoothing factor is therefore the per-channel maximum of the input, sj = max(|Xj|), which guarantees that after the division all activation channels have the same maximum value and are easy to quantize. Note that the activation range is dynamic and differs across input samples, so the researchers estimate the scale of the activation channels using calibration samples from the pre-training dataset.

However, this choice pushes all of the quantization difficulty onto the weights; the weight quantization error then becomes very large and accuracy drops sharply. Conversely, choosing sj = 1/max(|Wj|) pushes all of the difficulty from the weights onto the activations, and model performance again suffers from excessive activation quantization error. The difficulty therefore has to be split between weights and activations so that both are easy to quantize. To control how much difficulty is migrated from activations to weights, the researchers introduce a hyperparameter, the migration strength α, and choose sj = max(|Xj|)^α / max(|Wj|)^(1−α).
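The following is a minimal sketch of this migration for a single linear layer y = xWᵀ, assuming the per-channel activation maxima have already been collected on calibration data; the function and variable names are illustrative, not the official SmoothQuant implementation:

```python
import torch

@torch.no_grad()
def smooth_linear(act_max: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5, eps: float = 1e-5):
    """Compute per-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
    and fold them into the weight, so that y = (x / s) @ (W * s).T == x @ W.T."""
    # act_max: [in_features], per-input-channel max |X_j| estimated on calibration samples
    # weight:  [out_features, in_features]
    w_max = weight.abs().max(dim=0).values.clamp(min=eps)      # max |W_j| per input channel
    s = act_max.clamp(min=eps).pow(alpha) / w_max.pow(1 - alpha)
    smoothed_weight = weight * s                               # W_hat = W * diag(s)
    return s, smoothed_weight
```

In practice the division of the activation by s is not run as a separate kernel: it is folded offline into the preceding LayerNorm or linear layer, so only the already-smoothed activation is ever materialized and quantized.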
It turns out that for most models, such as OPT and BLOOM, α = 0.5 is a good balance point that splits the quantization difficulty evenly, especially when the same quantizer is used for the weights and the activations: this choice ensures that the weights and activations of corresponding channels have similar maximum values and therefore share the quantization difficulty. For models with much larger activation outliers, such as GLM-130B, whose activations contain about 30% outliers and are harder to quantize, a larger α (such as 0.75) can be chosen to migrate more of the difficulty to the weights.

Applying SmoothQuant to the Transformer block

Linear layers account for most of the parameters and computation of an LLM. By default, SmoothQuant scales the input activations of all linear layers in the Transformer and quantizes these linear layers with W8A8; it also quantizes the BMM operators in the attention computation. In this pipeline, INT8 is used for the inputs and weights of the compute-intensive operators, i.e. the linear layers and the BMMs in the attention layer, while lightweight element-wise operations such as Softmax and LayerNorm are kept in FP16. This design balances accuracy and inference efficiency.

Experiments

The researchers evaluated SmoothQuant on three families of large language models, OPT, BLOOM and GLM-130B, using seven zero-shot tasks: LAMBADA, HellaSwag, PIQA, WinoGrande, OpenBookQA, RTE and COPA.

The results show that SmoothQuant can handle the quantization of very large LLMs, whose activations are harder to quantize. SmoothQuant matches FP16 accuracy on all evaluation datasets, while the W8A8, ZeroQuant and Outlier Suppression baselines produce almost random results, and SmoothQuant can losslessly quantize all open LLMs with more than 100B parameters.

SmoothQuant's O1 and O2 levels successfully maintain floating-point accuracy, while the O3 level (per-tensor static) reduces average accuracy by 0.8%, likely because the statically collected statistics differ from the activation statistics of the real evaluation samples. Nevertheless, SmoothQuant-O1 matches FP16 accuracy, and SmoothQuant-O3 loses only 1% accuracy, which is significantly better than the baselines.

SmoothQuant is effective not only for very large LLMs with over 100B parameters, but also gives stable results for smaller models: it works on OPT models of all scales and matches FP16 accuracy under INT8 quantization.

To demonstrate the speedup and memory savings of SmoothQuant-O3 integrated into PyTorch and FasterTransformer, the researchers measured the end-to-end latency of generating all hidden states for a batch of 4 sentences in a single pass, i.e. the context-stage latency, and recorded the peak GPU memory usage during the process. Because Huggingface lacks support for model parallelism, the PyTorch implementation of SmoothQuant was measured only on a single GPU, so OPT-6.7B, OPT-13B and OPT-30B were selected for evaluation. In the FasterTransformer library, SmoothQuant integrates seamlessly with tensor parallelism, so single-GPU and multi-GPU benchmarks were run on OPT-13B, OPT-30B, OPT-66B and OPT-175B.
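For context on where the speedup measured below comes from, here is a hedged sketch of what a W8A8 linear layer does at inference time; the names are illustrative, and plain PyTorch is used to emulate the INT8 GEMM that the real GPU kernels provide:

```python
import torch

def w8a8_linear(x, w_int8, w_scale, act_scale):
    """Sketch of a W8A8 linear: quantize the (already smoothed) activation to INT8,
    run an integer matmul, then dequantize the accumulator back to floating point."""
    # act_scale may be computed per token at runtime (dynamic, as in O1)
    # or fixed from calibration samples (static, as in O3).
    x_int8 = torch.clamp(torch.round(x / act_scale), -127, 127).to(torch.int8)
    # Real kernels run an INT8 GEMM with INT32 accumulation on Tensor Cores;
    # plain PyTorch matmul does not accept int8 inputs, so emulate it in int32.
    acc = x_int8.to(torch.int32) @ w_int8.to(torch.int32).T
    return acc.float() * (act_scale * w_scale)   # y ≈ (x_int8 * act_scale) @ (w_int8 * w_scale).T
```

Because both operands stay in INT8 up to the GEMM, the weights and activations occupy roughly half the memory of FP16, which is consistent with the memory savings reported next.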
Experimental results on an NVIDIA A100 80GB GPU server show that, with the PyTorch implementation, SmoothQuant is consistently faster than the FP16 baseline in both inference latency and peak memory usage; with a sequence length of 256 it achieves a 1.51x speedup on OPT-30B. There is also a clear trend that the larger the model, the more pronounced the speedup, whereas LLM.int8() is almost always slower than the FP16 baseline because of the large overhead of its mixed-precision activation representation. In terms of memory, both SmoothQuant and LLM.int8() nearly halve the memory usage of the FP16 model, with SmoothQuant saving slightly more because it uses INT8 GEMMs throughout. Compared with FasterTransformer's FP16 implementation of OPT, SmoothQuant-O3 further reduces the execution latency of OPT-13B and OPT-30B on a single GPU, by up to 1.56 times.