Since the emergence of GPT-3 in 2020, and more recently with the popularity of ChatGPT, the GPT family of generative large language models has been firmly in the spotlight, showing strong performance across a wide range of tasks.
But the models' enormous scale also drives up computing costs and makes deployment harder.
For example, the GPT‑175B model occupies at least 320 GB of storage in half-precision (FP16) format, and inference requires at least five A100 GPUs with 80 GB of memory each.
Model compression is currently a common way to reduce the computational cost of large models, but so far almost all existing GPT compression methods have focused on quantization, i.e., reducing the precision of the numerical representation of individual weights.
Another model compression method is pruning, which removes network elements, ranging from individual weights (unstructured pruning) to higher-granularity components such as entire rows or columns of weight matrices (structured pruning). This approach works well for vision models and smaller-scale language models, but it causes a loss of accuracy that requires extensive retraining to recover, which again becomes too expensive for models as large as GPT. Although some one-shot pruning methods can compress a model without retraining, they are too computationally intensive to apply to models with billions of parameters.
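As a concrete baseline, unstructured magnitude pruning simply zeroes out the smallest-magnitude weights of a layer. A minimal NumPy sketch (the function name, shapes, and sparsity level here are illustrative, not from the paper):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured magnitude pruning: zero out the smallest-magnitude
    fraction of weights, treating the tensor as one flat array."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # magnitude of the k-th smallest-|w| entry serves as the cutoff
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
w_pruned = magnitude_prune(w, 0.5)
print((w_pruned == 0).mean())  # → 0.5
```

Pruning alone is cheap; the expensive part discussed above is the retraining needed afterwards to recover the lost accuracy.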
So, for a model of GPT-3's size, is there a way to prune it accurately while keeping the accuracy loss minimal and the computational cost low?
Recently, two researchers from the Institute of Science and Technology Austria (ISTA), Elias Frantar and Dan Alistarh, proposed SparseGPT, the first accurate one-shot pruning method that works at the scale of 10 to 100 billion parameters.
Paper address: https://arxiv.org/pdf/2301.00774.pdf

SparseGPT can prune GPT-family models to 50% sparsity in a single step, without any retraining. The largest publicly available model, GPT-175B, can be pruned this way in just a few hours on a single GPU. SparseGPT is also highly accurate: when run on the largest open-source models, OPT‑175B and BLOOM‑176B, it reaches 60% sparsity with minimal loss of accuracy.

## The SparseGPT Algorithm

Research on very large models has been very active in recent years, but so far no model with more than 10 billion parameters has been sparsified to a high degree while remaining accurate, because existing methods are too computationally expensive. Taking OBC, the most accurate post-training method, as an example, compressing a billion-parameter model takes more than an hour. AdaPrune, the fastest known post-training method, still takes minutes to prune a billion-parameter model; at that rate, a GPT-3-scale model is estimated to require hundreds of hours (weeks) of computation.

Most existing pruning methods, such as gradual magnitude pruning, require extensive retraining after the pruning step to recover accuracy, while GPT-scale models usually demand a great deal of computation and hyperparameter tuning for training or fine-tuning, which makes retraining-based methods hard to apply. Gradual pruning is therefore infeasible at GPT scale.

The ISTA team's SparseGPT method can process models with more than 100 billion parameters on a single GPU in a few hours, and it is accurate enough to prune a model to 50%-60% sparsity without significantly degrading performance.
The core of SparseGPT is a new large-scale approximate sparse regression algorithm that generalizes to semi-structured (2:4 and 4:8) patterns and is compatible with existing weight quantization methods. Because it performs no fine-tuning at all, SparseGPT is a post-training method for GPT-scale models. Many post-training quantization methods for GPT-scale models already exist, such as ZeroQuant, LLM.int8() and nuQmm, but quantizing activations can be difficult due to the presence of outlier features. GPTQ uses approximate second-order information to quantize weights accurately to 2-4 bits, scales to the largest models, and, combined with efficient GPU kernels, can deliver 2-5x inference acceleration.
But since SparseGPT focuses on sparsification rather than quantization, it is a complement to quantization methods, and the two can be applied in combination.
In addition to unstructured pruning, SparseGPT also supports semi-structured patterns, such as the popular n:m sparsity format, whose 2:4 ratio achieves speedups on NVIDIA Ampere GPUs.
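The n:m constraint means that in every group of m consecutive weights along the input dimension, at most n are nonzero. A hedged sketch of magnitude-based 2:4 masking (SparseGPT itself chooses which weights to keep with an error-aware criterion, so this only illustrates the format, not the method):

```python
import numpy as np

def prune_n_m(w: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """n:m semi-structured pruning: in every group of m consecutive
    weights, keep only the n largest-magnitude entries (e.g. 2:4,
    the pattern accelerated by NVIDIA Ampere sparse Tensor Cores)."""
    rows, cols = w.shape
    assert cols % m == 0, "row length must be a multiple of m"
    groups = w.reshape(rows, cols // m, m)
    # indices of the (m - n) smallest-magnitude weights in each group
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    return pruned.reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 8))
w24 = prune_n_m(w, 2, 4)
# every group of 4 consecutive weights now contains exactly 2 zeros
assert ((w24.reshape(2, 2, 4) == 0).sum(axis=-1) == 2).all()
```

The regularity of the pattern is what makes it hardware-friendly: the GPU knows in advance that half of each 4-element group is zero.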
After evaluating the effectiveness of SparseGPT compression, the researchers found that larger language models are easier to sparsify: compared with the existing magnitude pruning method, SparseGPT achieves a higher degree of sparsity while keeping the accuracy loss minimal.
The researchers implemented SparseGPT in PyTorch and used HuggingFace's Transformers library to handle models and datasets, running everything on a single NVIDIA A100 GPU with 80 GB of memory. Under these experimental conditions, SparseGPT fully sparsifies a 175-billion-parameter model in approximately 4 hours.
The researchers sparsify the Transformer layers sequentially, which significantly reduces memory requirements and also greatly improves accuracy compared with processing all layers in parallel. All compression experiments were performed in one shot, without any fine-tuning.
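The sequential scheme can be sketched as follows: each layer is pruned using the activations produced by the layers already pruned before it, and only one layer's state needs to be held at a time. The toy ReLU stack and the magnitude-pruning stand-in below are assumptions for illustration, not the authors' actual per-layer solver:

```python
import numpy as np

def magnitude_prune_inplace(W: np.ndarray, sparsity: float = 0.5) -> None:
    """Stand-in for the real per-layer pruning solver."""
    k = int(W.size * sparsity)
    thr = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    W[np.abs(W) <= thr] = 0.0

def compress_sequentially(weights, x):
    """Prune a stack of linear layers one at a time, feeding each layer
    the activations produced by the already-pruned layers before it.
    Handling one layer at a time keeps memory requirements low."""
    for W in weights:
        magnitude_prune_inplace(W)   # prune this layer using its inputs x
        x = np.maximum(W @ x, 0.0)   # propagate *pruned* activations onward
    return x

rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 8)) for _ in range(3)]
out = compress_sequentially(layers, rng.normal(size=(8, 4)))
assert all((W == 0).mean() == 0.5 for W in layers)
```

Propagating the pruned activations, rather than the dense ones, lets each layer compensate for errors introduced earlier, which is part of why sequential processing is more accurate than pruning all layers independently.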
The evaluation mainly covers the OPT family, a set of models ranging from 125 million to 175 billion parameters, which makes it easy to observe how pruning scales with model size. The 176-billion-parameter variant of BLOOM was also analyzed.
In terms of datasets and metrics, the experiments used perplexity on the raw WikiText2 test set to evaluate the accuracy of the SparseGPT compression method, with some zero-shot accuracy metrics added for interpretability. The evaluation focuses on the accuracy of sparse models relative to the dense baseline, rather than on absolute numbers.
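Perplexity is the exponential of the average negative log-likelihood per token; lower is better. A minimal sketch of the definition (the toy log-probabilities below are invented; real evaluation would compute them from model logits over the WikiText2 test set):

```python
import numpy as np

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token.
    It is the metric used here to compare sparse models against
    their dense baselines."""
    return float(np.exp(-np.mean(token_log_probs)))

# toy example: a model assigning probability 0.25 to each of 8 tokens
lp = [np.log(0.25)] * 8
assert abs(perplexity(lp) - 4.0) < 1e-9
```

Because perplexity is an exponential of an average, even small per-token degradations show up clearly, which is why it is a sensitive measure of compression quality.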
The researchers pruned all linear layers of the entire OPT model family (excluding the embeddings and the head) to 50% unstructured sparsity, full 4:8, or full 2:4 semi-structured sparsity; the results are shown below.
It can be seen that models compressed with magnitude pruning have poor accuracy at all sizes, and the larger the model, the greater the accuracy drop.
Models compressed with SparseGPT show a different trend, with the perplexity loss shrinking as models grow beyond 2.7 billion parameters. The general trend is that larger models are easier to sparsify: at a fixed sparsity level, the relative accuracy drop of the sparse model compared with its dense counterpart shrinks as model size increases. The authors speculate that this may be due to the higher degree of overparameterization and the overall greater noise resistance of larger models.
Compared with the dense baseline, at the largest scale, compressing the model to 4:8 and 2:4 sparsity with SparseGPT increases perplexity by only 0.11 and 0.39, respectively. This means a roughly 2x speedup can be achieved in practice, as commercial NVIDIA Ampere GPUs already support 2:4 sparsity.
The authors also studied how the performance of the two hundred-billion-parameter-scale models, OPT-175B and BLOOM-176B, varies with the degree of sparsity applied by SparseGPT. The results are shown in the figure below.
It can be seen that for the OPT-175B model, magnitude pruning can achieve at most 10% sparsity before a significant accuracy loss sets in, whereas SparseGPT reaches 60% sparsity with only a modest increase in perplexity. For the BLOOM-176B model, magnitude pruning achieves 30% sparsity without significant accuracy loss, but SparseGPT reaches 50%, a 1.66x improvement. Moreover, at 80% sparsity, the perplexity of models compressed with SparseGPT still remains at a reasonable level, while magnitude pruning already pushes perplexity above 100 at 40% sparsity for OPT and 60% sparsity for BLOOM.
Additionally, SparseGPT can remove approximately 100 billion weights from these models with limited impact on accuracy.
Finally, this study shows for the first time that a large Transformer-based pre-trained model can be compressed to high sparsity through one-shot weight pruning, without any retraining and with low accuracy loss.
It is worth noting that SparseGPT's approach is local: after each pruning step, it performs weight updates designed to preserve each layer's input-output relationship, and these updates are computed without any global gradient information. The high degree of parameterization of large GPT models thus appears to allow this approach to find an accurate sparse model directly among the "neighbors" of the dense pre-trained model.
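The local, per-layer objective can be illustrated as a sparse regression problem: choose a support (set of kept weights) for each row, then update the kept weights so the layer's output on calibration data matches the dense output. The sketch below uses magnitude-based support selection and a plain least-squares update; SparseGPT's actual algorithm uses approximate second-order (Hessian) information and an efficient column-wise scheme, so this is only a conceptual illustration with invented shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 16, 8, 64
W = rng.normal(size=(d_out, d_in))   # dense layer weights
X = rng.normal(size=(d_in, n))       # calibration inputs

sparsity = 0.5
k = int(d_in * sparsity)             # weights to drop per row

W_sparse = np.zeros_like(W)
for i in range(d_out):
    # support: keep the largest-magnitude half of this row's weights
    keep = np.argsort(-np.abs(W[i]))[: d_in - k]
    # least-squares update of the kept weights so that the row's output
    # on the calibration data matches the dense row's output
    target = W[i] @ X                          # (n,)
    w_new, *_ = np.linalg.lstsq(X[keep].T, target, rcond=None)
    W_sparse[i, keep] = w_new

dense_out = W @ X
# pruning without the update: just zero the dropped weights
pruned_only = np.where(W_sparse != 0, W, 0.0)
err_pruned = np.linalg.norm(dense_out - pruned_only @ X)
err_updated = np.linalg.norm(dense_out - W_sparse @ X)
assert err_updated <= err_pruned  # the local update reduces the error
```

On a fixed support, the least-squares solution can never reconstruct the dense outputs worse than simply zeroing the dropped weights, which is exactly why such local, gradient-free updates help preserve accuracy.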
In addition, because the accuracy metric used in the experiments (perplexity) is very sensitive, the outputs of the resulting sparse models appear to be closely correlated with those of the dense models.
This research is a significant step toward easing the compute constraints of large models. One direction for future work is to study fine-tuning mechanisms for large models to recover accuracy further; at the same time, extending SparseGPT's methods to apply during training would reduce the computational cost of training large models.