Highly Rated ICML 2024 Paper | Fine-Tuning Large Models with Zeroth-Order Optimizers, Greatly Reducing Memory

Wang Lin
Published: 2024-07-16 03:17:30
Original
The AIxiv column is where this site publishes academic and technical content. Over the past several years, the AIxiv column has received and reported on more than 2,000 submissions, covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, you are welcome to submit it or contact us for coverage. Submission emails: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

About the co-first authors: Yihua Zhang is a third-year Ph.D. student in Computer Science at Michigan State University, advised by Prof. Sijia Liu; his main research interests are the safety, privacy, and efficiency of large models. Pingzhi Li received his bachelor's degree from the University of Science and Technology of China and will begin his Ph.D. at the University of North Carolina at Chapel Hill in Fall 2024 under Prof. Tianlong Chen; his research focuses on efficient machine learning and AI4Science. Junyuan Hong is a postdoctoral researcher at the University of Texas at Austin, advised by Prof. Zhangyang Wang; he received his Ph.D. from Michigan State University under Prof. Jiayu Zhou, and his current research focuses on trustworthy large language models and medical applications of AI. Jiaxiang Li is a postdoctoral researcher at the University of Minnesota, working with Prof. Mingyi Hong and Prof. Shuzhong Zhang on numerical optimization theory, machine learning theory, and optimization for large-scale machine learning.

Open-source large language models (LLMs) are flourishing, and fine-tuning is the most widely adopted way to adapt them to downstream tasks. First-order optimizers based on automatic differentiation (SGD, Adam, and so on) dominate model fine-tuning, but as models keep growing they put ever more pressure on GPU memory. How to reduce memory during fine-tuning so that a single GPU suffices has therefore become a hot research problem. Notably, back-propagation is the cornerstone of these first-order optimizers, used to compute the gradient of every network weight, yet it is also a memory killer: the overhead of storing the huge computation graph becomes especially prominent in the era of large models. Zeroth-order optimization, by contrast, requires no computation graph at all; it approximates the network's gradients with finite differences and, by avoiding back-propagation (BP) entirely, drastically reduces the memory overhead of updating neural networks.

Just as first-order stochastic gradient descent has many variants, zeroth-order optimizers also have a range of improved algorithms that had not previously been explored in this setting. Recently, researchers from Michigan State University, the University of North Carolina at Chapel Hill, the University of Texas at Austin, the University of Minnesota Twin Cities, IBM Research, Princeton University, and Alibaba DAMO Academy jointly released a comprehensive benchmark paper: Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark. It covers six back-propagation-free (BP-free) optimizers, five families of large models, tasks at three levels of complexity, four fine-tuning schemes, and three new algorithms for enhancing zeroth-order optimizers. The paper has been accepted to ICML 2024 with high review scores and the code is open-source; details follow.

  • Paper: https://arxiv.org/abs/2402.11592
  • Code: https://c
  • Zeroth-order optimization lecture notes (AAAI 2024 Tutorial): https://sites.google.com/view/zo-tutorial-aaai-2024/
What is a zeroth-order optimizer, and why does it matter?
A zeroth-order optimizer (Zeroth-Order Optimization) relies only on the network's outputs to estimate gradients; it is known for requiring no back-propagation at all and for its very small memory consumption. Although several different gradient estimators exist in the zeroth-order literature, this article refers specifically to the family of algorithms based on the random gradient estimator (Random Gradient Estimator, RGE). In short, a finite difference computed along random perturbations drawn from a Gaussian distribution is used as an approximate estimate of the gradient; the RGE formula is shown below.
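For reference, the standard two-point form of the RGE used in this line of work [1, 2] can be written as

\[
\hat{\nabla}_{\theta}\,\mathcal{L}(\theta) \;=\; \frac{1}{q}\sum_{i=1}^{q} \frac{\mathcal{L}(\theta + \mu u_i) - \mathcal{L}(\theta - \mu u_i)}{2\mu}\, u_i, \qquad u_i \sim \mathcal{N}(0, I),
\]

where \( \mu > 0 \) is the perturbation scale and \( q \) is the number of random directions (function-query pairs); \( q = 1 \) in the simplest case.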

Before this, zeroth-order optimization had already been widely used in machine learning problems such as adversarial example generation and defense, black-box model explanation, reinforcement learning, and automated machine learning; see [1] for a detailed introduction to the algorithms and applications. In the large-model field, MeZO [2] first proposed using zeroth-order stochastic gradient descent (ZO-SGD) to fine-tune large models and demonstrated the great potential of zeroth-order optimizers. At the same time, ZO-SGD is only the simplest and most basic BP-free optimizer; whether its many more advanced variants [3] can bring further surprises to large-model fine-tuning is a question in urgent need of study. This article systematically evaluates the accuracy, efficiency, and compatibility of the following back-propagation-free (BP-free) optimization algorithms on large-model fine-tuning tasks, aiming to show the community the breadth of potential that zeroth-order optimizers have across a variety of large-model tasks (a minimal sketch of the basic ZO-SGD update appears after the list below):

  • ZO-SGD: zeroth-order stochastic gradient descent [4]
  • ZO-SGD-Sign: sign-based zeroth-order stochastic gradient descent [5]
  • ZO-SGD-MMT: zeroth-order stochastic gradient descent with momentum [6]
  • ZO-SGD-Cons: zeroth-order stochastic gradient descent with conservative gradient updates [7]
  • ZO-Adam: zeroth-order Adam optimizer [8]
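
To make the basic recipe concrete, here is a minimal sketch of one ZO-SGD step using the two-point RGE above. It is illustrative only and not the benchmark's released code; the function names (zo_sgd_step, loss_fn) and the hyperparameter values are placeholder assumptions.

```python
import torch

@torch.no_grad()                       # no autograd graph is ever built
def zo_sgd_step(params, loss_fn, lr=1e-6, mu=1e-3, q=1):
    """One ZO-SGD update: estimate the gradient with the two-point RGE
    (q random Gaussian directions), then take a plain SGD step."""
    grads = [torch.zeros_like(p) for p in params]
    for _ in range(q):
        u = [torch.randn_like(p) for p in params]   # random direction
        for p, ui in zip(params, u):                # theta + mu * u
            p.add_(mu * ui)
        loss_plus = loss_fn()
        for p, ui in zip(params, u):                # theta - mu * u
            p.sub_(2 * mu * ui)
        loss_minus = loss_fn()
        for p, ui in zip(params, u):                # restore theta
            p.add_(mu * ui)
        coeff = (loss_plus - loss_minus) / (2 * mu * q)
        for g, ui in zip(grads, u):                 # accumulate the RGE estimate
            g.add_(coeff * ui)
    for p, g in zip(params, grads):                 # plain SGD update
        p.sub_(lr * g)
```

Because only forward evaluations of the loss are needed, no computation graph or activations are stored and memory stays close to that of inference; MeZO [2] additionally avoids materializing u by regenerating it from a saved random seed, a trick this sketch omits for clarity.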

This study also includes the Forward-Grad method [9], which gives an unbiased estimate of the gradient from directional derivatives along random direction vectors. It is worth noting that although Forward-Grad does not use gradient back-propagation, it still relies on automatic differentiation (in forward mode), so it is classified as a first-order BP-free algorithm.

In summary, the evaluation covers the five zeroth-order optimizers above plus the Forward-Grad method, compared against the most commonly used first-order optimizers, FO-SGD and FO-Adam. In terms of fine-tuning settings, it comprehensively covers 5 LLM architectures (RoBERTa, OPT, LLaMA, Vicuna, Mistral), 3 tasks of different complexity (SST-2, COPA, WinoGrande), and 4 fine-tuning schemes (full fine-tuning, LoRA, prompt tuning, prefix tuning).

Large model fine-tuning accuracy evaluation

The authors point out that to fine-tune large models effectively with a zeroth-order optimizer on downstream tasks, the input template must be chosen carefully so that the downstream task is aligned with the pre-training task. For example, for SST-2, using the template "SENTENCE. It was [terrible|great]." brings about a 10% performance improvement for ZO-SGD, whereas for first-order optimizers (such as FO-SGD) the performance difference with or without the template is not significant. This highlights a peculiarity of zeroth-order optimizers.
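
As a toy illustration (not the benchmark's code; the label-to-word mapping is an assumption based on the template quoted above), an SST-2 example can be rewritten so that the task reduces to predicting a single answer word, matching the pre-training objective:

```python
# Hypothetical helper: wrap an SST-2 example in the verbalized template
# "SENTENCE. It was [terrible|great]." so that fine-tuning looks like
# next-word prediction, the objective the model was pre-trained on.
VERBALIZER = {0: "terrible", 1: "great"}  # label id -> answer word (assumed mapping)

def build_prompt(sentence: str, label: int) -> str:
    return f"{sentence} It was {VERBALIZER[label]}."

print(build_prompt("A deliciously funny film.", 1))
# -> "A deliciously funny film. It was great."
```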

SST-2, as a relatively basic task, yields experimental results that support the following conclusions:

  • ZO-Adam seems to be the most effective zeroth-order optimizer, achieving the best performance in 4 of the 8 fine-tuning settings.
  • Forward-Grad is a competitive but previously overlooked method, especially in full fine-tuning.
  • ZO-SGD-Cons and ZO-SGD-MMT also demonstrate strong performance, while ZO-SGD-Sign, as the simplest zeroth-order optimizer, is often the weakest method.

Further, the study uses the larger OPT-13B model to run experiments on the more complex and difficult tasks (COPA and WinoGrande), reaching the following conclusions:

  • On more complex tasks, the performance differences between the optimizers are further magnified.
  • ZO-Adam and ZO-SGD-MMT demonstrate very good stability across the experiments, which may be attributed to their variance-reduction designs.
  • LoRA fine-tuning consistently shows strong robustness to the choice of zeroth-order algorithm, remaining stable and reliable across experimental environments.

Memory overhead of large model fine-tuning: evaluation and analysis

Taking fine-tuning of the OPT-13B model on the MultiRC dataset as an example, the authors further compare and analyze the memory and time costs of the different zeroth-order and first-order optimizers. First, in terms of memory efficiency, ZO-SGD, ZO-SGD-Cons, and ZO-SGD-Sign show similarly high memory efficiency, requiring only a single A100 GPU to fine-tune the large language model. This is not surprising, since these zeroth-order optimizers use relatively simple update steps that rely mainly on the RGE gradient estimator. Second, Forward-Grad appears to mark the tipping point at which zeroth-order methods surpass first-order methods in memory efficiency (e.g., compared with ZO-Adam). Finally, compared with first-order methods, zeroth-order optimization reduces the per-iteration running time by about 41.9% (taking ZO-SGD vs. FO-SGD as an example).

The authors further compare the memory efficiency of ZO-SGD and FO-SGD at different sequence lengths. The memory consumption of ZO-SGD stays constant, because its peak memory is determined only by the model parameter size. In contrast, as the sequence length increases, the peak memory consumption of FO-SGD first stays flat and then starts to grow. In the long-context setting, ZO-SGD therefore shows an even clearer memory-efficiency advantage. For the theoretical and measured memory values, please refer to the original paper.

Three improved algorithms to enhance zeroth-order optimizers

Zeroth-order optimizers suffer from limited convergence efficiency when applied to LLMs, mainly because of the large variance of their gradient estimates. To further strengthen them, the authors propose three advanced algorithms from the perspective of reducing the variance of the gradient estimate: block-wise ZO fine-tuning, hybrid ZO and FO fine-tuning, and sparsity-induced ZO gradient estimation.

Block-wise ZO fine-tuning

The main idea of this method is that if the zeroth-order optimizer perturbs the parameter blocks of the LLM separately when estimating the gradient, the variance of each gradient estimate is reduced because each sub-problem is smaller, which improves optimization performance. The advantage is a more accurate gradient estimate; the cost is that more forward passes are needed to complete one gradient estimate. For example, OPT-1.3B can be divided into 26 parameter blocks (24 Transformer layers, the embedding layer, and the LM classification head), so the zeroth-order optimizer performs 26 forward-pass estimates each time it computes the model gradient. For a fair comparison between ZO-SGD and ZO-SGD-Block, the authors also evaluate another ZO-SGD variant that perturbs the full model each time and averages the gradient estimate over multiple perturbations (e.g., 26 for OPT-1.3B), so that the number of forward passes in the comparison is the same. Experimental results on OPT-1.3B show that ZO-SGD-Block clearly outperforms both ZO-SGD variants.
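
A hedged sketch of the block-wise estimator is given below. It is not the authors' implementation: how the parameters are grouped into blocks is left to the caller, and this two-point version spends two forward passes per block.

```python
import torch

@torch.no_grad()
def zo_blockwise_grads(blocks, loss_fn, mu=1e-3):
    """Estimate the gradient one parameter block at a time.
    `blocks` is a list of parameter lists (e.g. one per Transformer layer,
    plus the embedding and LM-head blocks); `loss_fn` runs one forward pass."""
    grads = []
    for block in blocks:
        u = [torch.randn_like(p) for p in block]    # perturb only this block
        for p, ui in zip(block, u):                 # theta_b + mu * u
            p.add_(mu * ui)
        loss_plus = loss_fn()
        for p, ui in zip(block, u):                 # theta_b - mu * u
            p.sub_(2 * mu * ui)
        loss_minus = loss_fn()
        for p, ui in zip(block, u):                 # restore theta_b
            p.add_(mu * ui)
        coeff = (loss_plus - loss_minus) / (2 * mu)
        grads.append([coeff * ui for ui in u])      # block-local RGE estimate
    return grads   # feed these into any ZO update rule (SGD, Adam, ...)
```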

Hybrid ZO and FO fine-tuning

Back-propagation (BP) computes weight gradients layer by layer, from the deepest layers of the network back toward the shallowest. Zeroth-order optimizers hold a far greater memory advantage than traditional first-order optimizers, but first-order optimizers usually perform better, so combining the two achieves a trade-off between memory usage and performance. Specifically, for the deeper layers a first-order optimizer can compute exact gradients through back-propagation, while for the shallow layers a zeroth-order optimizer is used for gradient estimation. Experimental results show that using a zeroth-order optimizer for the shallow part (e.g., the first 8 of OPT-1.3B's 24 layers) and a first-order optimizer for the remaining deep part saves roughly one third of the GPU memory while achieving the same performance level as using a first-order optimizer throughout.
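
Below is a hedged sketch of one hybrid step, not the authors' implementation. It assumes the caller has already split the parameters into a shallow list (e.g., the embedding layer plus the first 8 Transformer layers) and a deep list (the remaining layers and the LM head), and that loss_fn runs a forward pass on the current mini-batch.

```python
import torch

@torch.no_grad()
def _zo_update(params, loss_fn, lr, mu):
    """Two-point RGE estimate followed by an SGD step for `params` only."""
    u = [torch.randn_like(p) for p in params]
    for p, ui in zip(params, u):            # theta + mu * u
        p.add_(mu * ui)
    loss_plus = loss_fn()
    for p, ui in zip(params, u):            # theta - mu * u
        p.sub_(2 * mu * ui)
    loss_minus = loss_fn()
    for p, ui in zip(params, u):            # restore theta
        p.add_(mu * ui)
    coeff = (loss_plus - loss_minus) / (2 * mu)
    for p, ui in zip(params, u):
        p.sub_(lr * coeff * ui)

def hybrid_step(shallow, deep, loss_fn, lr=1e-6, mu=1e-3):
    """One hybrid update: exact gradients (backprop) for the deep layers,
    RGE estimates (forward passes only) for the shallow layers."""
    # First-order part: with the shallow parameters frozen, backprop stops at
    # the deep layers, so the shallow activations need not be kept for backward.
    for p in shallow:
        p.requires_grad_(False)
    for p in deep:
        p.requires_grad_(True)
    loss = loss_fn()
    loss.backward()
    with torch.no_grad():
        for p in deep:
            p.sub_(lr * p.grad)
            p.grad = None
    # Zeroth-order part: update the shallow layers without back-propagation.
    _zo_update(shallow, loss_fn, lr, mu)
```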

Zeroth-order optimizer with sparse gradients (ZO with gradient pruning)

In first-order optimizers, gradient pruning is usually used to speed up training; in zeroth-order optimizers, the sparsity introduced by gradient pruning can further reduce the variance of the gradient estimates and thereby improve performance. The paper studies applying a magnitude-based pruning strategy to obtain a sparsity rate for each layer, then generating random sparse gradient masks at those rates and applying them to the random perturbations used in gradient estimation. Experimental results show that a moderate gradient sparsity (around 20%) brings a certain degree of performance improvement to the zeroth-order optimizer.
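
A hedged sketch of the sparsity-induced estimator follows. For simplicity it uses a single uniform sparsity rate and draws a fresh random mask on every call, whereas the paper derives per-layer sparsity rates from magnitude-based pruning; the function and argument names are placeholders.

```python
import torch

@torch.no_grad()
def sparse_rge(params, loss_fn, sparsity=0.2, mu=1e-3):
    """Two-point RGE with sparse random perturbations: a fraction `sparsity`
    of each direction's entries is zeroed out, which reduces the variance
    of the resulting gradient estimate."""
    u = []
    for p in params:
        mask = (torch.rand_like(p) >= sparsity).to(p.dtype)  # keep ~(1 - sparsity)
        u.append(torch.randn_like(p) * mask)
    for p, ui in zip(params, u):                 # theta + mu * u
        p.add_(mu * ui)
    loss_plus = loss_fn()
    for p, ui in zip(params, u):                 # theta - mu * u
        p.sub_(2 * mu * ui)
    loss_minus = loss_fn()
    for p, ui in zip(params, u):                 # restore theta
        p.add_(mu * ui)
    coeff = (loss_plus - loss_minus) / (2 * mu)
    return [coeff * ui for ui in u]              # sparse gradient estimate
```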

Conclusion

In this paper, we demonstrate the effective application of zeroth-order optimizers to fine-tuning large language models. By using loss differences to approximate gradients, zeroth-order methods avoid back-propagation and activation storage, greatly saving memory. By expanding the scope of existing research to cover different zeroth-order methods, task types, and evaluation metrics, we conduct the first systematic benchmark study of zeroth-order optimization techniques for LLM fine-tuning. Our study not only reveals how these methods perform in terms of accuracy and efficiency, but also provides insights into the critical roles of task alignment and forward gradients. Based on these experimental analyses, we propose techniques such as block-wise optimization, hybrid zeroth-order and first-order training, and gradient sparsification to further enhance zeroth-order-based fine-tuning of large models. These improvements are designed to raise fine-tuning accuracy while maintaining memory efficiency.

We firmly believe that these findings and techniques can greatly reduce the hardware requirements of large-model research, making large-model fine-tuning feasible on low-end GPUs, thereby further advancing academic research and producing practical, valuable impact in industry. We encourage researchers and developers to follow our results and explore the further possibilities of ZO optimization. Future work will continue to investigate the deeper issues in this area to unlock more of the potential of LLM fine-tuning.

Please refer to the paper and the GitHub repository for more information and resources.

References:
[1] Liu, et al. "A Primer on Zeroth-Order Optimization in Signal Processing and Machine Learning." IEEE Signal Processing Magazine 37, no. 5 (2020): 43-54.
[2] Malladi, et al. "Fine-Tuning Language Models with Just Forward Passes." NeurIPS 2023.
[3] Liu, et al. "A Primer on Zeroth-Order Optimization in Signal Processing and Machine Learning." IEEE Signal Processing Magazine.
[4] Ghadimi, et al. "Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming."
[5] Liu, et al. "signSGD via Zeroth-Order Oracle." ICLR 2019.
[6] Huang, et al. "Accelerated Zeroth-Order and First-Order Momentum Methods from Mini to Minimax Optimization." JMLR 2022.
[7] Kim, et al. "Curvature-Aware Derivative-Free Optimization."
[8] Chen, et al. "ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization."
[9] Baydin, et al. "Gradients without Backpropagation."
