Table of Contents
Method introduction
Experimental results

65 billion parameters, 8 GPUs can fine-tune all parameters: Qiu Xipeng's team has lowered the threshold for large models

Jun 20, 2023, 03:57 PM

In the large-model arena, technology giants are training ever larger models, while academia is looking for ways to optimize them. Recently, work on reducing the computing power required has reached a new level.

Large language models (LLMs) have revolutionized natural language processing (NLP), demonstrating remarkable capabilities such as emergent abilities and grokking. However, building a model with a degree of general capability requires billions of parameters, which greatly raises the threshold for NLP research. Tuning an LLM typically requires expensive GPU resources, such as a machine with 8×80GB GPUs, making it difficult for small laboratories and companies to participate in research in this area.

Recently, parameter-efficient fine-tuning (PEFT) techniques such as LoRA and prefix-tuning have offered ways to tune LLMs with limited resources. However, these methods do not provide a practical path to full-parameter fine-tuning, which is widely regarded as more powerful than parameter-efficient fine-tuning.

In the paper "Full Parameter Fine-tuning for Large Language Models with Limited Resources", submitted last week by Qiu Xipeng's team at Fudan University, the researchers propose a new optimizer, LOw-Memory Optimization (LOMO).

By integrating LOMO with existing memory-saving techniques, the new approach reduces memory usage to 10.8% of the standard approach (the DeepSpeed solution). As a result, it enables full-parameter fine-tuning of a 65B model on a machine with 8 RTX 3090 GPUs, each with 24GB of memory.


Paper link: https://arxiv.org/abs/2306.09782

In this work, the authors analyze four sources of memory usage in LLM training: activations, optimizer states, gradient tensors, and parameters, and optimize the training process in three respects:

  1. They rethink the role of the optimizer from an algorithmic perspective and find that SGD is a good alternative for full-parameter fine-tuning of LLMs. Since SGD stores no intermediate state, this allows the optimizer state to be discarded entirely.
  2. The newly proposed optimizer, LOMO, reduces the memory usage of gradient tensors to O(1): only memory equivalent to the largest single gradient tensor is required.
  3. To stabilize mixed-precision training with LOMO, the authors integrate gradient normalization and loss scaling, and switch certain computations to full precision during training (a sketch of how loss scaling can be folded into the fused update follows the next paragraph).

The new technique makes memory usage equal to that of the parameters plus the activations and the largest gradient tensor. This pushes the memory cost of full-parameter fine-tuning to the extreme, making it comparable to that of inference alone; it cannot go lower, because the memory footprint of the combined forward and backward passes can never be smaller than that of the forward pass by itself. It is worth noting that these memory savings do not compromise the fine-tuning process: the parameter update remains equivalent to SGD.
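To make point 3 above more concrete, the snippet below is a minimal, hypothetical sketch of how dynamic loss scaling could be combined with an SGD update applied inside per-parameter backward hooks (the hook mechanism itself is explained in the method section). The class name, constants, and overflow handling are illustrative assumptions, not the authors' implementation.

```python
import torch

class FusedScaledSGD:
    """Minimal sketch of dynamic loss scaling combined with an SGD update
    applied inside per-parameter backward hooks (PyTorch >= 2.1). Names,
    constants, and overflow handling are illustrative assumptions only."""

    def __init__(self, params, lr=1e-3, init_scale=2.0 ** 16):
        self.lr = lr
        self.scale = init_scale
        self.found_overflow = False
        for p in params:
            if p.requires_grad:
                p.register_post_accumulate_grad_hook(self._hook)

    def _hook(self, param):
        # Unscale the fp16 gradient in fp32 before deciding whether to apply it.
        grad = param.grad.float() / self.scale
        if torch.isfinite(grad).all():
            with torch.no_grad():
                param.add_(grad.to(param.dtype), alpha=-self.lr)
        else:
            # A full implementation must also skip or undo updates already
            # applied earlier in this backward pass, e.g. via an extra check.
            self.found_overflow = True
        param.grad = None  # free the gradient right away

    def scale_loss(self, loss):
        self.found_overflow = False
        return loss * self.scale

    def update_scale(self):
        # Simple dynamic scheme: shrink on overflow, otherwise grow slowly.
        self.scale = self.scale * 0.5 if self.found_overflow else self.scale * 1.001
```

A training step under these assumptions would call scale_loss(loss).backward() and then update_scale(); the parameter updates themselves happen inside the hooks during the backward pass.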

The study evaluates the memory usage and throughput of LOMO and shows that, with LOMO, researchers can train a 65B-parameter model on 8 RTX 3090 GPUs. Furthermore, to verify LOMO's performance on downstream tasks, the authors apply it to full-parameter tuning of LLMs on the SuperGLUE benchmark collection. The results demonstrate the effectiveness of LOMO for optimizing LLMs with billions of parameters.

Method introduction

The method section introduces LOMO (LOw-Memory Optimization) in detail. In general, a gradient tensor represents the gradient of a parameter tensor and has the same size as that parameter, so gradients incur a large memory overhead. Existing deep learning frameworks such as PyTorch store the gradient tensors of all parameters. There are two reasons for storing them: to compute the optimizer state and to normalize the gradients.

Since this study adopts SGD as the optimizer, there is no gradient-dependent optimizer state, and the authors propose alternatives to gradient normalization.

They proposed LOMO, as shown in Algorithm 1, which fuses gradient calculation and parameter update in one step, thus avoiding the storage of gradient tensors.

The following figure compares SGD and LOMO in the backpropagation and parameter-update stages, where Pi denotes a model parameter and Gi the gradient corresponding to Pi. LOMO merges gradient computation and the parameter update into a single step so that the memory held by gradient tensors is kept to a minimum.


The corresponding pseudocode is given as Algorithm 1 in the paper.

Specifically, the study writes vanilla gradient descent as the two-step process

  grad = ∂L/∂p,   p = p − lr × grad,

that is, first compute the gradient and then update the parameters. The fused version is

  p = p − lr × ∂L/∂p,

which performs both in a single step.
The key idea of this research is to update each parameter immediately when its gradient is computed, so that gradient tensors never need to be stored in memory. This can be achieved by injecting hook functions into backpropagation. PyTorch provides APIs for registering such hooks, but the current API cannot express a perfectly precise immediate update. Instead, the study keeps at most one parameter's gradient in memory and updates the parameters one by one as backpropagation proceeds. This reduces the gradient memory from the gradients of all parameters to the gradient of a single parameter.
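As an illustration of this idea, here is a minimal sketch (not the authors' code) of fusing an SGD update into backpropagation with PyTorch's per-parameter hooks. It assumes PyTorch 2.1+ for register_post_accumulate_grad_hook, which fires right after each parameter's gradient has been accumulated, so the gradient can be consumed and freed immediately.

```python
import torch

def attach_fused_sgd_hooks(model: torch.nn.Module, lr: float = 1e-3) -> None:
    """Sketch of fusing the SGD update into backpropagation.

    Each hook fires right after a parameter's gradient is accumulated,
    applies the update in place, and frees the gradient, so at most one
    full gradient tensor is alive at any moment. Assumes PyTorch >= 2.1;
    an illustration of the idea, not the authors' exact implementation.
    """
    def hook(param: torch.Tensor) -> None:
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)   # p <- p - lr * grad
        param.grad = None                       # drop the gradient immediately

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)


# Usage: attach the hooks once; afterwards there is no optimizer.step(),
# because every parameter is updated inside loss.backward() itself.
model = torch.nn.Linear(1024, 1024)
attach_fused_sgd_hooks(model, lr=1e-2)
loss = model(torch.randn(4, 1024)).pow(2).mean()
loss.backward()
```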

Most of LOMO's memory usage coincides with that of parameter-efficient fine-tuning (PEFT) methods, so combining LOMO with these methods adds only a small amount of gradient memory. This makes it possible to tune far more parameters alongside a PEFT method.

Experimental results

In the experiments, the researchers evaluate the proposed method from three aspects: memory usage, throughput, and downstream performance. Unless otherwise stated, all experiments use LLaMA models ranging from 7B to 65B parameters.

Memory usage

The researchers first analyze the memory usage of model states and activations during training. As shown in Table 1, compared with the AdamW optimizer, the LOMO optimizer reduces memory usage dramatically, from 102.20GB to 14.58GB; compared with SGD, memory usage when training LLaMA-7B drops from 51.99GB to 14.58GB. The reduction comes primarily from the lower memory requirements of gradients and optimizer states. As a result, memory during training is mostly occupied by the parameters, comparable to memory usage during inference.
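A back-of-envelope calculation shows why numbers of this scale are plausible. Assuming roughly 6.7B parameters, fp16 weights and gradients, and fp32 optimizer states (the paper's exact bookkeeping may differ slightly):

```python
# Rough sanity check of the Table 1 scale for LLaMA-7B under common
# mixed-precision assumptions (not the paper's exact accounting).
n_params = 6.7e9
GB = 1e9

fp16 = 2 * n_params / GB        # one fp16 copy (weights or gradients), ~13.4 GB
fp32 = 4 * n_params / GB        # one fp32 state tensor, ~26.8 GB

adamw = fp16 + fp16 + 3 * fp32  # weights + grads + fp32 master/momentum/variance
sgd   = fp16 + fp16 + fp32      # weights + grads + fp32 master copy
lomo  = fp16                    # weights only; gradients are consumed on the fly

print(f"AdamW ~{adamw:.0f} GB, SGD ~{sgd:.0f} GB, LOMO ~{lomo:.0f} GB")
# ~107 GB, ~54 GB, ~13 GB -- in line with the reported 102.20 / 51.99 / 14.58 GB
# once activations and rounding differences are taken into account.
```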


As shown in Figure 2, when the AdamW optimizer is used to train LLaMA-7B, a considerable proportion of memory (73.7%) is allocated to the optimizer state. Replacing AdamW with SGD effectively reduces the share of memory occupied by optimizer state, lowering GPU memory usage from 102.20GB to 51.99GB. With LOMO, the parameter update and the backward pass are fused into a single step, which further removes the memory required for gradient tensors.


Throughput

The researchers compared the throughput of LOMO, AdamW, and SGD. Experiments were conducted on a server equipped with 8 RTX 3090 GPUs.

For the 7B model, LOMO's throughput shows a clear advantage, exceeding that of AdamW and SGD by about 11 times. This large improvement can be attributed to LOMO's ability to train the 7B model on a single GPU, which eliminates inter-GPU communication overhead. The slightly higher throughput of SGD compared with AdamW comes from SGD skipping the momentum and variance computations.

For the 13B model, memory limits make it impossible to train with AdamW on the existing 8 RTX 3090 GPUs. In this case, model parallelism is necessary for LOMO, which still outperforms SGD in throughput. The advantage comes from LOMO's memory efficiency: only two GPUs are needed to train the model under the same settings, which reduces communication cost and improves throughput. In addition, SGD ran out of memory (OOM) on 8 RTX 3090 GPUs when training the 30B model, whereas LOMO performed well with only 4 GPUs.

Finally, the researchers successfully trained the 65B model on 8 RTX 3090 GPUs, achieving a throughput of 4.93 TGS (tokens per GPU per second). With this server configuration and LOMO, training on 1,000 samples, each containing 512 tokens, takes approximately 3.6 hours.
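Interpreting TGS as tokens per GPU per second, these figures are self-consistent, as the quick check below shows:

```python
# Consistency check of the reported 65B throughput, assuming TGS means
# tokens per GPU per second.
tgs = 4.93                    # tokens per GPU per second
gpus = 8
tokens = 1000 * 512           # 1,000 samples of 512 tokens each

hours = tokens / (tgs * gpus) / 3600
print(f"{hours:.1f} hours")   # ~3.6 hours, matching the reported training time
```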

Downstream Performance

To evaluate the effectiveness of LOMO for fine-tuning large language models, the researchers conducted an extensive series of experiments. They compared LOMO with two other methods: Zero-shot, which requires no fine-tuning, and LoRA, a popular parameter-efficient fine-tuning technique.


The results in Table 3 show that:
  • LOMO performs significantly better than Zero-shot;
  • in most experiments, LOMO generally outperforms LoRA;
  • LOMO scales effectively to the 65-billion-parameter model.

LOMO and LoRA are essentially independent of each other. To verify this statement, the researchers conducted experiments on the BoolQ and MultiRC datasets using LLaMA-13B. The results are shown in Figure 3.

They found that LOMO consistently improves on LoRA's performance, no matter how strong LoRA's results already are. This indicates that the two fine-tuning approaches are complementary: LOMO fine-tunes the weights of the pre-trained model, while LoRA adjusts additional modules. LOMO therefore does not interfere with LoRA; rather, it facilitates better tuning of the model for downstream tasks.
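To make the complementarity concrete, below is a toy, self-contained sketch (not the paper's code, and not the peft library) in which a hand-rolled LoRA layer is trained with an ordinary optimizer while the base weight is updated by a fused SGD hook in the LOMO style. Module and variable names are illustrative; PyTorch 2.1+ is assumed.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: a base linear weight plus a low-rank update
    (illustrative only; not the peft library's implementation)."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.lora_a = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)


def fused_sgd_hook(lr):
    # Update the full-parameter weight inside backward and drop its gradient,
    # in the LOMO style (requires PyTorch >= 2.1).
    def hook(param):
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)
        param.grad = None
    return hook


model = LoRALinear(512, 512)

# Full-parameter path: the base weight is updated during backward itself.
model.base.weight.register_post_accumulate_grad_hook(fused_sgd_hook(lr=1e-3))

# PEFT path: the small LoRA matrices are updated by an ordinary optimizer.
opt = torch.optim.AdamW([model.lora_a, model.lora_b], lr=1e-4)

x = torch.randn(4, 512)
loss = model(x).pow(2).mean()
loss.backward()   # base weight already updated here by the hook
opt.step()        # LoRA matrices updated here
opt.zero_grad()
```

In this sketch the gradient of the large base weight is never retained, while the small LoRA matrices keep their gradients and optimizer state, which stays cheap because they contain few parameters.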


See the original paper for more details.
