The deployment of large language models generally follows a "pretrain, then fine-tune" paradigm. However, when the base model is fine-tuned for many tasks (such as personalized assistants), the cost of training and serving becomes very high. Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method that is typically used to adapt a base model to many tasks, producing a large number of derived LoRA adapters.
This paradigm offers significant opportunities for batched inference during serving: by fine-tuning only the adapter weights, LoRA has been shown to achieve performance comparable to full fine-tuning. However, serving each adapter separately, with requests executed serially across adapters, only delivers low latency for a single adapter; when many adapters must be served at once, it significantly reduces overall throughput and increases overall latency. How to serve these fine-tuned variants at scale therefore remains an open problem.
Recently, researchers from UC Berkeley, Stanford, and other institutions proposed a new method called S-LoRA in a paper.
S-LoRA is a system designed for scalable serving of many LoRA adapters: it stores all adapters in main memory and fetches the adapters used by the currently running queries into GPU memory.
S-LoRA proposes "Unified Paging", which uses a unified memory pool to manage dynamic adapter weights of different ranks and KV cache tensors of different sequence lengths. Additionally, S-LoRA employs a new tensor parallelism strategy and highly optimized custom CUDA kernels to enable heterogeneous batching of LoRA computations.
These features allow S-LoRA to serve thousands of LoRA adapters (e.g., 2,000 adapters simultaneously) on a single GPU or across multiple GPUs with small overhead, keeping the additional cost of the LoRA computation to a minimum. In contrast, vLLM-packed has to maintain multiple copies of the weights and can serve fewer than 5 adapters due to GPU memory limitations.
Compared with state-of-the-art libraries such as HuggingFace PEFT and vLLM (with only naive LoRA serving support), S-LoRA can increase throughput by up to 4x and raise the number of served adapters by several orders of magnitude. S-LoRA can therefore provide scalable serving for many task-specific fine-tuned models and offers the potential for large-scale, customized fine-tuning services.
S-LoRA contains three main innovations. Section 4 of the paper introduces the batching strategy that decomposes the computation between the base model and the LoRA adapters. The researchers also address the challenges of request scheduling, including adapter clustering and admission control. The ability to batch across concurrent adapters brings new challenges for memory management. In Section 5, the researchers generalize PagedAttention to Unified Paging to support dynamic loading of LoRA adapters. This approach uses a unified memory pool to store the KV cache and adapter weights in a paged manner, which reduces fragmentation and accommodates the dynamically changing sizes of the KV cache and adapter weights. Finally, Section 6 introduces a new tensor parallelism strategy that efficiently decouples the base model and the LoRA adapters.
The following is the key content:
For a single adapter, Hu et al. (2021) recommended merging the adapter weights with the base model weights to obtain a new model (see Equation 1). The benefit is that there is no additional adapter overhead during inference, since the new model has the same number of parameters as the base model. In fact, this was a distinctive feature of the original LoRA work.
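Written out in the standard LoRA notation (a reconstruction of what Equation 1 expresses, where W is a base weight matrix and A, B are the low-rank factors of rank r), the merged form is:

```latex
W' = W + AB, \qquad h = xW', \qquad A \in \mathbb{R}^{h \times r}, \; B \in \mathbb{R}^{r \times h}
```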
The paper points out that merging the LoRA adapter into the base model is inefficient for multi-LoRA, high-throughput serving setups. Instead, the researchers propose to compute the LoRA contribution xAB on the fly (as shown in Equation 2).
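In the same notation, the on-the-fly form keeps the base weights untouched and adds the low-rank term separately (again a reconstruction of the equation the article refers to):

```latex
h = xW + xAB
```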
In S-LoRA, the base model computation is batched, and the additional xAB terms are then computed per adapter using custom CUDA kernels. This process is illustrated in Figure 1 of the paper. Instead of padding and using batched GEMM kernels from a BLAS library to compute LoRA, the researchers implemented custom CUDA kernels that compute it more efficiently without padding; implementation details are in Subsection 5.3.
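The effect of these kernels can be sketched in plain PyTorch. The gather-based loop below only illustrates the padding-free, per-adapter computation; the fused CUDA implementation in the paper is far more efficient, and all names, shapes, and ranks here are hypothetical.

```python
import torch

def batched_lora_delta(x, lora_As, lora_Bs, adapter_ids):
    """Illustrative, non-fused version of heterogeneous batched LoRA.

    x           : (num_tokens, hidden) input for the whole mixed batch
    lora_As     : list of (hidden, r_i) tensors, one per adapter (ranks may differ)
    lora_Bs     : list of (r_i, hidden) tensors, one per adapter
    adapter_ids : (num_tokens,) long tensor mapping each token to its adapter
    Returns the per-token delta x @ A_i @ B_i with no padding to a common rank.
    """
    delta = torch.zeros_like(x)
    for i in range(len(lora_As)):
        mask = adapter_ids == i
        if mask.any():
            # Only the tokens that belong to adapter i touch its weights.
            delta[mask] = (x[mask] @ lora_As[i]) @ lora_Bs[i]
    return delta

# Toy usage: 3 adapters with different ranks serving one mixed batch.
hidden = 64
x = torch.randn(10, hidden)
ranks = [4, 8, 16]
As = [torch.randn(hidden, r) for r in ranks]
Bs = [torch.randn(r, hidden) for r in ranks]
ids = torch.randint(0, 3, (10,))
h = x @ torch.randn(hidden, hidden) + batched_lora_delta(x, As, Bs, ids)
```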
The number of LoRA adapters stored in main memory can be large, but the number of adapters needed by the currently running batch is manageable, because the batch size is bounded by GPU memory. To take advantage of this, S-LoRA stores all LoRA adapters in main memory and, when running inference for the current batch, fetches only the adapters required by that batch into GPU memory. In this case, the maximum number of servable adapters is limited by the size of main memory. Figure 2 of the paper illustrates this process, and Section 5 discusses techniques for efficient memory management.
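A minimal host-to-GPU fetch loop in this spirit might look like the sketch below; the class name, eviction rule, and data layout are assumptions for illustration, not S-LoRA's actual implementation.

```python
import torch

class AdapterStore:
    """Sketch of fetch-per-batch: keep every adapter in host memory and copy
    only the ones the current batch needs onto the GPU (illustrative only)."""

    def __init__(self, adapters_cpu):
        # adapters_cpu: dict adapter_id -> (A, B) CPU tensors; pinning enables async copies.
        self.cpu = {k: (a.pin_memory(), b.pin_memory())
                    for k, (a, b) in adapters_cpu.items()}
        self.gpu = {}  # adapter_id -> (A, B) currently resident in GPU memory

    def prepare_batch(self, adapter_ids, device="cuda"):
        needed = set(adapter_ids)
        # Drop adapters that the current batch does not use.
        for k in list(self.gpu):
            if k not in needed:
                del self.gpu[k]
        # Copy in the missing ones; the count stays small because the batch is small.
        for k in needed:
            if k not in self.gpu:
                a, b = self.cpu[k]
                self.gpu[k] = (a.to(device, non_blocking=True),
                               b.to(device, non_blocking=True))
        return [self.gpu[k] for k in adapter_ids]
```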
Compared with serving a single base model, serving multiple LoRA adapters simultaneously brings new memory management challenges. To support multiple adapters, S-LoRA stores them in main memory and dynamically loads the adapter weights required by the currently running batch into GPU memory.
This process raises two obvious challenges. The first is memory fragmentation, caused by dynamically loading and unloading adapter weights of different sizes. The second is the latency overhead of adapter loading and unloading. To address these problems, the researchers propose "Unified Paging" and overlap I/O with computation by prefetching adapter weights.
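The overlap of adapter I/O with computation can be sketched with a side CUDA stream in PyTorch, as below; `next_adapters_cpu` and `run_decode_step` are hypothetical stand-ins, and this only illustrates the prefetching idea, not the system's actual scheduler.

```python
import torch

# Sketch of overlapping adapter I/O with computation on a side CUDA stream.
# `next_adapters_cpu` holds pinned host tensors for the next batch's adapters;
# `run_decode_step` is the current batch's forward pass.
def step_with_prefetch(current_batch, next_adapters_cpu, run_decode_step):
    copy_stream = torch.cuda.Stream()  # requires a CUDA device
    prefetched = []
    with torch.cuda.stream(copy_stream):
        # Host-to-device copies are issued on the side stream...
        for a, b in next_adapters_cpu:
            prefetched.append((a.to("cuda", non_blocking=True),
                               b.to("cuda", non_blocking=True)))
    # ...while the default stream keeps computing the current batch.
    out = run_decode_step(current_batch)
    # Ensure the copies have finished before the next batch touches the weights.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return out, prefetched
```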
Unified Paging
The researchers extend the idea of PagedAttention to Unified Paging, which manages not only the KV cache but also the adapter weights. Unified Paging uses a unified memory pool to manage both jointly. To achieve this, they first statically allocate a large buffer for the memory pool, using all available space except for the space occupied by the base model weights and temporary activation tensors. Both the KV cache and the adapter weights are stored in the pool in a paged manner, with each page corresponding to one vector of size H (the hidden dimension). A KV cache tensor for a sequence of length S therefore occupies S pages, while a rank-R LoRA weight tensor occupies R pages. Figure 3 of the paper shows the layout of the memory pool, where KV cache and adapter weights are stored interleaved and non-contiguously. This approach greatly reduces fragmentation and allows adapter weights of different ranks to coexist with the dynamically growing KV cache in a structured, systematic way.
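A toy allocator in this spirit is sketched below; the page bookkeeping, names, and error handling are illustrative assumptions, not the actual S-LoRA memory manager.

```python
class UnifiedPagePool:
    """Toy allocator in the spirit of Unified Paging: KV-cache entries and LoRA
    adapter rows draw pages from one shared free list. One page holds one
    H-sized vector. Names and bookkeeping are illustrative only."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.owners = {}  # page index -> ("kv", request_id) or ("lora", adapter_id)

    def alloc(self, owner, n_pages):
        if len(self.free_pages) < n_pages:
            raise MemoryError("pool exhausted: evict an adapter or preempt a request")
        pages = [self.free_pages.pop() for _ in range(n_pages)]
        for p in pages:
            self.owners[p] = owner
        return pages  # pages need not be contiguous, which avoids fragmentation

    def free(self, owner):
        for p in [q for q, o in self.owners.items() if o == owner]:
            del self.owners[p]
            self.free_pages.append(p)

# A sequence of length S takes S pages per KV tensor; a rank-R adapter takes R pages.
pool = UnifiedPagePool(num_pages=1024)
kv_pages = pool.alloc(("kv", "req-0"), n_pages=37)          # sequence length 37
lora_pages = pool.alloc(("lora", "adapter-3"), n_pages=16)  # rank-16 adapter
```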
In addition, the researchers designed a novel tensor parallelism strategy for batched LoRA inference to support multi-GPU inference of large Transformer models. Tensor parallelism is the most widely used parallelization approach because its single-program, multiple-data paradigm simplifies implementation and integration with existing systems, and it reduces per-GPU memory usage and latency when serving large models. In this setting, the additional LoRA adapters introduce new weight matrices and matrix multiplications, which call for new partitioning strategies.
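The intuition for why the added partitioning can stay cheap is captured by some back-of-the-envelope arithmetic: if the LoRA matrices are partitioned to align with the base model's layout, the extra collectives only need to move rank-sized intermediates. This is our reading of the general idea, with hypothetical numbers, not the paper's exact communication schedule.

```python
# Compare the volume of activations the base model already communicates per
# tensor-parallel collective with the rank-sized intermediates LoRA adds.
hidden = 4096   # hidden size of the base model
rank = 16       # LoRA rank
tokens = 256    # tokens in the current batch

base_comm = tokens * hidden  # base model moves h-sized activations
lora_comm = tokens * rank    # LoRA only adds r-sized intermediates

print(f"extra LoRA traffic ≈ {lora_comm / base_comm:.2%} of the base model's")  # ~0.39%
```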
Finally, the researchers evaluated S-LoRA by serving Llama-7B/13B/30B/70B.
The results show that S-LoRA can serve thousands of LoRA adapters on a single GPU or across multiple GPUs with very small overhead. S-LoRA achieves up to 30x higher throughput than HuggingFace PEFT, a state-of-the-art parameter-efficient fine-tuning library. Compared with vLLM, a high-throughput serving system with naive LoRA support, S-LoRA improves throughput by 4x and increases the number of served adapters by several orders of magnitude.
For more research details, please refer to the original paper.