S-LoRA: It is possible to run thousands of large models on one GPU
The deployment of large language models generally follows the "pretrain-then-finetune" paradigm. However, when the base model is fine-tuned for many tasks (such as personalized assistants), the cost of training and serving grows quickly. Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method commonly used to adapt a base model to many tasks, producing a large collection of LoRA adapters derived from one base model.
This paradigm presents significant opportunities for batched inference during serving, and fine-tuning only the adapter weights has been shown to achieve performance comparable to full fine-tuning. While merging each adapter into the base model enables low-latency inference for a single adapter and serial execution across adapters, it significantly reduces overall throughput and increases total latency when multiple adapters are served concurrently. How to serve these fine-tuned variants at scale therefore remains an open problem.
Recently, researchers from UC Berkeley, Stanford, and other institutions proposed a new serving system called S-LoRA in a paper.
- Paper address: https://arxiv.org/pdf/2311.03285.pdf
- Project address: https://github.com/S-LoRA/S-LoRA
S-LoRA is a system designed for scalable serving of many LoRA adapters, which stores all adapters in main memory and fetches the adapters used by the currently running queries into GPU memory.
S-LoRA proposes a "Unified Paging" technique, which uses a unified memory pool to manage dynamic adapter weights of different ranks and KV cache tensors of different sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels to enable heterogeneous batching of LoRA computations.
These features allow S-LoRA to serve thousands of LoRA adapters (for example, 2,000 adapters simultaneously) on a single GPU or multiple GPUs at small overhead, keeping the additional cost of LoRA computation to a minimum. In contrast, vLLM-packed must maintain multiple copies of the model weights and can serve fewer than 5 adapters due to GPU memory limitations.
Compared with state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive LoRA serving support), S-LoRA can increase throughput by up to 4x and raise the number of served adapters by several orders of magnitude. S-LoRA can therefore provide scalable serving for many task-specific fine-tuned models and offers the potential for large-scale, customized fine-tuning services.
S-LoRA contains three main innovations. Section 4 of the paper introduces the batching strategy, which decomposes the computation between the base model and the LoRA adapters. The researchers also address the challenges of request scheduling, including adapter clustering and admission control. The ability to batch across concurrent adapters brings new challenges for memory management. In Section 5, the researchers generalize PagedAttention to Unified Paging to support dynamically loading LoRA adapters. This approach uses a unified memory pool to store the KV cache and adapter weights in a paged manner, which reduces fragmentation and balances the dynamically changing sizes of the KV cache and adapter weights. Finally, Section 6 introduces a new tensor parallelism strategy that efficiently decouples the base model from the LoRA adapters.
The following is the key content:
Batch Processing
For a single adapter, Hu et al. (2021) recommend merging the adapter weights into the base model weights to obtain a new model (see Equation 1). The benefit is that there is no additional adapter overhead during inference, since the new model has the same number of parameters as the base model. In fact, this was a distinctive feature of the original LoRA work.
The paper points out that merging LoRA adapters into the base model is inefficient for a multi-LoRA, high-throughput serving setup. Instead, the researchers propose computing the LoRA term xAB on the fly (as shown in Equation 2).
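To make the two formulations concrete, here is a minimal PyTorch sketch (the names W, A, B, x and the dimensions are illustrative, not taken from the paper's code): Equation 1 merges the adapter into the base weight, h = x(W + AB), while Equation 2 keeps W shared and adds the low-rank term on the fly, h = xW + xAB.

```python
# Minimal sketch contrasting the two ways of applying a LoRA adapter.
# Shapes and names (W, A, B, rank r) follow the usual LoRA convention and are
# illustrative only, not S-LoRA's actual implementation.
import torch

h_dim, r, batch = 1024, 16, 8
W = torch.randn(h_dim, h_dim)          # frozen base weight
A = torch.randn(h_dim, r) * 0.01       # LoRA down-projection
B = torch.randn(r, h_dim) * 0.01       # LoRA up-projection
x = torch.randn(batch, h_dim)          # input activations

# Equation 1: merge the adapter into the base weight, then one matmul.
# Great for a single adapter, but every adapter needs its own merged copy of W.
W_merged = W + A @ B
h_merged = x @ W_merged

# Equation 2: keep W shared and add the low-rank term on the fly.
# The extra cost is two thin matmuls through the rank-r bottleneck.
h_unmerged = x @ W + (x @ A) @ B

print(torch.allclose(h_merged, h_unmerged, rtol=1e-3, atol=1e-2))  # same result
```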
In S-LoRA, the base model computation is batched, and then the additional xAB terms are computed for each adapter individually using custom CUDA kernels. This process is shown in Figure 1. Instead of padding the inputs and using batched GEMM kernels from a BLAS library to compute LoRA, S-LoRA implements custom CUDA kernels that compute it more efficiently without padding; implementation details are in Section 5.3 of the paper.
The number of LoRA adapters stored in main memory can be large, but the number of adapters needed by the currently running batch is controllable, since the batch size is bounded by GPU memory. To take advantage of this, S-LoRA stores all LoRA adapters in main memory and, when running inference for the current batch, fetches only the adapters required by that batch into GPU memory. In this case, the maximum number of servable adapters is limited by the size of main memory. Figure 2 illustrates this process, and Section 5 discusses techniques for efficient memory management.
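The snippet below sketches the semantics of this heterogeneous batching in plain PyTorch: one shared GEMM for the base model, then each request's own adapter applied via a gather over stacked adapter weights. The real system fuses the second step into a custom CUDA kernel without padding; the tensor names and the bmm-based formulation here only illustrate what that kernel computes.

```python
# Illustrative PyTorch version of heterogeneous LoRA batching: the base matmul
# is shared by all requests, while each request uses its own adapter's A and B.
import torch

h_dim, r, n_adapters, batch = 1024, 8, 4, 6
W = torch.randn(h_dim, h_dim)
# Stacked adapter weights, one (A, B) pair per adapter.
A = torch.randn(n_adapters, h_dim, r) * 0.01
B = torch.randn(n_adapters, r, h_dim) * 0.01

x = torch.randn(batch, h_dim)
adapter_ids = torch.tensor([0, 2, 2, 1, 3, 0])   # which adapter each request uses

# Base model part: one batched GEMM shared by every request.
h = x @ W

# LoRA part: gather each request's adapter and apply it individually.
A_sel = A[adapter_ids]          # (batch, h_dim, r)
B_sel = B[adapter_ids]          # (batch, r, h_dim)
lora_out = torch.bmm(torch.bmm(x.unsqueeze(1), A_sel), B_sel).squeeze(1)
h = h + lora_out
print(h.shape)                  # torch.Size([6, 1024])
```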
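A minimal sketch of this host-to-GPU fetching policy is shown below. The AdapterStore class and its prepare_batch method are hypothetical names introduced for illustration; the sketch assumes a CUDA device when used and is not S-LoRA's actual implementation.

```python
# Toy sketch: keep every adapter in host memory and copy onto the GPU only the
# adapters required by the currently running batch.
import torch

class AdapterStore:
    def __init__(self, adapters_cpu, device="cuda"):
        # adapters_cpu: dict mapping adapter_id -> dict of weight tensors in host memory
        self.cpu = adapters_cpu
        self.device = device
        self.gpu = {}                       # adapters currently resident on the device

    def prepare_batch(self, adapter_ids):
        needed = set(adapter_ids)
        # Evict adapters the current batch does not use to free device memory.
        for aid in list(self.gpu):
            if aid not in needed:
                del self.gpu[aid]
        # Copy any missing adapters host -> device.
        for aid in needed - set(self.gpu):
            self.gpu[aid] = {name: w.to(self.device, non_blocking=True)
                             for name, w in self.cpu[aid].items()}
        return self.gpu
```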
Memory management
Compared with serving a single base model, serving multiple LoRA adapters simultaneously introduces new memory management challenges. To support many adapters, S-LoRA stores them in main memory and dynamically loads the adapter weights needed by the currently running batch into GPU memory.
This process raises two obvious challenges. The first is memory fragmentation, caused by dynamically loading and unloading adapter weights of different sizes. The second is the latency overhead of loading and unloading adapters. To address these problems, the researchers propose "Unified Paging" and overlap I/O with computation by prefetching adapter weights.
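The sketch below illustrates the prefetching idea with a generic PyTorch side stream: the next batch's adapter is copied host-to-device while the current batch's forward pass runs on the default stream. The function and argument names are hypothetical, a CUDA device is assumed, and the copy only truly overlaps with compute when the host tensors are pinned; this is not S-LoRA's implementation.

```python
# Sketch of overlapping adapter I/O with computation via a side CUDA stream.
import torch

def run_step(model_forward, current_batch, next_adapter_cpu, copy_stream):
    # Launch the host->device copy of the *next* batch's adapter on a side stream.
    # (Pinned host memory is required for the copy to overlap with compute.)
    with torch.cuda.stream(copy_stream):
        next_adapter_gpu = {name: w.to("cuda", non_blocking=True)
                            for name, w in next_adapter_cpu.items()}
    # Meanwhile, the default stream runs the current batch's forward pass.
    output = model_forward(current_batch)
    # Ensure the copy has finished before the next step uses the adapter.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return output, next_adapter_gpu
```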
Unified Paging
The researchers extend the idea of PagedAttention into Unified Paging, which manages not only the KV cache but also the adapter weights. Unified Paging uses a single memory pool to jointly manage both. To achieve this, they first statically allocate a large buffer for the memory pool, using all available space except what is needed for the base model weights and temporary activation tensors. Both the KV cache and adapter weights are stored in the pool in a paged manner, with each page holding one vector of the hidden dimension H. A KV cache tensor of sequence length S therefore occupies S pages, while a rank-R LoRA weight tensor occupies R pages. Figure 3 shows the layout of the memory pool, where KV caches and adapter weights are stored in an interleaved, non-contiguous fashion. This approach greatly reduces fragmentation and allows adapter weights of different ranks to coexist with the dynamically changing KV cache in a structured, systematic way.
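Below is a toy sketch of such a unified, paged memory pool: a fixed set of pages, each holding one hidden-dimension vector, allocated non-contiguously to either KV-cache entries or adapter weights. The class name and bookkeeping are illustrative only, not the paper's implementation.

```python
# Toy unified memory pool: KV cache entries and adapter weights share one set of
# fixed-size pages, so a length-S KV cache takes S pages and a rank-R adapter
# weight takes R pages.
import torch

class UnifiedPagedPool:
    def __init__(self, num_pages, h_dim, device="cpu"):
        self.pages = torch.empty(num_pages, h_dim, device=device)
        self.free = list(range(num_pages))   # indices of unused pages
        self.owner = {}                       # page index -> ("kv" | "adapter", id)

    def alloc(self, n_pages, kind, owner_id):
        if len(self.free) < n_pages:
            raise RuntimeError("memory pool exhausted")
        got = [self.free.pop() for _ in range(n_pages)]
        for p in got:
            self.owner[p] = (kind, owner_id)
        return got                            # pages may be non-contiguous

    def free_pages(self, page_ids):
        for p in page_ids:
            self.owner.pop(p, None)
            self.free.append(p)

pool = UnifiedPagedPool(num_pages=4096, h_dim=4096)
kv_pages = pool.alloc(128, kind="kv", owner_id="req-0")        # sequence length 128
lora_pages = pool.alloc(16, kind="adapter", owner_id="lora-7") # rank-16 adapter
```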
Tensor Parallel
In addition, the researchers designed a novel tensor parallelism strategy for batched LoRA inference to support multi-GPU serving of large Transformer models. Tensor parallelism is the most widely used parallelization approach because its single-program, multiple-data paradigm simplifies implementation and integration with existing systems. It reduces per-GPU memory usage and latency when serving large models. In this setting, the additional LoRA adapters introduce new weight matrices and matrix multiplications, which require new partitioning strategies.
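As a hedged illustration of the kind of partitioning involved, the sketch below shards a column-parallel base weight and the LoRA B matrix along the output dimension while replicating the small rank-r matrix A, then checks that concatenating the shards reproduces the unsharded result. This is one workable scheme, run serially on CPU for clarity; the paper's actual tensor parallelism strategy and its communication pattern may differ.

```python
# Illustrative sharding for a column-parallel layer with a LoRA adapter:
# split W and B along the output dimension, replicate the tiny A.
import torch

world_size, h_dim, out_dim, r, batch = 2, 512, 1024, 8, 4
W = torch.randn(h_dim, out_dim)
A = torch.randn(h_dim, r) * 0.01
B = torch.randn(r, out_dim) * 0.01
x = torch.randn(batch, h_dim)

# Column-parallel split of the base weight and of B's output dimension.
W_shards = W.chunk(world_size, dim=1)
B_shards = B.chunk(world_size, dim=1)

shard_outputs = []
for rank in range(world_size):              # stand-in for per-GPU execution
    xA = x @ A                               # rank-r intermediate, cheap to replicate
    shard_outputs.append(x @ W_shards[rank] + xA @ B_shards[rank])

h_parallel = torch.cat(shard_outputs, dim=1)  # stands in for an all-gather
h_reference = x @ W + (x @ A) @ B
print(torch.allclose(h_parallel, h_reference, rtol=1e-3, atol=1e-3))
```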
Evaluation
Finally, the researchers evaluated S-LoRA by serving Llama-7B/13B/30B/70B.
The results show that S-LoRA can serve thousands of LoRA adapters on a single GPU or multiple GPUs with very small overhead. Compared to HuggingFace PEFT, a state-of-the-art parameter-efficient fine-tuning library, S-LoRA achieves up to 30x higher throughput. Compared to vLLM, a high-throughput serving system with naive LoRA serving support, S-LoRA improves throughput by 4x and increases the number of served adapters by several orders of magnitude.
For more research details, please refer to the original paper.