With less than 1,000 lines of code, the PyTorch team made Llama 7B 10 times faster

PHPz

Dec 05, 2023 pm 03:14 PM


The PyTorch team personally teaches you how to accelerate large model inference.

Generative AI has developed rapidly over the past year, and text generation has been an especially popular area. Many open source projects, such as llama.cpp, vLLM, and MLC-LLM, are being continuously optimized to achieve better performance.

As one of the most popular frameworks in the machine learning community, PyTorch has seized this opportunity and keeps optimizing as well. To help everyone understand these innovations, the PyTorch team has started a blog series on how to accelerate generative AI models using pure, native PyTorch.


Code address: https://github.com/pytorch-labs/gpt-fast

In the previous blog post, the PyTorch team showed how to rewrite the Segment Anything (SAM) model using only pure, native PyTorch, making it 8 times faster than the original implementation. In this post, they cover something new: how to speed up LLM inference.

Let's look at the results first. The team rewrote the LLM, and inference is 10 times faster than the baseline, with no loss of accuracy and in fewer than 1,000 lines of pure, native PyTorch code!


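All benchmarks were run on an A100-80GB GPU, power-limited to 330 W.

The optimizations include torch.compile with a static KV cache, int8 and int4 weight quantization, speculative decoding, and tensor parallelism.

Next, let's see how each step is implemented.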

6 Steps to speed up large model inference

Without any optimization, the model runs at only 25.5 tok/s, which is not very good.

After some exploration, the team found the reason: excessive CPU overhead. This led to the following six-step optimization process.

Step 1: Reduce CPU overhead through torch.compile and a static KV cache to achieve 107.0 tok/s

torch.compile lets users capture a larger region into a single compiled region, and it is particularly effective at reducing CPU overhead when run with mode="reduce-overhead". The team also specifies fullgraph=True, which verifies that the model contains no "graph breaks" (that is, parts that torch.compile cannot compile).

However, even with torch.compile, there are still some obstacles.

The first obstacle is the KV cache. As the user generates more tokens, the "logical length" of the KV cache grows. This is a problem for two reasons: first, reallocating (and copying) the KV cache every time it grows is very expensive; second, this dynamic allocation makes it harder to reduce the overhead.

To solve this, the team uses a static KV cache: the maximum size of the KV cache is allocated up front, and unused values are masked out in the attention mechanism.
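The idea of a static KV cache can be sketched in a few lines. The following is a minimal, dependency-free illustration in plain Python rather than the team's PyTorch code; the names StaticKVCache and attend are hypothetical, and the sketch handles a single attention head and a single query.

```python
import math

class StaticKVCache:
    """Fixed-size key/value buffers, allocated once up front (hypothetical helper)."""

    def __init__(self, max_seq_len, head_dim):
        self.k = [[0.0] * head_dim for _ in range(max_seq_len)]
        self.v = [[0.0] * head_dim for _ in range(max_seq_len)]
        self.length = 0  # logical length: number of slots filled so far

    def update(self, k_new, v_new):
        # Write in place at the next position: no reallocation, no copy.
        self.k[self.length] = list(k_new)
        self.v[self.length] = list(v_new)
        self.length += 1

def attend(query, cache):
    """Single-query attention over the whole buffer, masking unused slots."""
    scale = 1.0 / math.sqrt(len(query))
    scores = []
    for pos, key in enumerate(cache.k):
        if pos < cache.length:
            scores.append(scale * sum(q * k for q, k in zip(query, key)))
        else:
            scores.append(float("-inf"))  # masked slot: softmax weight becomes 0
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    head_dim = len(cache.v[0])
    return [sum(w / total * v[d] for w, v in zip(weights, cache.v))
            for d in range(head_dim)]

cache = StaticKVCache(max_seq_len=8, head_dim=2)
cache.update([1.0, 0.0], [1.0, 2.0])
cache.update([0.0, 1.0], [3.0, 4.0])
out = attend([1.0, 0.0], cache)
```

Because the buffers never change shape, every decoding step performs the same operations on tensors of the same size, and the mask alone tracks how much of the cache is in use. The sketch assumes at least one token has been written before attend is called.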

The second obstacle is the prefill stage. Text generation with a Transformer can be viewed as a two-stage process: (1) a prefill stage that processes the entire prompt, and (2) a decoding stage that generates tokens one at a time.

Although the KV cache is now static, the prefill stage is still more dynamic because prompt lengths vary. The two stages therefore need to be compiled with separate compilation strategies.

These details are a bit tricky, but they are not hard to implement, and the performance gains are large. With these changes, performance increases by more than 4 times, from 25 tok/s to 107 tok/s.

Step 2: Alleviate the memory bandwidth bottleneck through int8 weight quantization to achieve 157.4 tok/s

torch.compile and the static KV cache already bring a large speedup, but the PyTorch team was not satisfied and looked for other angles of optimization.

They believe the biggest remaining bottleneck in accelerating generative AI inference is the cost of loading weights from GPU global memory into registers. In other words, every forward pass needs to "touch" every parameter on the GPU. So how fast can we theoretically "touch" every parameter in the model?

To measure this, the team uses Model Bandwidth Utilization (MBU), which is simple to calculate:

MBU = (number of parameters × bytes per parameter × tokens/s) / memory bandwidth

For example, take a 7B-parameter model with each parameter stored in fp16 (2 bytes per parameter), running at 107 tokens/s. The A100-80GB has a theoretical memory bandwidth of 2 TB/s.

Plugging these values into the formula gives an MBU of 72%! This is quite a good result, since many studies struggle to get past 85%.
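The arithmetic can be checked directly. The sketch below assumes MBU is the bytes of weights read per second divided by the memory bandwidth; the function name is hypothetical. Note that a nominal 7e9 parameters gives roughly 75%, while the quoted 72% is reproduced with about 6.74e9 parameters, which is an assumption here about the model's exact size and is not stated in the text.

```python
def model_bandwidth_utilization(num_params, bytes_per_param, tokens_per_s,
                                bandwidth_bytes_per_s):
    """Fraction of peak memory bandwidth spent reading the weights once per token."""
    bytes_per_token = num_params * bytes_per_param
    return bytes_per_token * tokens_per_s / bandwidth_bytes_per_s

A100_BANDWIDTH = 2e12  # 2 TB/s, the theoretical figure quoted in the text

# Nominal "7B" parameter count, fp16 weights, 107 tokens/s.
mbu_nominal = model_bandwidth_utilization(7e9, 2, 107, A100_BANDWIDTH)

# Assumed exact parameter count of about 6.74e9 (not given in the text).
mbu_assumed = model_bandwidth_utilization(6.74e9, 2, 107, A100_BANDWIDTH)

print(f"{mbu_nominal:.1%}, {mbu_assumed:.1%}")  # prints "74.9%, 72.1%"
```

The same function also shows why shrinking the bytes per parameter helps: at a fixed MBU, halving bytes_per_param doubles the achievable tokens/s.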


But the PyTorch team wanted to go further. They could not change the number of parameters in the model, nor the memory bandwidth of the GPU. What they could change was the number of bytes used to store each parameter!

So they turned to int8 quantization.

Note that only the weights are quantized; the computation itself is still done in bf16. Furthermore, with torch.compile it is easy to generate efficient code for int8 quantization.
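Weight-only quantization can be illustrated with a small, dependency-free sketch. It assumes symmetric quantization with one scale per output row, which is one common scheme and not necessarily the exact one the team used; quantize_int8 and linear_int8 are hypothetical names, and Python floats stand in for bf16.

```python
def quantize_int8(row):
    """Symmetric per-row quantization: one float scale plus int8 values."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale

def linear_int8(x, q_weight, scales):
    """y = W x with W stored as int8; the arithmetic stays in floating point."""
    out = []
    for q_row, scale in zip(q_weight, scales):
        # Only the int8 values and one scale are read from memory per row;
        # the multiply-accumulate itself is ordinary float math.
        out.append(scale * sum(xi * qi for xi, qi in zip(x, q_row)))
    return out

weight = [[0.12, -0.50, 0.33], [1.20, 0.05, -0.70]]
quantized = [quantize_int8(row) for row in weight]
q_weight = [q for q, _ in quantized]
scales = [s for _, s in quantized]

x = [1.0, 2.0, -1.0]
y_ref = [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]
y_q = linear_int8(x, q_weight, scales)
```

Each weight now occupies one byte instead of two, so half as many bytes cross the memory bus per token, while y_q stays close to y_ref.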

In the team's benchmark chart, the dark blue line (torch.compile + int8) shows a significant performance improvement when torch.compile is combined with int8 weight-only quantization.

Applying int8 quantization to the Llama-7B model improves performance by about 50%, reaching 157.4 tokens/s.


Step 3: Use Speculative Decoding

Even with int8 quantization and the other techniques, the team still faced another problem: to generate 100 tokens, the weights must be loaded 100 times.

Even with quantized weights, loading the weights over and over seems unavoidable. How can this be solved? It turns out that speculative decoding can break this strict serial dependency and yield a speedup.

The team uses a draft model to generate 8 tokens, then uses the verifier model to process them in parallel, discarding the tokens that do not match. This breaks the serial dependency. The entire implementation takes about 50 lines of native PyTorch code.
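The draft-then-verify loop can be sketched with toy models. This is a simplified greedy variant in plain Python, not the team's roughly 50-line PyTorch implementation: the "models" are hypothetical functions mapping a token list to the next token, and the verifier's parallel forward pass is only simulated by a loop.

```python
def speculative_step(target_next, draft_next, tokens, k=8):
    """One round of greedy speculative decoding; returns the tokens to append."""
    # 1. The cheap draft model proposes k tokens, one after another.
    draft = []
    for _ in range(k):
        draft.append(draft_next(tokens + draft))

    # 2. The verifier scores every drafted position. On a GPU this would be a
    #    single batched forward pass; the loop here only simulates it.
    target = [target_next(tokens + draft[:i]) for i in range(k + 1)]

    # 3. Keep the longest prefix on which both models agree, then append the
    #    verifier's own token (the correction, or a bonus token if all matched).
    accepted = []
    for i in range(k):
        if draft[i] != target[i]:
            break
        accepted.append(draft[i])
    accepted.append(target[len(accepted)])
    return accepted

def generate(target_next, draft_next, prompt, n_new, k=8):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_new:
        tokens += speculative_step(target_next, draft_next, tokens, k)
    return tokens[:len(prompt) + n_new]

def target_next(tokens):  # toy verifier model
    return (3 * tokens[-1] + 1) % 10

def draft_next(tokens):  # toy draft model: wrong after token 4, to force rejections
    return 0 if tokens[-1] == 4 else target_next(tokens)

speculative = generate(target_next, draft_next, prompt=[1], n_new=20)
```

In this greedy setting the output is identical to decoding with the verifier alone; the gain is that each verifier pass can accept several tokens at once, so the large model's weights are loaded fewer times per generated token.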

Step 4: Use int4 quantization and GPTQ to shrink the weights further and achieve 202.1 tok/s

The team found that when the weights are quantized to 4 bits, the accuracy of the model starts to degrade.

To address this, the team combines two techniques: a more fine-grained scaling factor and a more advanced quantization strategy (GPTQ).
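Why a finer-grained scaling factor helps can be seen in a small sketch. It uses plain round-to-nearest int4 quantization with one scale per group of weights; it does not implement GPTQ, and the function names and the example row are hypothetical.

```python
def quantize_int4_groupwise(weights, group_size):
    """Round-to-nearest symmetric int4, one scale per group; returns dequantized values."""
    dequantized = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0  # symmetric int4 range: -7..7
        for w in group:
            q = max(-7, min(7, round(w / scale)))
            dequantized.append(q * scale)
    return dequantized

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# A row with one outlier: a single coarse scale is dominated by the outlier,
# so every small weight rounds to zero.
row = [0.02, -0.03, 0.01, 0.04, -0.02, 0.03, -0.01, 8.0]

err_coarse = mean_abs_error(row, quantize_int4_groupwise(row, group_size=8))
err_fine = mean_abs_error(row, quantize_int4_groupwise(row, group_size=4))
```

With a group size of 4, the first group gets its own small scale and keeps its values, so err_fine is lower than err_coarse. The price is storing more scales, which slightly increases the bytes read per weight.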


Step 5: Combining everything together, we get 244.7 tok/s

Finally, combining all techniques together for better performance, we get 244.7 tok/s.

Step 6: Tensor parallelism

So far, the goal has been to minimize latency on a single GPU. It is also possible to use multiple GPUs, which improves latency further.

Fortunately, the PyTorch team provides low-level tools for tensor parallelism that only require 150 lines of code and do not require any model changes.
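The core idea behind tensor parallelism for a linear layer can be shown without any GPU. The sketch below is a conceptual illustration in plain Python, not the PyTorch tooling the team used: shard_rows and matvec are hypothetical helpers, and list concatenation stands in for an all-gather across devices.

```python
def matvec(weight_rows, x):
    """y = W x for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]

def shard_rows(weight_rows, world_size):
    """Split the output dimension of a linear layer evenly across devices."""
    per_rank = len(weight_rows) // world_size
    return [weight_rows[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]

weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 outputs, 2 inputs
x = [1.0, -1.0]

shards = shard_rows(weight, world_size=2)
partial = [matvec(shard, x) for shard in shards]  # would run concurrently, one per GPU
gathered = [y for part in partial for y in part]  # stands in for an all-gather
```

Each device holds and reads only its own shard of the weights, so the per-device memory traffic per token drops roughly in proportion to the number of devices, at the cost of communication to combine the partial results.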

All of the optimizations above can still be combined with tensor parallelism. Together, they serve the Llama-70B model with int8 quantization at 55 tokens/s.

To summarize: on Llama-7B, the team uses the "compile + int4 quant + speculative decoding" combination to reach about 240 tok/s. On Llama-70B, they also add tensor parallelism to reach about 80 tok/s. Both results are close to or exceed SOTA performance.

Original link: https://pytorch.org/blog/accelerating-generative-ai-2/
