Large language models (LLMs) have demonstrated compelling capabilities on many important tasks, including natural language understanding, language generation, and complex reasoning, and have had a profound impact on society. However, these outstanding capabilities come at the cost of substantial training resources (shown in the left panel of the figure) and long inference times (shown in the right panel). Researchers therefore need to develop effective techniques to address the efficiency problems of LLMs.
In addition, as the right side of the figure shows, efficient LLMs such as Mistral-7B have already been used successfully in the design and deployment of LLMs. These efficient LLMs greatly reduce inference memory usage and inference latency while maintaining accuracy comparable to LLaMA1-33B, demonstrating that feasible and efficient methods have already been applied successfully to the design and use of LLMs.
In this review, researchers from Ohio State University, Imperial College London, Michigan State University, the University of Michigan, Amazon, Google, Boson AI, and Microsoft Research Asia provide a systematic and comprehensive survey of research on efficient LLMs. They divide existing techniques for optimizing LLM efficiency into three categories, namely model-centric, data-centric, and framework-centric, and summarize and discuss the most cutting-edge related techniques.
To conveniently organize the papers covered by the review and keep them up to date, the researchers created a GitHub repository that they actively maintain. They hope that this repository will help researchers and practitioners systematically understand the research and development of efficient LLMs and inspire them to contribute to this important and exciting field.
The repository is available at https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey. It collects material related to the survey of efficient LLMs, including research papers, code, and documentation, to help people better understand and explore efficient and low-power machine learning systems. If you are interested in this area, you can find more information by visiting the repository.
A model-centric approach focuses on efficient techniques at both the algorithm level and the system level, with the model itself as the object of optimization. Since LLMs have billions or even trillions of parameters and exhibit unique characteristics, such as emergent abilities, compared with smaller models, new techniques need to be developed to optimize their efficiency. The article discusses five categories of model-centric methods in detail: model compression, efficient pre-training, efficient fine-tuning, efficient inference, and efficient model architecture design.
1. Model compression
In machine learning, model size is often an important consideration. Larger models require more storage space and computing resources and may run into limitations when deployed on mobile devices. Model compression is therefore a commonly used technique for reducing the size of a model.
Model compression techniques fall into four main categories: quantization, parameter pruning, low-rank approximation, and knowledge distillation (see the figure below). Quantization compresses the model's weights or activation values from high precision to low precision; parameter pruning searches for and removes the more redundant parts of the model weights; low-rank approximation factorizes the weight matrices into products of several small low-rank matrices; and knowledge distillation uses the large model directly to train a small model, so that the small model can replace the large model on certain tasks. A minimal quantization sketch is shown below.
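As a toy illustration of the quantization idea described above (not code from the survey), the following NumPy sketch rounds a floating-point weight matrix to symmetric int8 values plus a single scale factor; the helper names `quantize_int8` and `dequantize` are our own.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: w is approximated by scale * q, with q in int8."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the int8 codes back to float32 for computation."""
    return q.astype(np.float32) * scale

# Compress a random weight matrix and check the reconstruction error.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```

Storing `q` instead of `w` cuts memory by roughly 4x relative to float32; the quantization schemes covered in the survey (for example per-channel or activation-aware methods) are considerably more sophisticated.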
2. Efficient pre-training
Pre-training LLMs is very expensive. Efficient pre-training aims to improve efficiency and reduce the cost of the LLM pre-training process. It can be divided into mixed-precision acceleration, model scaling, initialization techniques, optimization strategies, and system-level acceleration.
Mixed-precision acceleration improves pre-training efficiency by computing gradients, weights, and activations in low precision and then converting the results back to high precision to update the original full-precision weights; a minimal training loop in this style is sketched below. Model scaling accelerates pre-training convergence and reduces training cost by using the parameters of a small model to initialize a large one. Initialization techniques speed up model convergence by carefully designing the model's initial values. Optimization strategies focus on designing lightweight optimizers that reduce memory consumption during training. System-level acceleration uses distributed training and related techniques to accelerate pre-training at the system level.
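For concreteness, here is a minimal mixed-precision training loop using PyTorch's automatic mixed precision, written as a generic sketch rather than the survey's method; the toy linear model and random data are our own, and it assumes a CUDA GPU is available.

```python
import torch

# Toy stand-ins for a real LLM and its data (assumptions for illustration).
model = torch.nn.Linear(512, 512).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
x = torch.randn(8, 512, device="cuda")

for step in range(10):
    opt.zero_grad(set_to_none=True)
    # Forward pass runs in fp16 where it is safe; master weights stay in fp32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()   # scale the loss to avoid fp16 gradient underflow
    scaler.step(opt)                # unscale gradients, then update the fp32 weights
    scaler.update()
```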
3. Efficient fine-tuning
Efficient fine-tuning aims to improve the efficiency of the LLM fine-tuning process. Common efficient fine-tuning techniques fall into two categories: parameter-efficient fine-tuning and memory-efficient fine-tuning.
The goal of parameter-efficient fine-tuning (PEFT) is to adapt an LLM to downstream tasks by freezing the entire LLM backbone and updating only a small set of additional parameters. The survey further divides PEFT into adapter-based fine-tuning, low-rank adaptation, prefix tuning, and prompt tuning; a minimal low-rank adaptation sketch is shown below.
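As a toy illustration of low-rank adaptation (a generic LoRA-style sketch, not the survey's code), the following PyTorch module freezes a pretrained linear layer and learns only a small low-rank update; the class name `LoRALinear` and the hyperparameters are our assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (B A) x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # freeze the pretrained backbone weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / total: {total}")  # only the small A and B matrices are trained
```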
Memory-efficient fine-tuning focuses on reducing memory consumption throughout the LLM fine-tuning process, for example by reducing the memory consumed by optimizer state and activation values.
4. Efficient inference
Efficient inference aims to improve the efficiency of the LLM inference process. The researchers divide common efficient inference techniques into two categories: algorithm-level inference acceleration and system-level inference acceleration.
Algorithm-level inference acceleration can be divided into two categories: speculative decoding and KV-cache optimization. Speculative decoding speeds up the sampling process by using a smaller draft model to propose speculative token prefixes in parallel, which the larger target model then verifies. KV-cache optimization reduces the repeated computation of key-value (KV) pairs during LLM inference; a toy decoding loop with a KV cache is sketched below.
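The following NumPy sketch shows, in a deliberately simplified single-head setting of our own construction (not the survey's code), why caching keys and values avoids recomputation: each decoding step computes K/V only for the newest token and reuses everything already in the cache.

```python
import numpy as np

d = 64
W_q, W_k, W_v = (np.random.randn(d, d) * 0.02 for _ in range(3))
k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x_t: np.ndarray) -> np.ndarray:
    """One autoregressive step: compute K/V only for the new token, reuse the rest."""
    q = x_t @ W_q
    k_cache.append(x_t @ W_k)
    v_cache.append(x_t @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)   # cached keys/values, never recomputed
    scores = np.exp(q @ K.T / np.sqrt(d))
    return (scores / scores.sum()) @ V

for _ in range(5):   # pretend each step feeds in the next token's embedding
    out = decode_step(np.random.randn(d))
print("cached steps:", len(k_cache))
```

Without the cache, every step would recompute K and V for the entire prefix, which is exactly the redundancy that KV-cache optimization targets.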
System-level inference acceleration optimizes the number of memory accesses on specific hardware, increases the degree of parallelism in the algorithm, and so on, to accelerate LLM inference.
5. Efficient model architecture design
Efficient architecture design for LLMs refers to strategically optimizing the model structure and computation flow to improve performance and scalability while minimizing resource consumption. The survey divides efficient model architecture design into four major categories by model type: efficient attention modules, mixture-of-experts models, long-context LLMs, and architectures that can replace the Transformer.
Efficient attention modules aim to reduce the expensive computation and memory usage of the attention mechanism. Mixture-of-experts (MoE) models replace certain modules of the LLM with multiple small expert models, only a few of which are activated for any given input, making the model sparse overall; a toy MoE layer is sketched below. Long-context LLMs are designed specifically to process very long text efficiently. Transformer-alternative architectures redesign the model architecture to reduce its complexity while achieving reasoning capabilities comparable to the Transformer.
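To make the MoE sparsity idea concrete, here is a toy top-1 routed mixture-of-experts layer in PyTorch; it is our own minimal sketch (the class name `TinyMoE`, the sizes, and top-1 routing are assumptions), not an architecture from the survey.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Sparse MoE layer: each token is routed to a single expert FFN (top-1 gating)."""
    def __init__(self, d_model: int = 64, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                         # x: (num_tokens, d_model)
        gate = self.router(x).softmax(dim=-1)     # routing probabilities per token
        weight, idx = gate.max(dim=-1)            # pick the single best expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                        # only the chosen expert runs on its tokens
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(16, 64)).shape)       # torch.Size([16, 64])
```

Although the layer stores parameters for all experts, each token pays the compute cost of only one expert, which is the source of MoE's efficiency.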
A data-centric approach focuses on the role of data quality and structure in improving the efficiency of LLMs. The researchers discuss two types of data-centric methods in detail: data selection and prompt engineering.
1. Data selection
Data selection for LLMs cleans and selects the pre-training/fine-tuning data, for example by removing redundant and invalid samples, in order to speed up the training process; a toy cleaning pass is sketched below.
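As a simple illustration of what such cleaning can look like (our own sketch, not a pipeline from the survey), the following Python function drops exact duplicates and very short documents from a corpus.

```python
import hashlib

def clean_corpus(docs, min_chars=200):
    """Toy data-selection pass: drop exact duplicates and very short documents."""
    seen, kept = set(), []
    for doc in docs:
        text = doc.strip()
        if len(text) < min_chars:
            continue                                  # too short to be useful for training
        h = hashlib.md5(text.encode("utf-8")).hexdigest()
        if h in seen:
            continue                                  # exact duplicate of an earlier document
        seen.add(h)
        kept.append(text)
    return kept

corpus = ["a" * 300, "a" * 300, "too short"]
print(len(clean_corpus(corpus)))                      # -> 1
```

Real data-selection methods covered in the survey go much further, for example near-duplicate detection and quality or influence scoring.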
2. Prompt engineering
Prompt engineering guides LLMs to generate the desired output efficiently by designing effective inputs (prompts); well-designed prompts can achieve model performance comparable to tedious fine-tuning. The researchers divide common prompt engineering techniques into three major categories: few-shot prompting, prompt compression, and prompt generation.
Few-shot prompting provides the LLM with a small set of examples to guide its understanding of the task to be performed; a minimal prompt-building sketch is shown below. Prompt compression accelerates the LLM's processing of inputs by compressing lengthy prompts or by learning and using compact prompt representations. Prompt generation aims to automatically create effective prompts that guide the model to produce specific, relevant responses, rather than relying on manually annotated data.
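For illustration only, here is a tiny helper that assembles a few-shot prompt from labeled demonstrations; the sentiment-analysis task, the prompt format, and the function name `build_few_shot_prompt` are our assumptions, not prescribed by the survey.

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt: labeled demonstrations followed by the new query."""
    parts = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    parts.append(f"Review: {query}\nSentiment:")     # the model is expected to complete this line
    return "\n\n".join(parts)

demos = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I left the theater halfway through.", "negative"),
]
print(build_few_shot_prompt(demos, "The soundtrack alone is worth the ticket."))
```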
The researchers also survey recently popular frameworks for efficient LLMs and list the tasks they can optimize, including pre-training, fine-tuning, and inference (as shown in the figure below).
In this survey, the researchers provide a systematic review of efficient LLMs, an important research area dedicated to making LLMs more democratized. They begin by explaining why efficient LLMs are needed, and then, within an organized framework, survey efficient techniques at the algorithm level and system level of LLMs from the model-centric, data-centric, and framework-centric perspectives respectively.
The researchers believe that efficiency will play an increasingly important role in LLMs and LLM-oriented systems. They hope that this survey will help researchers and practitioners quickly enter this field and serve as a catalyst for new research on efficient LLMs.