Low-rank adaptation of large models reduces complexity by approximating the high-dimensional structure of a large model with low-dimensional structures. The aim is to obtain a smaller, more manageable representation of the model that still performs well. For many tasks, the high-dimensional structure of a large model contains redundant or irrelevant information; by identifying and removing these redundancies, a more efficient model can be built that maintains the original performance while requiring fewer resources to train and deploy.
Low-rank adaptation speeds up the training of large models while also reducing memory consumption. Its core idea is to freeze the weights of the pre-trained model and inject trainable rank-decomposition matrices into each layer of the Transformer architecture, which greatly reduces the number of trainable parameters for downstream tasks. Concretely, the weight update is decomposed into the product of two low-rank matrices. Training only these low-rank matrices reduces the number of trainable parameters and increases training speed, while preserving model quality and adding no extra inference latency.
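To see the scale of the savings, here is a minimal Python sketch; the layer dimensions and rank below are illustrative choices for the example, not figures from any particular model.

```python
# Minimal sketch of the parameter savings from a rank decomposition.
# The dimensions are illustrative, not taken from any specific model.
d_out, d_in = 4096, 4096   # a hypothetical dense layer in a Transformer
r = 8                      # LoRA rank, much smaller than d_out and d_in

full_update_params = d_out * d_in      # training the full weight update
lora_params = r * (d_out + d_in)       # training B (d_out x r) and A (r x d_in)

print(f"full update: {full_update_params:,} parameters")        # 16,777,216
print(f"LoRA update: {lora_params:,} parameters")                # 65,536
print(f"reduction:   {full_update_params / lora_params:.0f}x")   # 256x
```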
Taking the GPT-3 model as an example, low-rank adaptation (LoRA) trains a neural network indirectly by optimizing rank-decomposition matrices for its dense layers. The advantage of LoRA is that only a small number of parameters need to be fine-tuned rather than retraining the entire model, which improves efficiency during deployment. For GPT-3, LoRA only needs to optimize very low-rank decomposition matrices to achieve performance comparable to full-parameter fine-tuning. The method is not only efficient in storage and computation, but can also reduce over-fitting and improve the model's generalization ability. Through LoRA, large models can be applied more flexibly to a variety of scenarios, opening up more possibilities for deep learning.
In addition, the idea behind low-rank adaptation is simple. A bypass is added next to the original PLM (pre-trained language model) that first reduces the dimensionality and then expands it again, simulating the so-called intrinsic dimension. During training, the parameters of the PLM are frozen, and only the down-projection matrix A and the up-projection matrix B are trained. The input and output dimensions of the model remain unchanged, and the output of BA is added to the output of the PLM. The down-projection matrix A is initialized from a random Gaussian distribution, while the up-projection matrix B is initialized to zero, which ensures that the bypass is still a zero matrix at the start of training.
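Below is a minimal PyTorch-style sketch of such a bypass around a frozen linear layer. The wrapper name `LoRALinear` and the `alpha` scaling factor are assumptions made for this example, not a reference implementation; the initialization follows the description above (Gaussian for A, zeros for B).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA bypass around a frozen pre-trained linear layer."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)        # freeze the pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)

        d_out, d_in = base_linear.out_features, base_linear.in_features
        # A: down-projection, initialized from a random Gaussian
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        # B: up-projection, initialized to zero so the bypass starts as a zero matrix
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output of the frozen layer plus the low-rank bypass B @ A
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Usage: wrap an existing dense layer; only A and B receive gradients.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
```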
This idea has some similarity to residual connections: the bypass update simulates the process of full fine-tuning. In fact, full fine-tuning can be seen as a special case of LoRA, namely when r equals k. That is, by applying LoRA to all weight matrices and training all bias terms, and setting the LoRA rank r to the rank k of the pre-trained weight matrices, we roughly recover the expressive power of full fine-tuning. In other words, as the number of trainable parameters increases, training with LoRA converges to training the original model, whereas adapter-based methods converge to an MLP and prefix-based methods converge to a model that cannot handle long input sequences. LoRA therefore offers a flexible way to trade off the number of trainable parameters against the expressive power of the model.
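To make the argument concrete, the adapted layer's forward pass described above can be written compactly, with the frozen pre-trained weight W0 of size d x k and the bypass factors B (d x r) and A (r x k):

```latex
h = W_0 x + \Delta W\, x = W_0 x + B A x,
\qquad \operatorname{rank}(\Delta W) = \operatorname{rank}(BA) \le r .
```

Since the rank of BA is at most r, choosing r = k removes the rank restriction on the update, which is why full fine-tuning appears as the limiting case of LoRA.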
Low-rank adaptation and neural network compression differ in their goals and methods.
The goal of neural network compression is to reduce the number of parameters, the computational cost, and the storage requirements while maintaining performance. Its methods include changing the network structure, quantization, and approximation.
Neural network compression methods can be divided into three categories: approximation, quantization, and pruning.
1) Approximation methods use matrix or tensor decomposition to reconstruct the original parameters from a small number of factors, reducing the network's storage overhead.
2) Quantization methods map the possible values of the network parameters from the real domain onto a finite set, or represent the parameters with fewer bits, again reducing storage overhead (a simple sketch follows this list).
3) Pruning methods directly change the structure of the network; by granularity they can be divided into layer-level, neuron-level, and connection-level pruning.
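As a concrete illustration of the quantization idea in 2), here is a small Python sketch of a simple symmetric 8-bit scheme; the matrix size is hypothetical, and real quantization toolkits differ in many details.

```python
import numpy as np

# Illustrative post-training quantization of one weight matrix to 8-bit integers.
rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256)).astype(np.float32)   # hypothetical layer weights

scale = np.abs(weights).max() / 127.0                  # map the real-valued range onto int8
q_weights = np.round(weights / scale).astype(np.int8)  # stored with 8 bits per value

dequantized = q_weights.astype(np.float32) * scale     # approximate weights used at inference
print("storage:", weights.nbytes, "->", q_weights.nbytes, "bytes")   # 4x smaller
print("max abs error:", np.abs(weights - dequantized).max())
```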
Low-rank adaptation, by contrast, reduces the complexity of the model by reducing the dimensionality of its parameters, usually through techniques such as matrix decomposition. This approach is used to cut the computational cost and storage requirements of the model while maintaining its predictive ability.
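The matrix-decomposition idea can be illustrated with a truncated SVD, which replaces a weight matrix with two thin factors. The matrix size and rank below are arbitrary choices for this sketch, not values from a real model.

```python
import numpy as np

# Illustrative low-rank approximation of a weight matrix via truncated SVD.
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512)).astype(np.float32)   # hypothetical weight matrix
r = 32                                               # rank kept after truncation

U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_low_rank = (U[:, :r] * S[:r]) @ Vt[:r, :]          # keep only the top-r components

original_params = W.size                                     # 512 * 512 = 262,144
factored_params = U[:, :r].size + Vt[:r, :].size + r         # two thin factors plus r singular values
print(original_params, "->", factored_params, "parameters")

rel_err = np.linalg.norm(W - W_low_rank) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.2f}")
```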
In general, neural network compression is a broader concept that covers a variety of methods to reduce the parameters and storage space of neural networks. Low-rank adaptation is a specific technique designed to reduce the complexity of large models by approximating them with low-dimensional structures.