RoSA: A new method for efficient fine-tuning of large model parameters

WBOY
Release: 2024-01-18 17:27:17

As language models grow to unprecedented scale, full fine-tuning for downstream tasks becomes prohibitively expensive. To address this, researchers have turned to parameter-efficient fine-tuning (PEFT) methods. The main idea of PEFT is to restrict fine-tuning to a small set of parameters, which reduces computational cost while still achieving state-of-the-art performance on natural language understanding tasks. This lets researchers save computing resources while keeping performance high, and it has become an active research topic in natural language processing.

RoSA is a new PEFT technique. In experiments on a set of benchmarks, RoSA outperformed both earlier low-rank adaptation (LoRA) and pure sparse fine-tuning methods under the same parameter budget.

This article looks at the principles, method and results of RoSA and explains why its performance marks meaningful progress. For those who want to fine-tune large language models efficiently, RoSA offers a new option that outperforms earlier approaches.

The need for parameter-efficient fine-tuning

NLP has been completely transformed by transformer-based language models such as GPT-4. These models learn powerful language representations by pre-training on large text corpora, and then transfer those representations to downstream language tasks through fine-tuning.

As model size grows from billions to trillions of parameters, fine-tuning brings a huge computational burden. For example, for a model on the scale of GPT-4, which is reported to have about 1.76 trillion parameters, fine-tuning can cost millions of dollars. This makes deployment in real applications impractical.

PEFT methods improve efficiency by restricting fine-tuning to a small subset of the parameters. Recently, a variety of PEFT techniques have emerged, each making a different trade-off between efficiency and accuracy.

LoRA

One prominent PEFT method is low-rank adaptation (LoRA). LoRA was introduced in 2021 by researchers at Microsoft. It is motivated by the observation that the weight updates a transformer needs during adaptation have a low intrinsic rank. LoRA exploits this low-rank structure to cut the number of trainable parameters and make fine-tuning cheaper and faster.

LoRA keeps the pretrained weights frozen and trains only a rank-k update, written as the product of two small matrices. For a d × n weight matrix this needs only k(d + n) trainable parameters instead of dn.

By exploiting this low-rank structure, LoRA captures the signal needed to generalize to downstream tasks while restricting fine-tuning to the low-rank factors, which makes optimization and inference more efficient.
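The low-rank update can be illustrated with a minimal NumPy sketch. It assumes the common formulation in which the pretrained weight W stays frozen and only two small factors A and B are trained; the function names, shapes and initialization are illustrative choices, not taken from this article.

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Apply a frozen weight W plus a low-rank update B @ A to inputs x."""
    # W: (d_out, d_in) frozen; B: (d_out, k) and A: (k, d_in) are trainable.
    return x @ (W + scale * (B @ A)).T

def lora_param_count(d_out, d_in, k):
    """Number of trainable parameters in the two low-rank factors."""
    return k * (d_out + d_in)

rng = np.random.default_rng(0)
d_out, d_in, k = 64, 48, 4
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(k, d_in)) * 0.01    # trainable factor
B = np.zeros((d_out, k))                 # trainable factor, zero so the update starts at 0
x = rng.normal(size=(2, d_in))

y = lora_forward(x, W, A, B)
full_params = d_out * d_in                      # parameters touched by full fine-tuning
lora_params = lora_param_count(d_out, d_in, k)  # parameters touched by the low-rank update
print(full_params, lora_params)
```

With B initialized to zero, the adapted layer starts out identical to the pretrained one, and the update B @ A can never exceed rank k no matter how it is trained.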

Experiments show that LoRA can match full fine-tuning on the GLUE benchmark while using more than 100 times fewer trainable parameters. However, as models keep growing, getting strong performance from LoRA requires a larger rank k, which erodes the computational savings over full fine-tuning.

Before RoSA, LoRA represented the state of the art among PEFT methods. Follow-up techniques, such as different matrix factorizations or a small number of additional fine-tuned parameters, brought only modest improvements.

Robust Adaptation (RoSA)

Robust Adaptation (RoSA) introduces a new parameter-efficient fine-tuning method. Rather than relying solely on low-rank structure, RoSA is inspired by robust principal component analysis (robust PCA).

In traditional principal component analysis, the data matrix X is approximated by a single low-rank matrix. Robust PCA goes a step further and decomposes X into a clean low-rank component L and a sparse component S that absorbs the "contaminated/corrupted" entries, so that X = L + S.
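To make the low-rank plus sparse idea concrete, here is a small NumPy sketch. It uses a simple alternating heuristic (truncated SVD for L, then keeping the m largest residual entries for S) as a stand-in; this is an assumption chosen for brevity, not the convex robust PCA algorithm and not RoSA's training procedure. All names and sizes are illustrative.

```python
import numpy as np

def truncate_rank(M, k):
    """Best rank-k approximation of M in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def keep_largest(M, m):
    """Keep the m largest-magnitude entries of M and zero out the rest."""
    out = np.zeros_like(M)
    if m > 0:
        idx = np.argpartition(np.abs(M).ravel(), -m)[-m:]
        out.flat[idx] = M.flat[idx]
    return out

def low_rank_plus_sparse(X, k, m, iters=20):
    """Alternate between a rank-k fit of X - S and an m-sparse fit of X - L."""
    S = np.zeros_like(X)
    for _ in range(iters):
        L = truncate_rank(X - S, k)
        S = keep_largest(X - L, m)
    return L, S

rng = np.random.default_rng(0)
n, k, m = 30, 2, 15
L_true = rng.normal(size=(n, k)) @ rng.normal(size=(k, n))       # clean low-rank part
S_true = np.zeros((n, n))
S_true.flat[rng.choice(n * n, size=m, replace=False)] = 10.0    # a few large corruptions
X = L_true + S_true

L, S = low_rank_plus_sparse(X, k, m)
err_low_rank_only = np.linalg.norm(X - truncate_rank(X, k))
err_low_rank_plus_sparse = np.linalg.norm(X - L - S)
print(err_low_rank_only, err_low_rank_plus_sparse)
```

Each step of the loop is the best possible update for its own block, so the combined fit is never worse than the plain rank-k fit; the sparse term picks up the large isolated entries that a low-rank matrix cannot represent cheaply.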

RoSA draws inspiration from this and decomposes the fine-tuning of the language model into:

A low-rank adaptation (L) matrix, similar to LoRA, fine-tuned to approximate the dominant task-related signal.

A highly sparse fine-tuning (S) matrix containing a very small number of large, selectively fine-tuned parameters that encode the residual signal L misses.

Explicitly modeling the residual sparse component allows RoSA to achieve higher accuracy than LoRA alone.
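The two components can be combined as in the following NumPy sketch. It assumes the adapted weight is the frozen weight plus a low-rank product plus a masked sparse matrix; the function name, the random mask and the shapes are hypothetical and serve only to show how the pieces fit together and how the trainable parameters are counted.

```python
import numpy as np

def rosa_style_weight(W, A, B, S_values, S_mask):
    """Frozen W plus a low-rank update B @ A plus a sparse update under S_mask."""
    return W + B @ A + S_values * S_mask

rng = np.random.default_rng(0)
d_out, d_in, k, m = 32, 24, 4, 10
W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(size=(k, d_in))       # trainable low-rank factor
B = rng.normal(size=(d_out, k))      # trainable low-rank factor

# Sparse component: only the m masked positions are trainable.
S_mask = np.zeros((d_out, d_in))
S_mask.flat[rng.choice(d_out * d_in, size=m, replace=False)] = 1.0
S_values = rng.normal(size=(d_out, d_in))   # entries outside the mask are ignored

W_eff = rosa_style_weight(W, A, B, S_values, S_mask)
delta = W_eff - W                            # total change applied to the layer
trainable = k * (d_out + d_in) + m           # low-rank factors plus sparse entries
print(trainable, d_out * d_in)
```

The total update differs from the pure low-rank update in at most m positions, which is what lets the sparse term correct a handful of individual weights without raising the rank k.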

RoSA constructs L by performing a low-rank decomposition of the model's head matrix, which encodes underlying semantic representations useful for downstream tasks. RoSA then selectively fine-tunes the top m most important parameters of each layer to form S, while all other parameters remain unchanged. This step captures the residual signal that a low-rank fit cannot represent.
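Selecting the top m parameters per layer might look like the sketch below. The article does not say how importance is measured, so the sketch assumes a per-parameter score (for example an accumulated gradient magnitude) and picks the m largest by absolute value; the scores here are random placeholders and every name is hypothetical.

```python
import numpy as np

def top_m_mask(scores, m):
    """Boolean mask marking the m largest-magnitude entries of scores."""
    mask = np.zeros(scores.shape, dtype=bool)
    if m > 0:
        idx = np.argpartition(np.abs(scores).ravel(), -m)[-m:]
        mask.flat[idx] = True
    return mask

rng = np.random.default_rng(0)
# Placeholder importance scores for three layers (assumed, not from the article).
layer_scores = {f"layer_{i}": rng.normal(size=(16, 16)) for i in range(3)}
m = 8

# Each layer gets its own mask, computed only from that layer's scores.
masks = {name: top_m_mask(s, m) for name, s in layer_scores.items()}
for name, mask in masks.items():
    print(name, int(mask.sum()))
```

Only the weights under each mask would receive updates during fine-tuning; everything else stays at its pretrained value.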

The m parameters fine-tuned in S are an order of magnitude fewer than the parameters LoRA alone would have to add by raising its rank k. Therefore, combined with the low-rank head matrix in L, RoSA maintains extremely high parameter efficiency.

RoSA also uses some other simple but effective optimizations:

Residual sparse connection: The S residuals are added directly to the output of each transformer block before it passes through the layer normalization and feedforward sublayers. This lets S model the signal missed by L.

Independent sparse mask: The indices selected in S for fine-tuning are generated independently for each transformer layer.

Shared low-rank structure: The same low-rank base matrices U and V are shared between all layers of L, just like in LoRA. This captures semantic concepts in a consistent subspace.

These architectural choices give RoSA modeling flexibility akin to full fine-tuning while keeping optimization and inference parameter-efficient. By combining robust low-rank adaptation with highly sparse residuals, RoSA sets a new state of the art in the accuracy-efficiency trade-off.

Experiments and Results

The researchers evaluated RoSA on a comprehensive benchmark of 12 NLU datasets covering tasks such as text detection, sentiment analysis, natural language inference and robustness testing. The experiments applied RoSA to an AI-assistant LLM with 12 billion parameters.

On every task, RoSA performs significantly better than LoRA under the same parameter budget. Both methods train approximately 0.3% of the model's parameters. That corresponds to about 4.5 million fine-tuned parameters in both cases, with k = 16 for LoRA and m = 5120 for RoSA.

RoSA also matches or exceeds the performance of pure sparse fine-tuned baselines.

On the ANLI benchmark, which evaluates robustness to adversarial examples, RoSA scores 55.6, while LoRA scores 52.7. This suggests improved generalization and robustness.

For the sentiment analysis tasks SST-2 and IMDB, RoSA reaches accuracies of 91.2% and 96.9%, while LoRA reaches 90.1% and 95.3%.

On WiC, a challenging word sense disambiguation test, RoSA has an F1 score of 93.5, while LoRA has an F1 score of 91.7.

Across all 12 datasets, RoSA generally shows better performance than LoRA under matching parameter budgets.

Notably, RoSA is able to achieve these gains without requiring any task-specific tuning or specialization. This makes RoSA suitable for use as a universal PEFT solution.

Summary

As language models continue to grow rapidly, reducing the computational cost of fine-tuning them is an urgent problem. Parameter-efficient fine-tuning techniques like LoRA have shown initial success but face the inherent limitations of low-rank approximation.

RoSA combines robust low-rank decomposition with highly sparse residual fine-tuning to provide a convincing new solution. It improves PEFT performance by using selective sparse residuals to capture the signal that escapes the low-rank fit. Empirical evaluation shows significant improvements over LoRA and over pure sparse fine-tuning baselines across a varied set of NLU tasks.

RoSA is conceptually simple but highly performant, and it can further advance research at the intersection of parameter efficiency, adaptive representations and continual learning.

source:51cto.com