
Reducing Transformer rank improves performance: over 90% of components in specific layers can be removed without degrading the LLM

WBOY
Release: 2024-01-13 21:51:06

In joint research, MIT and Microsoft found that it is possible to improve the task performance of large language models, and reduce their size, without any additional training.

In the era of large models, the Transformer underpins research across the field with its unique capabilities. Since its introduction, Transformer-based large language models (LLMs) have demonstrated excellent performance on a wide variety of tasks. The Transformer architecture has become the state of the art for natural language modeling and reasoning, and shows strong promise in fields such as computer vision and reinforcement learning.

However, the current Transformer architecture is very large and usually requires a lot of computing resources for training and inference.

This makes sense: a Transformer trained with more parameters or more data is clearly more capable than smaller models. However, a growing number of studies have shown that Transformer-based models, and neural networks in general, do not need to retain all of their learned parameters to preserve what they have learned.

In general, over-parameterization appears to help during training, but these models can be heavily pruned before inference. Studies have shown that neural networks can often have more than 90% of their weights removed without any significant drop in performance. This phenomenon has sparked researchers' interest in pruning strategies that aid model inference.

Researchers from MIT and Microsoft, in the paper "The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction", made a surprising finding: careful pruning at specific layers of a Transformer model can significantly improve its performance on certain tasks.


  • Paper link: https://arxiv.org/pdf/2312.13558.pdf

  • Paper home page: https://pratyushasharma.github.io/laser/

The study calls this simple intervention LASER (Layer-Selective Rank Reduction): using singular value decomposition to selectively reduce the higher-order components of the learned weight matrices in specific layers of a Transformer model significantly improves LLM performance. The operation can be performed after training is complete and requires no additional parameters or data.
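As a concrete illustration, here is a minimal PyTorch sketch of this kind of SVD-based rank reduction. It is not the authors' released code; the matrix size and the kept rank are purely illustrative.

```python
import torch

def low_rank_approximation(weight: torch.Tensor, k: int) -> torch.Tensor:
    """Return the best rank-k approximation of `weight` (in the Frobenius norm)."""
    # Thin SVD: U is (m, r), S is (r,), Vh is (r, n) with r = min(m, n).
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    # Keep the k largest singular values/vectors and drop the rest
    # (the "higher-order components" in the article's terminology).
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

# Example: prune a stand-in weight matrix after training, keeping ~1% of its rank.
with torch.no_grad():
    W = torch.randn(4096, 4096)
    W_pruned = low_rank_approximation(W, k=40)
```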

In this operation, the reduction is applied to a specific weight matrix in a specific layer of the model. The study found that many such matrices can be sharply reduced in rank, and that no performance degradation is typically observed until more than 90% of the components are removed. The study also found that these reductions can significantly improve accuracy, and the finding does not appear to be limited to natural language: performance improvements were also observed in reinforcement learning.

In addition, the study attempts to infer what is stored in the higher-order components, such that deleting them improves performance. The researchers found that after LASER the model answers questions correctly, whereas before the intervention the original model mostly responded with high-frequency words (such as "the", "of", etc.) that are not even of the same semantic type as the correct answer. In other words, without the intervention these components cause the model to generate irrelevant high-frequency words.

However, after a certain degree of rank reduction, the model's answer flips to the correct one.

To understand this, the study also explores what the remaining components individually encode, approximating the weight matrix using only their higher-order singular vectors. These components were found to describe either different responses in the same semantic category as the correct answer, or common high-frequency words.

These results suggest that when the noisy higher-order components are combined with the lower-order components, their conflicting responses average out to an answer that may be incorrect. Figure 1 gives a visual representation of the Transformer architecture and the procedure LASER follows: the weight matrix of a specific layer of the multilayer perceptron (MLP) is replaced by its low-rank approximation.

LASER OVERVIEW

This section introduces the LASER intervention in detail. A single-step LASER intervention is defined by a triplet (τ, ℓ, ρ), consisting of a parameter type τ, a layer number ℓ, and a rank-reduction fraction ρ. Together, these values describe which matrix is replaced by its low-rank approximation and how aggressive the approximation is. The researchers categorize the matrices they intervene on by parameter type.

The researchers focus on the matrices in W = {W_q, W_k, W_v, W_o, U_in, U_out}, i.e. the matrices that make up the MLP and attention layers. The layer number ℓ indicates the layer at which the intervention is made (the first layer is indexed from 0). For example, Llama-2 has 32 layers, so ℓ ∈ {0, 1, 2, ..., 31}.

Finally, ρ ∈ [0, 1) specifies what fraction of the maximum rank should be preserved in the low-rank approximation. For example, if the matrix being intervened on is d × d, its maximum rank is d, and the researchers replace it with its rank-⌊ρ·d⌋ approximation.
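A hedged sketch of how a single (τ, ℓ, ρ) intervention might be applied, reusing the `low_rank_approximation` helper above; `get_weight` is a hypothetical accessor, since the actual attribute names depend on the model implementation.

```python
import math
import torch

def laser_intervention(model, tau: str, layer: int, rho: float) -> None:
    """Replace matrix `tau` in block `layer` with its rank-⌊ρ·d⌋ approximation."""
    # `get_weight` is a placeholder: it should return the torch.Tensor for the chosen
    # parameter type, e.g. tau in {"W_q", "W_k", "W_v", "W_o", "U_in", "U_out"}.
    weight = get_weight(model, layer, tau)
    d = min(weight.shape)              # maximum possible rank of the matrix
    k = max(1, math.floor(rho * d))    # rank kept by the ⌊ρ·d⌋ approximation
    with torch.no_grad():
        weight.copy_(low_rank_approximation(weight, k))
```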

Figure 1 shows an example of LASER. In the figure, τ = U_in and ℓ = L indicate that the weight matrix of the first layer of the MLP in the L-th Transformer block is updated. Another parameter controls k in the rank-k approximation.


LASER can restrict the flow of certain information in the network and, unexpectedly, yield significant performance benefits. These interventions can also be easily composed, so that a set of interventions can be applied in any order.

The LASER method is simply a search over such interventions for the modification that delivers the greatest benefit. There are, however, many other ways to combine these interventions, which is left as a direction for future work.


In the experiments, the researchers used a GPT-J model pre-trained on the PILE dataset; the model has 27 layers and 6 billion parameters. The model's behavior was then evaluated on the CounterFact dataset, which contains samples of (subject, relation, answer) triples, with three paraphrase prompts provided for each question.

The first analysis is of the GPT-J model on the CounterFact dataset. Figure 2 below shows the effect on the dataset's classification loss of applying different amounts of rank reduction to each matrix in the Transformer architecture. Each Transformer layer contains a small two-layer MLP, whose input and output matrices are shown separately; different colors represent different percentages of removed components.
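The sweep behind this kind of figure could look roughly like the sketch below; `model`, `num_layers`, `evaluate_loss`, and `counterfact_dataset` are placeholders rather than the authors' actual evaluation harness.

```python
import copy

results = {}
for tau in ["U_in", "U_out"]:               # MLP input/output matrices, plotted separately in Figure 2
    for layer in range(num_layers):         # e.g. 27 layers for the GPT-J setup described above
        for rho in (0.2, 0.1, 0.05, 0.01):  # fraction of the maximum rank that is kept
            pruned = copy.deepcopy(model)   # intervene on a copy so runs stay independent
            laser_intervention(pruned, tau, layer, rho)
            results[(tau, layer, rho)] = evaluate_loss(pruned, counterfact_dataset)
```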


Regarding improved accuracy and robustness, as shown in Figure 2 above and Table 1 below, the researchers found that when rank reduction is applied to a single layer, the factual accuracy of the GPT-J model on the CounterFact dataset rises from 13.1% to 24.0%. It is important to note that these improvements come from rank reduction alone and do not involve any further training or fine-tuning of the model.


The researchers also asked which facts in the dataset are recovered by rank reduction. They found that the facts recovered through rank reduction rarely appear in the data, as shown in Figure 3.


What do the higher-order components store? To answer this, the researchers approximate the final weight matrix using the higher-order components; unlike LASER, they do not use the lower-order components for the approximation, as shown in Figure 5(a). They then measure the average cosine similarity between the true answer and the predicted answer when approximating the matrix with different numbers of higher-order components, as shown in Figure 5(b).
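For intuition, here is a small sketch of this complementary approximation: instead of keeping the largest singular values as LASER does, it rebuilds the matrix from the smallest ones (the "higher-order components" in the article's terminology). The cosine-similarity measurement over answers is not shown, since the article does not specify the embedding used.

```python
import torch

def high_order_approximation(weight: torch.Tensor, num_components: int) -> torch.Tensor:
    """Rebuild `weight` from only its `num_components` smallest singular triplets."""
    # Singular values come back in descending order, so the tail slices pick
    # the components that LASER would normally discard.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return (
        U[:, -num_components:]
        @ torch.diag(S[-num_components:])
        @ Vh[-num_components:, :]
    )
```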


Finally, the researchers evaluated how well their findings generalize to three different LLMs on multiple language understanding tasks. For each task, they measured model performance with three metrics: generation accuracy, classification accuracy, and loss. As shown in Table 1 above, even a large rank reduction does not degrade model accuracy and can instead improve model performance.
