Eight questions and eight answers to understand the inner workings of Transformer

WBOY

Aug 07, 2024 pm 06:04 PM

deep learning project

Seven years ago, the paper "Attention Is All You Need" proposed the transformer architecture and upended the entire field of deep learning.

Today, all major models are based on the transformer architecture, yet how the transformer works internally remains an unsolved mystery.

Last year, Llion Jones, one of the authors of the transformer paper, announced the founding of the artificial intelligence company Sakana AI. Recently, Sakana AI published a paper titled "Transformer Layers as Painters", which explores the flow of information in pre-trained transformers through a series of experiments on frozen decoder-only and encoder-only transformer models. Note that the study performed no fine-tuning of any kind on the pre-trained models.


Paper address: https://arxiv.org/pdf/2407.09298v1

The study argues that the internal mechanism of a transformer (especially its middle layers) can be understood by analogy with a pipeline of painters.

A painting pipeline usually passes the canvas (input) to a series of painters. Some painters are good at painting birds, while others are good at painting wheels. Each painter receives the canvas from the painter below it, and then it decides whether to add some strokes to the painting, or just pass it to the painter above it (using residual connections).
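
The painter pipeline can be sketched in a few lines of Python. This is only a toy illustration of the analogy, not the paper's code: `make_painter` and `run_stack` are hypothetical names, and each "layer" simply returns an update that is added to the canvas, which is the role the residual connection plays.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_painter(index: int, strength: float) -> Layer:
    """Hypothetical toy layer: proposes `strength` extra paint on one coordinate."""
    def update(canvas: Vector) -> Vector:
        delta = [0.0] * len(canvas)
        delta[index % len(canvas)] = strength
        return delta
    return update


def run_stack(canvas: Vector, layers: List[Layer]) -> Vector:
    """Pass the canvas up the stack; each layer's update is added to the canvas."""
    for layer in layers:
        update = layer(canvas)
        # Residual connection: a zero update passes the canvas through unchanged.
        canvas = [c + u for c, u in zip(canvas, update)]
    return canvas


layers = [make_painter(0, 1.0), make_painter(1, 0.0), make_painter(2, 2.0)]
result = run_stack([0.0, 0.0, 0.0], layers)
print(result)  # [1.0, 0.0, 2.0]
```

The second painter has strength 0.0, so it adds nothing and the canvas reaches the third painter exactly as the first one left it.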

This analogy is not a strict theory, but a tool for thinking about transformer layers. Inspired by this analogy, the study tested and verified some hypotheses:

Are all layers using the same representation space?
Are all layers necessary?
Do the middle layers all perform the same function?
Is the order of layers important?
Can these layers run in parallel?
For some tasks, is order more important than other factors?
Does looping help parallelized layers?
Which variants harm model performance the least?

The study ran a series of experiments on pre-trained LLMs, trying variations on the standard transformer execution strategy and measuring the impact of these changes on model performance across various benchmarks for decoder-only (Llama) and encoder-only (BERT) models.

Are all layers using the same representation space?

To answer whether different layers use the same representation space, the authors tested whether the transformer is robust to skipping specific layers or switching the order of adjacent layers. For example, in Llama2-7B, layer 6 normally expects to receive the output of layer 5. If layer 6 is given the output of layer 4 instead, will it behave "catastrophically"?

In Figure 2, we can see that, except for the first and last few layers, the layers of Llama2-7B are quite robust to skipping or switching.


This experiment suggests that the middle layers share a representation space, and that this space differs from that of the "peripheral" layers (the first and last few layers). To test this hypothesis further, the authors followed previous studies and measured the average cosine similarity between hidden-state activations at different layers of the baseline models (Llama2-7B, Llama2-13B and BERT-Large). Figure 3 shows consistency across all the middle layers.


This suggests that the model may have three distinct representation spaces, one each for the "beginning", "middle" and "end" layers. Answer to question 1: Yes, the middle layers seem to share a common representation space.
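
The layer-to-layer comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' measurement code: `hidden_states[l][t]` is assumed to hold the activation vector of token `t` after layer `l`, similarities are averaged over token positions, and the toy numbers are made up.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def layer_similarity_matrix(hidden_states: List[List[Vector]]) -> List[List[float]]:
    """Entry [i][j] is the mean cosine similarity between layers i and j over tokens."""
    n = len(hidden_states)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sims = [cosine_similarity(u, v)
                    for u, v in zip(hidden_states[i], hidden_states[j])]
            matrix[i][j] = sum(sims) / len(sims)
    return matrix


# Made-up activations: 3 layers, 2 tokens, 2 dimensions.
hidden = [
    [[1.0, 0.0], [0.0, 1.0]],  # layer 0
    [[2.0, 0.0], [0.0, 3.0]],  # layer 1: same directions as layer 0
    [[0.0, 1.0], [1.0, 0.0]],  # layer 2: orthogonal to layer 0
]
similarity = layer_similarity_matrix(hidden)
```

In this toy data, layers 0 and 1 point in the same directions (similarity 1.0) while layer 2 is orthogonal to both (similarity 0.0). A block of high values among the middle layers of such a matrix is the kind of pattern the article describes for Figure 3.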

Are all layers necessary?

To further test whether the representation space of the middle layers is truly shared (beyond having close cosine similarity), the study tried "skipping layers", that is, sending the output of layer N directly to layer N + M (where M > 1), thereby skipping M − 1 layers, as shown in Figure 1a. The experiment checks whether layer N + M can make sense of the activations of layer N even though it was trained only on input from layer N + M − 1. Figure 4 shows that both Llama2-7B and BERT-Large degrade only modestly on many benchmarks. Answer to question 2, are all layers necessary:

No, at least some of the middle layers can be removed without catastrophic failure.
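
The skip variant can be written as an execution schedule, that is, the list of layer indices that actually run. The sketch below assumes 1-based layer numbering. `skip_schedule` is a hypothetical helper, and the values of N and M are chosen purely for illustration.

```python
from typing import List


def skip_schedule(total_layers: int, n: int, m: int) -> List[int]:
    """Layers executed when the output of layer n is fed directly to layer n + m.

    Layers n + 1 .. n + m - 1 (that is, m - 1 layers) are never run.
    """
    if m < 1 or n + m > total_layers:
        raise ValueError("need 1 <= m and n + m <= total_layers")
    return list(range(1, n + 1)) + list(range(n + m, total_layers + 1))


# Illustration for a 32-layer model: layer 10 feeds layer 15, skipping layers 11-14.
schedule = skip_schedule(total_layers=32, n=10, m=5)
print(len(schedule))  # 28
```

With a frozen model, applying such a schedule only changes which layers are called and in what order; no weights are modified.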


Do the middle layers all perform the same function?

If the middle layers all share a common representation space, does that make the other middle layers redundant? To test this, the researchers reran the "skip" experiment from the previous subsection, but instead of skipping the middle layers they replaced their weights with the weights of the center layer, effectively running the center layer T − 2N + 1 times in a row, where T is the total number of layers (32 for Llama2-7B and 24 for BERT-Large).


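As Figure 5 shows, the model's benchmark scores drop rapidly as the number of replaced layers increases. As Figure 11 (below) shows, this way of replacing layers performs worse than any other method the researchers tried. The researchers therefore concluded that the middle layers perform different functions and that sharing weights among the middle layers is not feasible.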
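
The middle-repeat variant can also be expressed as an execution schedule. The sketch below assumes 1-based layer numbering and assumes that the replaced span is layers N through T − N inclusive, which is what yields T − 2N + 1 repetitions. `middle_repeat_schedule` is a hypothetical helper, and picking the middle element of the span as the "center layer" is likewise an assumption.

```python
from typing import List


def middle_repeat_schedule(total_layers: int, n: int) -> List[int]:
    """Replace layers n .. total_layers - n with repeated copies of the center layer."""
    middle = list(range(n, total_layers - n + 1))  # T - 2N + 1 layers
    if not middle:
        raise ValueError("n is too large: the middle span is empty")
    center = middle[len(middle) // 2]
    first = list(range(1, n))
    last = list(range(total_layers - n + 1, total_layers + 1))
    return first + [center] * len(middle) + last


# Illustration for a 32-layer model with N = 10: layers 10-22 all become layer 16.
schedule = middle_repeat_schedule(total_layers=32, n=10)
print(schedule.count(16))  # 13, i.e. T - 2N + 1
```

The schedule has the same length as the original stack, so the compute cost is unchanged; only the weights used in the middle span differ.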


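Is the order of layers important?

The previous experiments show that the middle layers share a representation space but perform different functions within it. The next question is how much the order of these functions matters. To answer it, the researchers designed two sets of experiments. In the first, they ran the middle layers in the reverse of the order in which they were trained. Specifically, they took the output of layer T − N and fed it into layer T − N − 1, then fed that layer's output into layer T − N − 2, and so on down to layer N, and then sent the output of that layer to the subsequent T − N layers. In the second set of experiments, the researchers ran the middle layers in random order and averaged the results over 10 seeds.

Figures 6 and 7 show the results of running the middle layers in reverse and in random order, respectively; in both cases the model degrades gradually across all benchmark test sets. This indicates that although layer order matters to the model to some degree, the layers can still function when their order is changed.

More interestingly, randomly shuffling the layer order works better than fully reversing it. This may be because a random shuffle preserves some of the original relationships between layers (that is, layer i comes after layer j, where i > j), whereas a full reversal destroys these relationships completely.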
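
The point about preserved relationships can be made concrete by counting, for a given execution order, the fraction of layer pairs that still run in their original relative order. This is an illustrative sketch only. `preserved_pair_fraction` is a hypothetical helper, and the layer range 10 to 22 is an arbitrary example of a middle span.

```python
import random
from typing import List


def preserved_pair_fraction(order: List[int]) -> float:
    """Fraction of pairs where the lower-numbered layer still runs before the higher one."""
    pairs = 0
    preserved = 0
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            pairs += 1
            if order[a] < order[b]:
                preserved += 1
    return preserved / pairs


middle = list(range(10, 23))       # example middle span, original order
reversed_order = middle[::-1]      # reversed variant

# Random variant, averaged over 10 seeds as in the experiment description.
fractions = []
for seed in range(10):
    shuffled = middle[:]
    random.Random(seed).shuffle(shuffled)
    fractions.append(preserved_pair_fraction(shuffled))
mean_random = sum(fractions) / len(fractions)

print(preserved_pair_fraction(middle))          # 1.0
print(preserved_pair_fraction(reversed_order))  # 0.0
print(mean_random)                              # strictly between 0 and 1
```

A full reversal preserves no pair at all, while a uniformly random shuffle preserves about half of the pairs in expectation, which is consistent with the explanation given above.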

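Can these layers run in parallel?

To verify that the presence of the layers matters more than the order in which they execute, the researchers designed an experiment that runs the middle layers in parallel and sends their averaged result to the final N layers.

As Figure 8 shows, the model's performance declines gently on all benchmarks, with the exception of the GSM8K math word problems.

The results indicate that this approach works in most cases but handles some complex math problems poorly. Running the layers in parallel performs better than skipping layers, but not as well as running the layers in reverse order. On this basis, the researchers concluded that running layers in parallel is feasible in the general case, but may not be suitable for math problems that require sequential logical reasoning.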
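
A minimal sketch of the parallel variant is shown below: the first layers run in sequence, every middle layer receives the same input, their outputs are averaged, and the average goes to the final layers. The toy layers (`add_constant`) are hypothetical stand-ins for real transformer blocks, and each one is assumed to return its full output, residual included.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add_constant(c: float) -> Layer:
    """Hypothetical toy layer that shifts every coordinate by c."""
    return lambda x: [v + c for v in x]


def run_parallel(x: Vector, first: List[Layer], middle: List[Layer],
                 last: List[Layer]) -> Vector:
    """Run `first` in order, `middle` in parallel (averaged), then `last` in order."""
    for layer in first:
        x = layer(x)
    outputs = [layer(x) for layer in middle]  # every middle layer sees the same input
    x = [sum(values) / len(outputs) for values in zip(*outputs)]
    for layer in last:
        x = layer(x)
    return x


result = run_parallel(
    [0.0, 1.0],
    first=[add_constant(1.0)],
    middle=[add_constant(2.0), add_constant(4.0)],
    last=[add_constant(10.0)],
)
print(result)  # [14.0, 15.0]
```

Because no middle layer sees another middle layer's output, any computation that depends on one step feeding the next is lost, which fits the weaker GSM8K results described above.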


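For some tasks, is order more important than other factors?

Most of the "modified" models show their steepest declines on the abstract reasoning (ARC) and mathematical reasoning (GSM8K) benchmarks. This may be because step-by-step reasoning tasks are far more sensitive to the order of model layers than commonsense tasks that rely mainly on semantic understanding. Unlike tasks that can be accomplished by understanding semantics alone, reasoning tasks require the model to grasp both structure and meaning. This observation is consistent with the hypothesis that the model may perform some degree of order-dependent reasoning within a single forward pass.

The researchers illustrated this with a metaphor: when painting a collage made up of many different elements, the order of painting may not matter much, but when painting a precise architectural scene, the order of each stroke becomes very important. On this basis, the researchers concluded that math and reasoning tasks depend heavily on the order of model layers, whereas order matters comparatively little for tasks that rely mainly on semantic understanding.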

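Does looping help parallelized layers?

Continuing the painting metaphor from the previous section: a painter does not paint everything at once. They first paint one part, such as the car body, and then add other things, such as the wheels, on top of it. In an AI model, the layers are the painters and processing information is the painting. If the right information arrives first, that is, if the car body is drawn first, a layer can do its work better and contribute to the painting by adding the wheels.

For a transformer, a layer may contribute to the forward pass only when it is given appropriate input, rather than simply "passing" the input along through the residual connection. If so, iterating the parallel layers from the previous experiment should improve model performance compared with executing them once. To test this, the researchers fed the averaged output of the parallel layers back into the same layers for a fixed number of iterations.

Figure 9 shows the results of looping the parallel layers three times. Looping three times gives significantly better results than a single iteration (the plain parallel layers). When the starting layer N is set to 15 (for Llama2-7B) or 11 (for BERT), which is the leftmost point in each plot, only a single layer is affected. In this particular case, looping the parallel layers three times is equivalent to simply repeating the middle layer three times. At this point, the performance of the parallel layers is also indistinguishable from that of the full model.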
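
The looped-parallel variant can be sketched by feeding the averaged output back into the same set of parallel layers. The code below is a toy illustration with a hypothetical `halve_and_shift` layer, not the paper's implementation. It also checks the single-layer special case described above, where looping three times equals applying that one layer three times in sequence.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def looped_parallel(x: Vector, middle: List[Layer], iterations: int) -> Vector:
    """Average the parallel layers' outputs and feed the result back `iterations` times."""
    for _ in range(iterations):
        outputs = [layer(x) for layer in middle]
        x = [sum(values) / len(outputs) for values in zip(*outputs)]
    return x


def halve_and_shift(x: Vector) -> Vector:
    """Hypothetical toy layer: 0.5 * v + 1 for every coordinate."""
    return [0.5 * v + 1.0 for v in x]


looped = looped_parallel([0.0], [halve_and_shift], iterations=3)
sequential = halve_and_shift(halve_and_shift(halve_and_shift([0.0])))
print(looped, sequential)  # [1.75] [1.75]
```

With more than one layer in `middle`, each iteration lets every layer react to the averaged contribution of all the others from the previous round, which is the intuition behind expecting looping to recover some of the lost ordering.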


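The researchers also repeated the same experiment with different numbers of iterations. Figure 10 shows the performance of Llama2-7B as a function of the number of parallelized layers M and the number of iterations. The best-performing iteration count for each M is marked with a red box. Except for M = 29 and M = 31 (where almost all layers are parallelized), the optimal number of iterations is roughly linearly proportional to the number of parallelized layers. The researchers therefore concluded that the optimal number of iterations is proportional to the number of parallelized layers.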


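How can layers be adjusted with the least impact on model performance?

Finally, in Figure 11, the researchers compare all of the transformer "modifications" from the experiments, plotting the median or mean across all benchmarks.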


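Middle repeat (replacing the middle layers with the same number of copies of the center layer) performed worst, quickly degrading to random-baseline performance. By contrast, looped parallel and random layer order had the least impact. The researchers therefore concluded that repeating a single layer does the most damage, while randomizing the layer order and looping the parallel layers do the least.

Although these experiments show graceful overall degradation, the researchers still do not know why the layers remain somewhat robust under most perturbations. This question needs further investigation in future work.

See the original paper for more details.

Reference link: https://arxiv.org/pdf/2407.09298v1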
