New work from Bengio et al.: attention can be viewed as an RNN; the new model rivals Transformer while using far less memory

WBOY

Jun 09, 2024, 04:50 PM

Theory

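Advances in sequence modeling are highly impactful because they play an important role in a wide range of applications, including reinforcement learning (e.g., robotics and autonomous driving) and time series classification (e.g., financial fraud detection and medical diagnosis).

Over the past few years, the arrival of the Transformer marked a major breakthrough in sequence modeling, largely because it offers a high-performance architecture that can exploit GPU parallelism.

However, Transformers are computationally expensive at inference time, mainly because their memory and compute requirements scale quadratically, which limits their use in low-resource settings (e.g., mobile and embedded devices). Although techniques such as KV caching can improve inference efficiency, Transformers remain very expensive for low-resource domains because they require (1) memory that grows linearly with the number of tokens and (2) caching all preceding tokens for the model. The problem becomes more severe in settings with long contexts (i.e., many tokens).

To address this problem, researchers from Borealis AI (the AI research institute of the Royal Bank of Canada) and the Université de Montréal propose a solution in the paper "Attention as an RNN". Notably, Turing Award winner Yoshua Bengio appears among the authors.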

Paper link: https://arxiv.org/pdf/2405.13956
Paper title: Attention as an RNN

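Specifically, the researchers first examine the attention mechanism in the Transformer, the component responsible for its quadratic computational complexity. The study shows that attention can be viewed as a special recurrent neural network (RNN) with the ability to compute its many-to-one RNN output efficiently. Using this RNN formulation of attention, the study shows that popular attention-based models (such as Transformer and Perceiver) can be viewed as RNN variants.

However, unlike traditional RNNs such as LSTM and GRU, these popular attention models cannot be updated efficiently with new tokens, even though they can be viewed as RNN variants.

To solve this, the study introduces a new attention formulation based on the parallel prefix scan algorithm, which efficiently computes attention's many-to-many RNN output and thereby enables efficient updates.

Building on this new attention formulation, the study proposes Aaren ([A]ttention [a]s a [re]current neural [n]etwork), a computationally efficient module that can not only be trained in parallel like a Transformer but also be updated efficiently like an RNN.

Experimental results show that Aaren performs comparably to Transformer on 38 datasets spanning four common sequential settings (reinforcement learning, event prediction, time series classification, and time series prediction) while being more efficient in time and memory.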

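Method

To address the problems above, the authors propose an efficient attention-based module that can exploit GPU parallelism while also being efficiently updatable.

First, in Section 3.1, the authors show that attention can be viewed as an RNN with the special ability to compute the output of a many-to-one RNN efficiently (Figure 1a). Using this RNN formulation of attention, they further show that popular attention-based models such as Transformer (Figure 1b) and Perceiver (Figure 1c) can be viewed as RNNs. Unlike traditional RNNs, however, these models cannot update themselves efficiently with new tokens, which limits their potential in sequence problems where data arrives as a stream.

To address this, in Section 3.2 the authors introduce an efficient method, based on the parallel prefix scan algorithm, for computing attention as a many-to-many RNN. Building on it, Section 3.3 introduces Aaren, a computationally efficient module that can be trained in parallel (like a Transformer) and can also be updated efficiently with new tokens at inference time, requiring only constant memory for inference (like a traditional RNN).

Consider attention as a many-to-one RNN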

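Attention for a query vector $q$ can be viewed as a function that maps $N$ context tokens $x_{1:N}$, via their keys and values $\{(k_i, v_i)\}_{i=1}^{N}$, to a single output $o_N = \mathrm{Attention}(q, k_{1:N}, v_{1:N})$. Given $s_i = \mathrm{dot}(q, k_i)$, the output $o_N$ can be written as:

$$o_N = \sum_{i=1}^{N} \mathrm{softmax}(s)_i \, v_i = \frac{\sum_{i=1}^{N} \exp(s_i)\, v_i}{\sum_{i=1}^{N} \exp(s_i)} = \frac{\hat{a}_N}{\hat{c}_N}$$

where the numerator is $\hat{a}_N = \sum_{i=1}^{N} \exp(s_i)\, v_i$ and the denominator is $\hat{c}_N = \sum_{i=1}^{N} \exp(s_i)$. Viewing attention as an RNN, both can be computed iteratively as rolling sums for $k = 1, \ldots, N$: $\hat{a}_k = \hat{a}_{k-1} + \exp(s_k)\, v_k$ and $\hat{c}_k = \hat{c}_{k-1} + \exp(s_k)$. In practice, however, this implementation is unstable: it runs into numerical problems caused by finite-precision representation and exponents (i.e., $\exp(s)$) that may be very small or very large. To mitigate this, the authors rewrite the recurrence with a cumulative maximum term $m_k = \max_{i \in \{1, \ldots, k\}} s_i$, computing $a_k = \sum_{i=1}^{k} \exp(s_i - m_k)\, v_i$ and $c_k = \sum_{i=1}^{k} \exp(s_i - m_k)$ instead. Notably, the final result is the same: $o_N = \hat{a}_N / \hat{c}_N = a_N / c_N$. The terms $a_k$, $c_k$, and $m_k$ are computed recurrently as follows:

$$a_k = a_{k-1} \exp(m_{k-1} - m_k) + v_k \exp(s_k - m_k)$$
$$c_k = c_{k-1} \exp(m_{k-1} - m_k) + \exp(s_k - m_k)$$
$$m_k = \max(m_{k-1}, s_k)$$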

By encapsulating the recurrent computation of a_k, c_k, and m_k from a_(k-1), c_(k-1), and m_(k-1), the authors introduce an RNN cell that iteratively computes the attention output (see Figure 2). The attention RNN cell takes (a_(k-1), c_(k-1), m_(k-1), q) as input and computes (a_k, c_k, m_k, q). Note that the query vector q is carried through the RNN cell unchanged. The initial hidden state of the attention RNN is (a_0, c_0, m_0, q) = (0, 0, 0, q).
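
The recurrent cell described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names (`attention_rnn_step`, `attention_as_rnn`, `softmax_attention`) are hypothetical, vectors are plain lists, and the conventional softmax attention is included only as a reference to compare against.

```python
import math


def attention_rnn_step(state, s_k, v_k):
    """Map (a_{k-1}, c_{k-1}, m_{k-1}) to (a_k, c_k, m_k) for one new token."""
    a_prev, c_prev, m_prev = state
    m_k = max(m_prev, s_k)                 # cumulative maximum of the scores
    scale_prev = math.exp(m_prev - m_k)    # rescales the old running sums
    scale_new = math.exp(s_k - m_k)        # weight of the new token
    a_k = [a * scale_prev + v * scale_new for a, v in zip(a_prev, v_k)]
    c_k = c_prev * scale_prev + scale_new
    return a_k, c_k, m_k


def attention_as_rnn(q, keys, values):
    """Many-to-one attention output o_N, computed token by token."""
    # Initial hidden state as stated in the article: (a_0, c_0, m_0) = (0, 0, 0).
    state = ([0.0] * len(values[0]), 0.0, 0.0)
    for k_i, v_i in zip(keys, values):
        s_i = sum(x * y for x, y in zip(q, k_i))  # s_i = dot(q, k_i)
        state = attention_rnn_step(state, s_i, v_i)
    a_n, c_n, _ = state
    return [a / c_n for a in a_n]


def softmax_attention(q, keys, values):
    """Conventional all-at-once softmax attention, used only as a reference."""
    scores = [sum(x * y for x, y in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]
```

The loop keeps only the triple (a, c, m) between tokens, so its working memory does not grow with the number of tokens. One caveat of this sketch: with m_0 = 0, scores far below zero can underflow to a zero denominator; starting m at negative infinity is one possible workaround, which is an assumption here and not something the article states.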

Methods for computing attention: Viewing attention as an RNN reveals different ways to compute it: recurrently, token by token, in O(1) memory (i.e., sequentially); or in the conventional manner (i.e., in parallel), which requires linear O(N) memory. Since attention can be viewed as an RNN, the conventional method of computing attention can also be viewed as an efficient way of computing the output of attention's many-to-one RNN, that is, an RNN that takes multiple context tokens as input but outputs only one token at the end (see Figure 1a). Finally, attention can also be computed as an RNN that processes tokens chunk by chunk, rather than fully sequentially or fully in parallel, which requires O(b) memory, where b is the chunk size.
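
The chunk-by-chunk variant can be sketched as follows. Each chunk of b tokens is summarized in the conventional way, and the summary is then merged into a running state, so only one chunk plus the running state is needed at a time. The names `chunk_summary`, `merge`, and `chunked_attention` are hypothetical, and the scores s_i = dot(q, k_i) are assumed to be precomputed to keep the example short.

```python
import math


def chunk_summary(scores, values):
    """Summarize one chunk as (a, c, m), all relative to the chunk's own maximum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    c = sum(weights)
    a = [sum(w * v[d] for w, v in zip(weights, values))
         for d in range(len(values[0]))]
    return a, c, m


def merge(left, right):
    """Combine two (a, c, m) summaries by rescaling both to a shared maximum."""
    a_l, c_l, m_l = left
    a_r, c_r, m_r = right
    m = max(m_l, m_r)
    scale_l, scale_r = math.exp(m_l - m), math.exp(m_r - m)
    a = [x * scale_l + y * scale_r for x, y in zip(a_l, a_r)]
    return a, c_l * scale_l + c_r * scale_r, m


def chunked_attention(scores, values, b):
    """Attention output computed b tokens at a time."""
    state = None
    for start in range(0, len(scores), b):
        summary = chunk_summary(scores[start:start + b], values[start:start + b])
        state = summary if state is None else merge(state, summary)
    a, c, _ = state
    return [x / c for x in a]
```

Setting b = 1 gives the fully sequential computation and b = N the fully parallel one. In this sketch the whole input already sits in a list, so the O(b) figure refers to the working set per step; a streaming source would be needed to realize it end to end.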

Viewing existing attention models as RNNs: By treating attention as an RNN, existing attention-based models can also be viewed as RNN variants. For example, the Transformer's self-attention is an RNN (Figure 1b) whose initial hidden states are the context tokens. Perceiver's cross-attention is an RNN (Figure 1c) whose initial hidden states are context-dependent latent variables. By leveraging the RNN formulation of their attention mechanisms, these existing models can compute their outputs in a memory-efficient way.

However, when existing attention-based models (such as Transformers) are viewed as RNNs, they lack important properties that are common in traditional RNNs such as LSTM and GRU.

Notably, LSTM and GRU can update themselves efficiently with new tokens using only O(1) constant memory and computation. In contrast, the RNN view of the Transformer (see Figure 1b) handles a new token by adding a new RNN with the new token as its initial state. This new RNN processes all previous tokens, requiring O(N) linear computation.

In Perceiver, the latent variables (L_i in Figure 1c) are input-dependent because of the architecture, which means their values change when new tokens are received. Since the initial hidden states (i.e., the latent variables) of its RNNs change, Perceiver has to recompute its RNNs from scratch, requiring O(NL) computation, where N is the number of tokens and L is the number of latent variables.

Consider attention as a many-to-many RNN

To address these limitations, the authors propose an attention-based model that leverages the RNN formulation's ability to perform efficient updates. To this end, they first introduce an efficient parallel method for computing attention as a many-to-many RNN, that is, a parallel method for computing $\{o_i = \mathrm{Attention}(q, x_{1:i})\}_{i=1}^{N}$. They use the parallel prefix scan algorithm (see Algorithm 1), a parallel computation method that computes $N$ prefixes from $N$ sequential data points via an associative operator $\oplus$. The algorithm efficiently computes $\{\bigoplus_{i=1}^{k} x_i\}_{k=1}^{N}$ from $\{x_k\}_{k=1}^{N}$.

Recall that $\mathrm{Attention}(q, x_{1:k}) = o_k = a_k / c_k$. To compute $\{\mathrm{Attention}(q, x_{1:k})\}_{k=1}^{N}$ efficiently, one can compute $\{a_k\}_{k=1}^{N}$, $\{c_k\}_{k=1}^{N}$, and $\{m_k\}_{k=1}^{N}$ with the parallel scan algorithm and then combine $a_k$ and $c_k$ to obtain $\mathrm{Attention}(q, x_{1:k})$.

To this end, the authors propose the following associative operator $\oplus$, which acts on triplets of the form $(m_A, u_A, w_A)$, where $A$ is a set of indices, $m_A = \max_{i \in A} s_i$, $u_A = \sum_{i \in A} \exp(s_i - m_A)$, and $w_A = \sum_{i \in A} \exp(s_i - m_A)\, v_i$. The input to the parallel scan algorithm is $\{(m_{\{i\}}, u_{\{i\}}, w_{\{i\}})\}_{i=1}^{N} = \{(s_i, 1, v_i)\}_{i=1}^{N}$. The algorithm recursively applies the operator $\oplus$, which works as follows:

$$(m_A, u_A, w_A) \oplus (m_B, u_B, w_B) = (m_{A \cup B}, u_{A \cup B}, w_{A \cup B})$$

where $m_{A \cup B} = \max(m_A, m_B)$, $u_{A \cup B} = u_A \exp(m_A - m_{A \cup B}) + u_B \exp(m_B - m_{A \cup B})$, and $w_{A \cup B} = w_A \exp(m_A - m_{A \cup B}) + w_B \exp(m_B - m_{A \cup B})$.

After the operator has been applied recursively, the algorithm outputs $\{(m_{\{1,\ldots,k\}}, u_{\{1,\ldots,k\}}, w_{\{1,\ldots,k\}})\}_{k=1}^{N} = \{(m_k, \sum_{i=1}^{k} \exp(s_i - m_k), \sum_{i=1}^{k} \exp(s_i - m_k)\, v_i)\}_{k=1}^{N}$, that is, $\{(m_k, c_k, a_k)\}_{k=1}^{N}$. Combining the last two values of each output tuple yields $\mathrm{Attention}(q, x_{1:k}) = o_k = a_k / c_k$, resulting in an efficient parallel method for computing attention as a many-to-many RNN (Figure 3).
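
To make the scan concrete, here is a minimal sketch of an associative operator on (m, u, w) triplets, where m is a running maximum of the scores, u a sum of rescaled exponentials, and w the matching weighted sum of values, together with an inclusive prefix scan. The scan is written in the Hillis-Steele style as one common formulation; whether it matches the paper's Algorithm 1 step for step is an assumption. The names `combine`, `inclusive_scan`, and `prefix_attention` are hypothetical, and the scan runs sequentially here: the combines inside one round are independent of each other, which is what a GPU would exploit.

```python
import math


def combine(left, right):
    """Associative operator on (m, u, w) triplets covering adjacent index sets."""
    m_a, u_a, w_a = left
    m_b, u_b, w_b = right
    m = max(m_a, m_b)
    scale_a, scale_b = math.exp(m_a - m), math.exp(m_b - m)
    u = u_a * scale_a + u_b * scale_b
    w = [x * scale_a + y * scale_b for x, y in zip(w_a, w_b)]
    return m, u, w


def inclusive_scan(items, op):
    """Hillis-Steele inclusive scan: about log2(N) rounds of independent combines."""
    result = list(items)
    offset = 1
    while offset < len(result):
        # Every element of the new list depends only on the previous round.
        result = [
            op(result[i - offset], result[i]) if i >= offset else result[i]
            for i in range(len(result))
        ]
        offset *= 2
    return result


def prefix_attention(scores, values):
    """All N prefix outputs o_1..o_N from scores s_i and values v_i."""
    leaves = [(s, 1.0, list(v)) for s, v in zip(scores, values)]  # (s_i, 1, v_i)
    scanned = inclusive_scan(leaves, combine)
    return [[x / u for x in w] for _, u, w in scanned]            # o_k = w / u
```

For example, `prefix_attention([0.3, -1.2, 2.0], [[1.0], [0.0], [2.0]])` returns three outputs, the k-th being the attention output over the first k tokens.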

Aaren: [A]ttention [a]s a [re]current neural [n]etwork

Aaren has the same interface as a Transformer: N inputs are mapped to N outputs, and the i-th output is an aggregation of the 1st through i-th inputs. In addition, Aaren is naturally stackable and can compute a separate loss term for each sequence token. However, unlike Transformers, which use causal self-attention, Aaren uses the above method of computing attention as a many-to-many RNN, making it more efficient. Aaren is defined as follows:

$$h_1^{(0)}, \ldots, h_N^{(0)} \leftarrow x_1, \ldots, x_N$$
$$[h_1^{(j+1)}, \ldots, h_N^{(j+1)}] \leftarrow \mathrm{Aaren}(q^{(j)}, [h_1^{(j)}, \ldots, h_N^{(j)}])$$

Unlike in a Transformer, where the query is one of the input tokens to attention, in Aaren the query token q is learned through backpropagation during training.

The following figure shows an example of a stacked Aaren model whose input context tokens are x_1:3 and whose outputs are y_1:3. Notably, since Aaren uses attention in its RNN form, stacking Aarens is equivalent to stacking RNNs. Aarens can therefore also be updated efficiently with new tokens: the iterative computation of y_k requires only constant computation, since it depends only on h_(k-1) and x_k.
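
A streaming update through a stack of such layers might look like the sketch below. It is a simplified illustration under strong assumptions, not the Aaren module itself: the class name `AarenLayerSketch` is hypothetical, each layer uses identity key and value projections (k = v = input), the query is a fixed list standing in for the learned query token, and any other components a real layer may contain are omitted.

```python
import math


class AarenLayerSketch:
    """One layer that keeps only (a, c, m) between tokens."""

    def __init__(self, query):
        self.q = list(query)            # stands in for the learned query token
        self.a = [0.0] * len(query)     # running weighted sum of values
        self.c = 0.0                    # running normalizer
        self.m = 0.0                    # running maximum, m_0 = 0 as in the article

    def update(self, x):
        """Consume one token and return this layer's output for it."""
        # Assumption: identity projections, so the key and the value are both x.
        s = sum(qi * xi for qi, xi in zip(self.q, x))
        m_new = max(self.m, s)
        scale_old, scale_new = math.exp(self.m - m_new), math.exp(s - m_new)
        self.a = [a * scale_old + xi * scale_new for a, xi in zip(self.a, x)]
        self.c = self.c * scale_old + scale_new
        self.m = m_new
        return [a / self.c for a in self.a]


def stacked_update(layers, x):
    """Pass one new token through every layer; each layer updates its own state."""
    h = x
    for layer in layers:
        h = layer.update(h)
    return h


layers = [AarenLayerSketch([0.3, -0.2]), AarenLayerSketch([1.0, 0.5])]
for token in ([0.1, 0.2], [0.5, -0.3], [-0.4, 0.9]):
    y = stacked_update(layers, token)   # y_k depends only on stored states and x_k
```

Each call touches a fixed amount of state per layer, which is the constant-cost update the paragraph above describes; no earlier tokens are stored.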

Transformer-based models require linear memory (when using KV caching) and must store all previous tokens, including those in intermediate Transformer layers. Aaren-based models, by contrast, require only constant memory and do not need to store all previous tokens, which makes Aaren significantly more computationally efficient than Transformer.

Experiments

The goal of the experiments is to compare Aaren and Transformer in terms of performance and the resources (time and memory) they require. For a comprehensive comparison, the authors evaluate on four problems: reinforcement learning, event prediction, time series prediction, and time series classification.

Reinforcement Learning

The authors first compare the performance of Aaren and Transformer in reinforcement learning. Reinforcement learning is popular in interactive environments such as robotics, recommendation engines, and traffic control.

The results in Table 1 show that Aaren performs comparably to Transformer across all 12 datasets and 4 environments. However, unlike Transformer, Aaren is also an RNN and can therefore handle new environmental interactions efficiently with constant computation, making it more suitable for reinforcement learning.

Event prediction

Next, the authors compare the performance of Aaren and Transformer in event prediction. Event prediction is popular in many real-world settings, such as finance (e.g., transactions), healthcare (e.g., patient observations), and e-commerce (e.g., purchases).

The results in Table 2 show that Aaren performs comparably to Transformer on all datasets. Aaren's ability to process new inputs efficiently is particularly useful in event prediction settings, where events arrive in irregular streams.

Time series prediction

The authors then compare the performance of Aaren and Transformer in time series prediction. Time series prediction models are commonly used in areas related to climate (e.g., weather), energy (e.g., supply and demand), and economics (e.g., stock prices).

The results in Table 3 show that Aaren performs comparably to Transformer on all datasets. However, unlike Transformer, Aaren can process time series data efficiently, making it more suitable for time series-related fields.

Time series classification

Next, the authors compare the performance of Aaren and Transformer in time series classification. Time series classification is common in many important applications, such as pattern recognition (e.g., electrocardiograms), anomaly detection (e.g., bank fraud), and fault prediction (e.g., power grid fluctuations).

As can be seen from Table 4, the performance of Aaren and Transformer is comparable on all data sets.

Analysis

Finally, the authors compare the resources required by Aaren and Transformer.

Memory Complexity: In Figure 5 (left), the authors compare the memory usage of Aaren and Transformer (using KV cache) at inference time. It can be seen that with the use of KV cache technology, the memory usage of Transformer increases linearly. In contrast, Aaren only uses a constant amount of memory regardless of how the number of tokens grows, so it is much more efficient.

Time complexity: In Figure 5 (right), the authors compare the cumulative time Aaren and Transformer (using KV caching) need to process a sequence of tokens one after another. For Transformer, the cumulative computation is quadratic in the number of tokens, that is, O(1 + 2 + ... + N) = O(N^2). In contrast, Aaren's cumulative computation is linear. The measured cumulative times in the figure show the same pattern: the time required by Transformer grows quadratically, while the time required by Aaren grows linearly.
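
The two growth rates can be checked with a small counting sketch. It counts abstract "token interactions" rather than measuring time, and the function names are hypothetical: with a KV cache, the k-th new token attends over k cached tokens, whereas a recurrent update costs a constant amount per token.

```python
def cumulative_steps_kv_cache(n_tokens):
    """Token k attends over k cached tokens, so the total is 1 + 2 + ... + N."""
    return sum(range(1, n_tokens + 1))


def cumulative_steps_recurrent(n_tokens):
    """A constant-cost update per token gives a total of N."""
    return n_tokens


for n in (10, 100, 1000):
    print(n, cumulative_steps_kv_cache(n), cumulative_steps_recurrent(n))
```

The first count equals N(N + 1) / 2, which is where the O(N^2) bound comes from.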

Number of parameters: Because it must learn the initial hidden state q, the Aaren module requires slightly more parameters than the Transformer module. However, since q is just a vector, the difference is small. Measuring comparable models empirically, the authors found that the Transformer used 3,152,384 parameters, while the equivalent Aaren used 3,152,896 parameters, an increase of only about 0.016%, a negligible price for the significant difference in memory and time complexity.
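
The reported percentage follows directly from the two parameter counts, as this short check shows. The variable names are illustrative; how the extra parameters are distributed across layers is not stated in the article.

```python
transformer_params = 3_152_384
aaren_params = 3_152_896

extra = aaren_params - transformer_params          # additional parameters in Aaren
increase_pct = 100 * extra / transformer_params    # relative increase in percent

print(extra, f"{increase_pct:.3f}%")
```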
