
RNN efficiency comparable to Transformer: Google releases two new architectures, stronger than Mamba at the same scale

王林
Release: 2024-08-05 14:20:15

In December last year, the new Mamba architecture took the AI community by storm, mounting a challenge to the long-dominant Transformer. Now, the launch of Google DeepMind's "Hawk" and "Griffin" gives the AI community new options.


This time, Google DeepMind has made a new move in foundation models.

We know that recurrent neural networks (RNNs) played a central role in the early days of deep learning and natural language processing research and achieved practical success in many applications, including Google's first end-to-end machine translation system. In recent years, however, deep learning and NLP have been dominated by the Transformer architecture, which combines multi-layer perceptrons (MLPs) with multi-head attention (MHA).

Transformer has achieved better performance than RNN in practice and is also very efficient in leveraging modern hardware. Transformer-based large language models are trained on massive datasets collected from the web with remarkable success.

Despite its great success, the Transformer architecture still has shortcomings. For example, due to the quadratic complexity of global attention, Transformers are difficult to scale efficiently to long sequences. In addition, the key-value (KV) cache grows linearly with sequence length, slowing Transformers down during inference. Recurrent language models offer an alternative: they compress the entire sequence into a fixed-size hidden state that is updated iteratively. But to replace the Transformer, a new RNN model must not only show comparable performance at scale, it must also achieve similar hardware efficiency.
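To make the inference-memory contrast concrete, here is a rough back-of-the-envelope comparison of a Transformer's KV cache against a fixed-size recurrent state. All hyperparameters below are illustrative assumptions, not the paper's configurations:

```python
# Decode-time state size: Transformer KV cache vs. fixed-size recurrent state.
# Hyperparameter values are illustrative only.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Keys AND values are cached for every past token in every layer,
    # so the cache grows linearly with sequence length.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

def recurrent_state_bytes(n_layers=32, state_dim=4096, dtype_bytes=2):
    # A recurrent model keeps one fixed-size hidden state per layer,
    # independent of how many tokens have been processed.
    return n_layers * state_dim * dtype_bytes

for t in (1_024, 8_192, 65_536):
    print(t, kv_cache_bytes(t), recurrent_state_bytes())
```

Doubling the sequence length doubles the KV cache, while the recurrent state stays constant — this is the asymmetry the paper exploits at decode time.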

In a recent paper, Google DeepMind researchers propose the RG-LRU layer, a novel gated linear recurrent layer, and design a new recurrent block around it to replace multi-query attention (MQA).

They used this recurrent block to build two new models: Hawk, which mixes MLPs with recurrent blocks, and Griffin, which mixes MLPs with recurrent blocks and local attention.


  • Paper title: Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  • Paper link: https://arxiv.org/pdf/2402.19427.pdf

The researchers report that Hawk and Griffin exhibit power-law scaling between held-out loss and training FLOPs up to and including 7B parameters, as previously observed for Transformers. Griffin achieves slightly lower held-out loss than the strong Transformer baseline at all model sizes.


The researchers overtrained Hawk and Griffin on 300B tokens across a range of model sizes. The results show that Hawk-3B outperforms Mamba-3B on downstream tasks despite being trained on only half as many tokens. Griffin-7B and Griffin-14B perform comparably to Llama-2 despite being trained on only about 1/7 as many tokens.

In addition, Hawk and Griffin achieve training efficiency comparable to Transformers on TPU-v3. Since the diagonal RNN layer is memory-bound, the researchers achieved this with a custom kernel for the RG-LRU layer.

During inference as well, both Hawk and Griffin achieve higher throughput than the MQA Transformer and lower latency when sampling long sequences. Griffin performs better than Transformers when evaluated on sequences longer than those seen in training, and can efficiently learn copying and retrieval tasks from training data. However, when pretrained models are evaluated on copying and exact-retrieval tasks without fine-tuning, Hawk and Griffin perform worse than Transformers.

Co-author and DeepMind research scientist Aleksandar Botev said that Griffin, which mixes gated linear recurrences with local attention, retains all the efficiency advantages of RNNs and the expressive power of Transformers, and can scale up to 14B parameters.

Source: https://twitter.com/botev_mg/status/1763489634082795780

Griffin model architecture

All Griffin models contain the following components: (i) a residual block, (ii) an MLP block, and (iii) a temporal-mixing block. Components (i) and (ii) are the same across all models, but there are three kinds of temporal-mixing block: global multi-query attention (MQA), local (sliding-window) MQA, and the recurrent block proposed in this paper. As part of the recurrent block, the researchers use the Real-Gated Linear Recurrent Unit (RG-LRU), a novel recurrent layer inspired by the Linear Recurrent Unit.
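A minimal sketch of a gated linear recurrence in the spirit of RG-LRU follows. The gate forms, the decay parameterization, and the constant `c = 8` reflect the paper's description as best as it can be reconstructed here, so treat the exact details as assumptions rather than the reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rg_lru_sketch(x, W_a, W_x, Lam, c=8.0):
    """Minimal single-head RG-LRU-style recurrence over a sequence.

    x: (T, D) inputs. A recurrence gate r_t and input gate i_t are computed
    from x_t; the per-channel decay is a_t = a**(c * r_t) with a = sigmoid(Lam),
    and the state updates as h_t = a_t * h_{t-1} + sqrt(1 - a_t**2) * (i_t * x_t).
    """
    T, D = x.shape
    a = sigmoid(Lam)                # per-channel base decay, strictly in (0, 1)
    h = np.zeros(D)
    out = np.empty_like(x)
    for t in range(T):
        r = sigmoid(x[t] @ W_a)     # recurrence gate
        i = sigmoid(x[t] @ W_x)     # input gate
        a_t = a ** (c * r)          # gated decay, still in (0, 1)
        h = a_t * h + np.sqrt(1.0 - a_t ** 2) * (i * x[t])
        out[t] = h
    return out

rng = np.random.default_rng(0)
T, D = 16, 8
y = rg_lru_sketch(rng.normal(size=(T, D)),
                  rng.normal(size=(D, D)) * 0.1,
                  rng.normal(size=(D, D)) * 0.1,
                  rng.normal(size=D))
print(y.shape)  # (16, 8)
```

Because the recurrence is elementwise (diagonal), each step is cheap and the state is a single `(D,)` vector, which is the source of the memory-bound behavior the paper addresses with a custom kernel.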

As shown in Figure 2(a), the residual block defines the global structure of the Griffin models and is inspired by pre-norm Transformers. After embedding the input sequence, it is passed through N such blocks (where N denotes the model depth), and RMSNorm is then applied to produce the final activations. To compute token probabilities, a final linear layer is applied, followed by a softmax. The weights of this layer are shared with the input embedding layer.
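The global structure just described (embed, then a stack of residual blocks, then RMSNorm, then a linear head tied to the embeddings, then softmax) can be sketched as follows, with the time-mixing/MLP internals stubbed out as opaque callables:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Normalize by the root-mean-square over the feature dimension.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gamma

def forward_sketch(token_ids, embed, blocks, gamma):
    """embed: (V, D) embedding table, reused (transposed) as the output head.
    blocks: list of callables (T, D) -> (T, D) standing in for the residual
    blocks (time-mixing + MLP) whose internals are defined elsewhere."""
    h = embed[token_ids]                         # (T, D)
    for block in blocks:
        h = h + block(h)                         # residual connection
    h = rms_norm(h, gamma)                       # final RMSNorm
    logits = h @ embed.T                         # weights tied to input embedding
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    return probs / probs.sum(-1, keepdims=True)  # softmax over the vocabulary

V, D = 100, 16
rng = np.random.default_rng(0)
embed = rng.normal(size=(V, D)) * 0.02
probs = forward_sketch(np.array([1, 2, 3]), embed,
                       [lambda h: 0.0 * h], np.ones(D))
print(probs.shape)  # (3, 100); each row sums to 1
```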


Recurrent models: scaling as efficiently as Transformers

Scaling studies provide important insight into how to tune a model's hyperparameters and how the model behaves as it scales.

The researchers define the models evaluated in this study, provide scaling curves up to and beyond 7B parameters, and evaluate the models' performance on downstream tasks.

They consider three model families: (1) an MQA Transformer baseline; (2) Hawk, a pure RNN model; and (3) Griffin, a hybrid model that mixes recurrent blocks with local attention. Appendix C defines the key model hyperparameters for models of various sizes.

The Hawk architecture uses the same residual pattern and MLP block as the Transformer baseline, but the researchers use a recurrent block with the RG-LRU layer as the temporal-mixing block instead of MQA. They expand the width of the recurrent block by a factor of roughly 4/3 (i.e., d_RNN ≈ 4d/3) so that, when both use the same model dimension d, its parameter count roughly matches that of an MHA block.
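A quick sanity check on the 4/3 width factor, under the simplifying assumptions that an MHA block carries 4·d² parameters (Q, K, V, and output projections) and that the recurrent block's parameters scale like three d × d_RNN projections. Those assumptions are illustrative; the actual recurrent block has additional small parameter groups:

```python
# Illustrative parameter-matching arithmetic for d_RNN = 4d/3.
d = 1536                      # model dimension (divisible by 3)
mha_params = 4 * d * d        # assumed: four d x d projections (Q, K, V, out)
d_rnn = 4 * d // 3            # widened recurrent dimension
recurrent_params = 3 * d * d_rnn  # assumed: three d x d_rnn projections
print(mha_params, recurrent_params)  # equal when d is divisible by 3
```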

Griffin. The main advantage of recurrent blocks over global attention is that they summarize the sequence with a fixed-size state, whereas MQA's KV cache grows in proportion to sequence length. Local attention has the same property, and mixing recurrent blocks with local attention preserves this advantage. The researchers found this combination extremely effective, since local attention accurately models the recent past while the recurrent layers carry information across long sequences.

Griffin uses the same residual pattern and MLP block as the Transformer baseline. But unlike the MQA Transformer baseline and the Hawk model, Griffin mixes recurrent blocks with MQA blocks. Specifically, the researchers adopt a layered structure that alternates two residual blocks containing a recurrent block with one containing a local (MQA) attention block. Unless otherwise stated, the local attention window size is fixed at 1024 tokens.
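The repeating layer pattern described above (two recurrent residual blocks, then one local-attention block) can be written down directly:

```python
def griffin_layer_pattern(depth):
    """Return the time-mixing type of each of `depth` residual blocks:
    two recurrent blocks followed by one local (sliding-window) MQA block,
    repeating through the stack."""
    pattern = ["recurrent", "recurrent", "local_mqa"]
    return [pattern[i % 3] for i in range(depth)]

print(griffin_layer_pattern(6))
# ['recurrent', 'recurrent', 'local_mqa', 'recurrent', 'recurrent', 'local_mqa']
```

With this ratio, two thirds of the temporal-mixing layers keep a fixed-size state, and the attention layers that remain only ever look at a bounded 1024-token window, so no component's memory grows with the full sequence length.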

The main scaling results are shown in Figure 1(a). All three model families were trained at sizes ranging from 100 million to 7 billion parameters, and Griffin additionally has a 14-billion-parameter version.

Evaluation results on downstream tasks are shown in Table 1:


Both Hawk and Griffin perform very well. The table above reports character-normalized accuracy on MMLU, HellaSwag, PIQA, ARC-E, and ARC-C, along with absolute accuracy and partial scores on WinoGrande. Hawk's performance improves significantly with model size, and Hawk-3B outperforms Mamba-3B on downstream tasks despite being trained on only half as many tokens. Griffin-3B clearly outperforms Mamba-3B, and Griffin-7B and Griffin-14B are competitive with Llama-2 despite being trained on nearly 7 times fewer tokens. Hawk matches the MQA Transformer baseline, while Griffin surpasses it.

Training recurrent models efficiently on device

The researchers encountered two major engineering challenges while developing and scaling the models. First, how to shard the models efficiently across multiple devices. Second, how to implement linear recurrences efficiently to maximize training efficiency on TPUs. The paper discusses both challenges and then provides an empirical comparison of the training speed of Griffin and the MQA baseline.

The researchers compared training speed across different model sizes and sequence lengths to study the computational advantages of the proposed models during training. For each model size, the total number of tokens per batch is kept fixed, which means that as the sequence length increases, the number of sequences decreases proportionally.
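Holding the token budget per batch fixed makes the trade-off mechanical: doubling the sequence length halves the number of sequences. The token budget below is an illustrative value, not the paper's:

```python
def batch_size_for(tokens_per_batch, seq_len):
    # With a fixed token budget per batch, the number of sequences
    # shrinks in inverse proportion to the sequence length.
    assert tokens_per_batch % seq_len == 0
    return tokens_per_batch // seq_len

tokens_per_batch = 2 ** 21  # illustrative fixed budget
for seq_len in (1024, 2048, 4096, 8192):
    print(seq_len, batch_size_for(tokens_per_batch, seq_len))
```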

Figure 3 plots the relative runtime of the Griffin models versus the MQA baseline at a sequence length of 2048.


Inference speed

LLM inference consists of two stages. The "prefill" stage receives and processes the prompt; this step is effectively a forward pass through the model. Since the prompt can be processed in parallel across the whole sequence, most model operations in this stage are compute-bound. The researchers therefore expect the relative speeds of Transformer and recurrent models during prefill to be similar to their relative speeds during training, discussed above.

Prefill is followed by the decode stage, in which tokens are sampled autoregressively from the model. As shown below, especially at long sequence lengths, the key-value (KV) cache used in attention becomes large, and recurrent models achieve lower latency and higher throughput in the decode stage.

There are two main metrics to consider when evaluating inference speed. The first is latency, which measures the time needed to generate a specified number of tokens at a particular batch size. The second is throughput, which measures the maximum number of tokens per second that can be generated when sampling a specified number of tokens on a single device. Since throughput equals the number of tokens sampled times the batch size divided by latency, throughput can be increased either by reducing latency or by reducing memory use so that a larger batch fits on the device. Latency matters for real-time applications that need fast response times. Throughput is also worth considering, because it tells us the maximum number of tokens that can be sampled from a given model in a given time. This property is attractive for other language applications such as reinforcement learning from human feedback (RLHF) or scoring language-model outputs (as done in AlphaCode), where being able to output many tokens in a given time is appealing.
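The relationship between the two metrics stated above can be made explicit; the numbers are illustrative:

```python
def throughput(tokens_sampled, batch_size, latency_s):
    # Throughput (tokens/second) = tokens sampled per sequence * batch size
    # / wall-clock latency for the whole batch.
    return tokens_sampled * batch_size / latency_s

# Two routes to higher throughput (illustrative numbers):
base = throughput(512, 16, 4.0)    # baseline
faster = throughput(512, 16, 2.0)  # halve latency  -> 2x throughput
bigger = throughput(512, 64, 4.0)  # smaller state allows 4x batch -> 4x throughput
print(base, faster, bigger)
```

This is why a recurrent model's small fixed state helps twice at decode time: the per-step work is cheaper (lower latency), and the memory freed by not storing a KV cache can be spent on a larger batch (higher throughput).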

Here, the researchers study inference results for models with 1B parameters. As a baseline, they compare against an MQA Transformer, which is significantly faster at inference than the standard MHA Transformer commonly used in the literature. The models compared are: (i) the MQA Transformer, (ii) Hawk, and (iii) Griffin. For each model, they report latency and throughput.

As shown in Figure 4, the researchers compared the latency of the models at a batch size of 16, with either an empty prefill or a prefill of 4096 tokens.


Figure 1(b) compares the maximum throughput (tokens/second) of the same models when sampling 512, 1024, 2048, and 4096 tokens after an empty prompt.

Long-context modeling

The paper also explores how effectively Hawk and Griffin use longer contexts to improve next-token prediction, and studies their extrapolation ability during inference. It further examines Griffin's performance on tasks requiring copying and retrieval capabilities, both for models trained on such tasks and for pretrained language models tested on these capabilities.

From the plot on the left of Figure 5, it can be observed that, up to some maximum length, both Hawk and Griffin improve next-token prediction with longer contexts, and that overall they can extrapolate to sequences substantially longer (at least 4x) than those seen in training. Griffin in particular extrapolates remarkably well, even when its local attention layers use RoPE.


As shown in Figure 6, on the selective copying task all three models complete the task perfectly. Comparing learning speed on this task, Hawk is significantly slower than the Transformer, similar to the observation of Jelassi et al. (2024), who found that Mamba learns notably more slowly on a similar task. Interestingly, although Griffin uses only a single local attention layer, its learning speed is barely slowed and is on par with the Transformer's.


For more details, please read the original paper.

Source: jiqizhixin.com