Non-Transformer architecture stands up! The first pure attention-free large model, surpassing the open source giant Llama 3.1


Aug 13, 2024, 04:37 PM
Tags: industry, mamba

A large model built on the Mamba architecture has once again challenged the Transformer.

Will a Mamba-architecture model finally "stand up" this time? Since its debut in December 2023, Mamba has emerged as a serious rival to the Transformer.

Models using the Mamba architecture have kept appearing since then, such as Codestral Mamba 7B, the first open-source large model based on the Mamba architecture released by Mistral.

Today, the Technology Innovation Institute (TII) in Abu Dhabi released a new open-source Mamba model: Falcon Mamba 7B.


Let's first summarize the key features of Falcon Mamba 7B: it can process sequences of any length without increasing its memory footprint, and it runs on a single 24GB A10 GPU.

Falcon Mamba 7B is now available to view and use on Hugging Face. This causal decoder-only model uses the novel Mamba State Space Language Model (SSLM) architecture to handle a variety of text-generation tasks.

According to the results, Falcon Mamba 7B outperforms leading models in its size class on several benchmarks, including Meta's Llama 3 8B, Llama 3.1 8B, and Mistral 7B.


Falcon Mamba 7B comes in four variants: the base version, an instruction-tuned version, a 4-bit version, and an instruction-tuned 4-bit version.


As an open-source model, Falcon Mamba 7B is released under the "Falcon License 2.0", an Apache 2.0-based license that supports research and application use.


Hugging Face 주소: https://huggingface.co/tiiuae/falcon-mamba-7b

Falcon Mamba 7B is TII's fourth open-source model, following Falcon 180B, Falcon 40B, and Falcon 2, and the first to use the Mamba SSLM architecture.


The first general-purpose large-scale pure Mamba model

For a long time, Transformer-based models have dominated generative AI. However, researchers have found that the Transformer architecture can run into difficulties when processing long text.

Essentially, the attention mechanism in a Transformer understands context by comparing each word (or token) with every other word in the text, which demands ever more compute and memory as the context window grows.

But if computing resources are not scaled up accordingly, model inference slows down, and text beyond a certain length cannot be processed at all. To overcome these obstacles, the State Space Language Model (SSLM) architecture, which works by continuously updating a state as it processes words, has emerged as a promising alternative, and many institutions, including TII, are deploying it.
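To see concretely why attention memory is a problem, consider the score matrix alone: it compares every token with every other token, so its size grows quadratically with sequence length. A back-of-the-envelope sketch (illustrative numbers only, not Falcon's actual memory profile):

```python
import numpy as np

def attention_score_bytes(seq_len: int, dtype=np.float16) -> int:
    """Memory for one head's L x L attention score matrix."""
    return seq_len * seq_len * np.dtype(dtype).itemsize

# Doubling the context quadruples the score-matrix memory.
for L in (2048, 8192, 32768):
    print(f"{L:>6} tokens -> {attention_score_bytes(L) / 2**20:6.0f} MiB per head")
```

At 32k tokens a single head's fp16 score matrix already occupies 2 GiB, which is why long contexts force either more hardware or a different architecture.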

Falcon Mamba 7B uses the Mamba SSM architecture originally proposed in a December 2023 paper by researchers at Carnegie Mellon University and Princeton University.

The architecture uses a selection mechanism that allows the model to dynamically adjust its parameters based on the input. In this way, the model can focus on or ignore specific inputs, similar to how the attention mechanism works in Transformer, while providing the ability to process long sequences of text (such as entire books) without requiring additional memory or computing resources.
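A heavily simplified sketch of such a selective state-space recurrence, in the spirit of the Mamba paper (an illustration, not TII's implementation; all weights here are random placeholders). The point is that the step size and the B/C projections depend on the current input, while the state `h` stays a fixed size no matter how many tokens are processed:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state, seq_len = 4, 8, 64

A = -np.exp(rng.standard_normal((d_model, d_state)))  # negative -> stable decay
W_dt = rng.standard_normal(d_model)                   # input -> step size
W_B = rng.standard_normal((d_state, d_model))         # input -> B_t
W_C = rng.standard_normal((d_state, d_model))         # input -> C_t

def step(h, x):
    """One recurrence step; dt, B_t, C_t all depend on the input x (selection)."""
    dt = np.log1p(np.exp(W_dt * x))                   # softplus keeps dt positive
    A_bar = np.exp(dt[:, None] * A)                   # discretized decay in (0, 1)
    B_t, C_t = W_B @ x, W_C @ x
    h = A_bar * h + dt[:, None] * np.outer(x, B_t)    # state update
    return h, h @ C_t                                 # per-channel output

h = np.zeros((d_model, d_state))
for x in rng.standard_normal((seq_len, d_model)):
    h, y = step(h, x)

# The state size is fixed regardless of how many tokens were processed.
assert h.shape == (d_model, d_state)
```

Because each token only updates this fixed-size state, memory and per-token compute stay constant however long the sequence grows, in contrast to attention's quadratic comparison.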

TII noted that this approach makes the model suitable for tasks such as enterprise-level machine translation, text summarization, computer vision and audio processing, and estimation and forecasting.

Training Data

Falcon Mamba 7B was trained on up to 5500 GT (gigatokens) of data, composed mainly of the RefinedWeb dataset supplemented with high-quality technical, code, and mathematical data from public sources. All data was tokenized with the Falcon-7B/11B tokenizer.

Like other Falcon-series models, Falcon Mamba 7B was trained with a multi-stage strategy in which the context length was increased from 2048 to 8192. In addition, inspired by curriculum learning, TII carefully selected the data mixture throughout training, taking into account the diversity and complexity of the data.

In the final training stage, TII used a small set of high-quality curated data (i.e., samples from Fineweb-edu) to further improve performance.

Training Process and Hyperparameters

Most of Falcon Mamba 7B's training was completed on 256 H100 80GB GPUs, using a 3D-parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO. The figure below shows the model's hyperparameters, including precision, optimizer, maximum learning rate, weight decay, and batch size.


Specifically, Falcon Mamba 7B was trained with the AdamW optimizer and a WSD (warmup-stabilize-decay) learning-rate schedule, and over the first 50 GT of training the batch size was increased from b_min=128 to b_max=2048.

In the stable phase, TII uses a maximum learning rate η_max = 6.4×10^−4, then decays it to a minimum value following an exponential schedule over 500 GT. At the same time, TII applies batch scaling during the ramp-up phase, rescaling the learning rate η so that the Adam noise temperature remains constant.
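A sketch of such a WSD schedule: only η_max = 6.4×10^−4 and the 500 GT exponential-decay window come from the article; the linear warmup, the plateau length, and η_min are illustrative placeholders.

```python
def wsd_lr(t, warmup=50.0, stable=4950.0, decay=500.0,
           eta_max=6.4e-4, eta_min=6.4e-6):
    """Warmup-stabilize-decay learning rate; t and phase lengths in gigatokens.
    warmup/stable/eta_min are assumed values, not taken from the article."""
    if t < warmup:                                   # linear warmup
        return eta_max * t / warmup
    if t < warmup + stable:                          # constant plateau
        return eta_max
    frac = min((t - warmup - stable) / decay, 1.0)   # exponential decay phase
    return eta_max * (eta_min / eta_max) ** frac

for t in (0.0, 25.0, 2000.0, 5250.0, 5500.0):
    print(f"{t:6.0f} GT -> lr = {wsd_lr(t):.2e}")
```

With these placeholder phase lengths the plateau ends at 5000 GT and the decay finishes at 5500 GT, matching the total training budget; the real phase boundaries used by TII are not stated in the article.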

The entire model training took about two months.

Model Evaluation

To understand how Falcon Mamba 7B compares with leading Transformer models in its size class, the study ran a test to determine the maximum context length the model can handle on a single 24GB A10 GPU.

The results show that Falcon Mamba can fit longer sequences than current Transformer models and can, in theory, accommodate unlimited context lengths.


Next, the researchers measured generation throughput with a batch size of 1 on an H100 GPU. As the figure below shows, Falcon Mamba generates all tokens at constant throughput with no increase in CUDA peak memory; for Transformer models, peak memory grows and generation slows as the number of generated tokens increases.


Even on standard industry benchmarks, the new model performs better than or close to popular Transformer models as well as pure and hybrid state-space models.

For example, on the ARC, TruthfulQA, and GSM8K benchmarks, Falcon Mamba 7B scored 62.03%, 53.42%, and 52.54% respectively, surpassing Llama 3 8B, Llama 3.1 8B, Gemma 7B, and Mistral 7B. However, Falcon Mamba 7B lags far behind these models on the MMLU and HellaSwag benchmarks.


TII principal investigator Hakim Hacid said in a statement that the launch of Falcon Mamba 7B represents a major step forward for the institute, inspiring new perspectives and furthering the exploration of intelligent systems. At TII, they are pushing the boundaries of both SSLM and Transformer models to spark further innovation in generative AI.

Currently, TII's Falcon family of language models has been downloaded more than 45 million times, making it one of the most successful LLM releases from the UAE.

The Falcon Mamba 7B paper will be released soon.

Reference link:
https://huggingface.co/blog/falconmamba
https://venturebeat.com/ai/falcon-mamba-7bs-powerful-new-ai-architecture-offers-alternative-to-transformer-models/

