
Algorithms, Systems, and Applications: A Comprehensive Understanding of Mixture of Experts (MoE) from Three Perspectives

Large language models (LLMs) are powerful, but scaling them sustainably requires finding and deploying methods that improve their efficiency. Mixture of experts (MoE) is an important member of this family of methods.

Recently, many of the new-generation large models released by technology companies have adopted the mixture-of-experts (MoE) approach.

The concept of mixture of experts first appeared in the 1991 paper "Adaptive Mixtures of Local Experts" and has been extensively explored and developed for more than 30 years. In recent years, with the emergence and development of sparsely gated MoE, especially in combination with Transformer-based large language models, this thirty-plus-year-old technique has taken on new life.

The MoE framework is based on a simple yet powerful idea: different parts of the model (called experts) focus on different tasks or different aspects of the data.

Under this paradigm, only the experts relevant to a given input participate in processing it, which keeps the computational cost under control while still benefiting from a large amount of specialized knowledge. MoE can therefore improve the capabilities of large language models without significantly increasing computational requirements.

As shown in Figure 1, MoE-related research has grown rapidly, especially after the emergence in 2024 of Mixtral-8x7B and various industrial-scale LLMs such as Grok-1, DBRX, Arctic, and DeepSeek-V2.

[Figure 1: Growth of MoE-related research publications over time]

The figure comes from a recent MoE survey released by a research team at the Hong Kong University of Science and Technology (Guangzhou). The survey clearly and comprehensively summarizes MoE-related research and proposes a new taxonomy that groups these studies into three categories: algorithms, systems, and applications.


  • Paper title: A Survey on Mixture of Experts

  • Paper address: https://arxiv.org/pdf/2407.06204

This article summarizes the main content of the survey to help readers understand the current state of MoE development; please read the original paper for more details. In addition, some MoE-related reports are compiled at the end of the article.

Background on mixture of experts

In a Transformer-based large language model (LLM), each mixture-of-experts (MoE) layer usually consists of a set of N "expert networks" {f_1, ..., f_N} paired with a "gating network" G.

The gating network usually takes the form of a linear layer with a softmax activation, and its role is to route each input to the appropriate expert networks. The MoE layer is placed inside the Transformer block in place of the feed-forward network (FFN), usually after the self-attention (SA) sub-layer. This placement is critical because the computational cost of the FFN grows as the model grows. For example, in the 540-billion-parameter PaLM model, 90% of the parameters reside in its FFN layers.

Described in mathematical form: each expert network f_i (usually a linear-ReLU-linear network) is parameterized by W_i; it receives the same input x and produces an output f_i(x; W_i). At the same time, a gating network G with parameters Θ (usually composed of a linear-ReLU-linear-softmax network) produces the output G(x; Θ). Depending on how the gating function is designed, MoE layers can be roughly divided into the following two categories.

[Figure 2: Structure of (a) a dense MoE layer and (b) a sparse MoE layer]

Dense MoE

A dense mixture-of-experts layer activates all expert networks {f_1, ..., f_N} during each iteration. This strategy was commonly adopted in early MoE research. More recently, some studies have continued to use dense MoE, such as EvoMoE, MoLE, LoRAMoE, and DS-MoE. Figure 2a shows the structure of a dense MoE layer, whose output can be expressed as:

$$F^{\text{dense}}(x;\Theta,\{W_i\}_{i=1}^{N}) \;=\; \sum_{i=1}^{N} G(x;\Theta)_i \, f_i(x;W_i), \qquad G(x;\Theta) = \operatorname{softmax}\big(g(x;\Theta)\big)$$

where g(x; Θ) is the gate value prior to the softmax operation.
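To make the formula above concrete, here is a minimal PyTorch sketch of a dense MoE layer. This is an illustration only, not the survey's code; the class names, expert width, and expert count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A linear-ReLU-linear expert network f_i(x; W_i)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)

class DenseMoE(nn.Module):
    """Dense MoE layer: every expert is active for every token."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [Expert(d_model, d_hidden) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(d_model, num_experts)  # g(x; Θ)

    def forward(self, x):                                 # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)          # G(x; Θ): (tokens, N)
        outputs = torch.stack([e(x) for e in self.experts], dim=-1)  # (tokens, d_model, N)
        return torch.einsum("tdn,tn->td", outputs, scores)

# usage: y = DenseMoE(d_model=512, d_hidden=2048, num_experts=4)(torch.randn(10, 512))
```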

Sparse MoE

Although dense mixture of experts generally achieves higher prediction accuracy, it also incurs a very high computational load.

To address this problem, Shazeer et al.'s paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" introduced a sparsely gated MoE layer, which activates only a selected subset of experts in each forward pass. This strategy achieves sparsity by computing a weighted sum of the outputs of the top-k experts rather than aggregating the outputs of all experts. Figure 2b shows the structure of such a sparse MoE layer.

According to the framework proposed in the above paper, Equation 2.2 can be modified to reflect the sparse gating mechanism:

$$F^{\text{sparse}}(x;\Theta,\{W_i\}_{i=1}^{N}) \;=\; \sum_{i=1}^{N} G(x;\Theta)_i \, f_i(x;W_i), \qquad G(x;\Theta) = \operatorname{softmax}\big(\operatorname{TopK}\big(g(x;\Theta) + R_{\text{noise}},\, k\big)\big)$$

Explanation: the TopK(·, k) function keeps the top k entries of the vector at their original values while setting all remaining entries to −∞. After the subsequent softmax operation, the −∞ entries become approximately zero. The hyperparameter k is chosen according to the specific application; common choices are k = 1 or k = 2. Adding the noise term R_noise is a common strategy for training sparsely gated MoE layers, as it promotes exploration among experts and improves training stability.
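As a hedged illustration of this TopK-then-softmax mechanism, the following PyTorch sketch implements the selection and noise injection described above; the function name, noise scale, and default k are assumptions, not values from the survey.

```python
import torch
import torch.nn.functional as F

def sparse_gate(logits: torch.Tensor, k: int = 2, noise_std: float = 1.0,
                training: bool = True) -> torch.Tensor:
    """Sparse gating: keep the top-k gate logits, set the rest to -inf, then softmax.

    logits: (tokens, num_experts) output of the gating network g(x; Θ).
    Returns gate weights of the same shape with at most k non-zero entries per token.
    """
    if training and noise_std > 0:
        # R_noise: Gaussian noise encourages exploration across experts
        logits = logits + noise_std * torch.randn_like(logits)
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf"))     # TopK(·, k): non-selected -> -inf
    masked.scatter_(-1, topk_idx, topk_vals)
    return F.softmax(masked, dim=-1)                    # -inf entries become ~0

# usage: weights = sparse_gate(torch.randn(8, 16), k=2)
```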

Although sparse gating G(x; Θ) can significantly expand the parameter space of the model without a corresponding increase in computational cost, it can also lead to a load-balancing problem: the load is distributed unevenly among experts, with some experts used frequently and others used rarely or not at all.

To address this problem, each MoE layer integrates an auxiliary loss function that encourages each batch of tokens to be distributed evenly across the experts. In mathematical form, first define a query batch B = {x_1, x_2, ..., x_T} containing T tokens, and N experts. The auxiliary load-balancing loss is then defined as:

$$\mathcal{L}_{\text{load-balancing}} \;=\; N \sum_{i=1}^{N} D_i \, P_i, \qquad D_i = \frac{1}{T}\sum_{t=1}^{T} \mathbb{1}\{\operatorname{argmax}\, G(x_t;\Theta) = i\}, \qquad P_i = \frac{1}{T}\sum_{t=1}^{T} G(x_t;\Theta)_i$$

where D_i is the fraction of tokens assigned to expert i, and P_i is the fraction of gating probability allocated to expert i. To ensure that the batch is distributed evenly among the N experts, the load-balancing loss L_{load-balancing} should be minimized. The optimum is reached when each expert receives an equal share of tokens, D_i = 1/N, and an equal share of gating probability, P_i = 1/N:

$$\mathcal{L}_{\text{load-balancing}} \;=\; N \sum_{i=1}^{N} \frac{1}{N}\cdot\frac{1}{N} \;=\; 1$$

At this point, the load across the experts is balanced.
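Given the definitions of D_i and P_i above, a minimal sketch of this auxiliary loss might look as follows. It assumes top-1 routing for the dispatch counts; tensor shapes and names are illustrative assumptions.

```python
import torch

def load_balancing_loss(gate_probs: torch.Tensor, expert_index: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss encouraging an even split of a batch of T tokens over N experts.

    gate_probs:   (T, N) softmax gating probabilities G(x; Θ).
    expert_index: (T,)   expert each token was dispatched to (e.g. the top-1 choice).
    Returns N * sum_i D_i * P_i, minimized (value 1) when D_i = P_i = 1/N.
    """
    T, N = gate_probs.shape
    # D_i: fraction of tokens dispatched to expert i
    D = torch.bincount(expert_index, minlength=N).float() / T
    # P_i: fraction of gating probability assigned to expert i
    P = gate_probs.mean(dim=0)
    return N * torch.sum(D * P)

# usage:
# probs = torch.softmax(torch.randn(32, 8), dim=-1)
# loss = load_balancing_loss(probs, probs.argmax(dim=-1))
```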

In the following, unless otherwise explicitly stated, the term "MoE" refers only to "sparse MoE".

A taxonomy of mixture of experts

To help researchers navigate the large body of LLM research that uses MoE, the team developed a taxonomy that classifies these models along three dimensions: algorithm design, system design, and applications.

Figure 3 shows this classification method and some representative research results.

[Figure 3: Taxonomy of MoE research: algorithms, systems, and applications, with representative works]

The following will provide a comprehensive and in-depth introduction to each category.

Algorithm design for mixture of experts

Gating function

The gating function (also known as the routing function or router) is the fundamental component of all MoE architectures. Its role is to coordinate the use of expert computations and to combine the experts' outputs.

Based on how each input is processed, gating can be divided into three types: sparse, dense, and soft. Sparse gating activates a subset of the experts, dense gating activates all experts, and soft gating covers fully differentiable methods, including input-token fusion and expert fusion. Figure 4 illustrates the various gating functions used in MoE models. A sparse gating function activates only selected experts for each input token, which can be regarded as a form of conditional computation.

Gating functions can implement various forms of gating decisions, such as binary decisions, sparse or continuous decisions, and stochastic or deterministic decisions; they have been studied in depth and can be trained with various forms of reinforcement learning and back-propagation.

[Figure 4: Gating functions used in MoE models]

Shazeer et al.'s study "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" pioneered a differentiable heuristic using auxiliary load-balancing losses, in which the outputs of the expert computations are weighted by their gate values. This introduces differentiability into the gating process, so that optimization of the gating function can be guided by gradients.
This paradigm later became dominant in MoE research. Because the method selects experts for each input token, it can be thought of as a token-selective gating function.

The following are the main points of this section; see the original paper for details:

  • Token-selective gating

  • Auxiliary loss for token-selective gating

  • Expert capacity for token-selective gating

  • Other advances in token-selective gating

  • Non-trainable token-selective gating

  • Expert-selective gating
  • Dense

Dense MoE means that all experts are activated when processing each input.

Although sparse MoE has the advantage in efficiency, the dense MoE direction continues to see innovation. In particular, dense activation performs well in LoRA-MoE fine-tuning, where the computational overhead of the LoRA experts is relatively low. This approach enables efficient and flexible integration of multiple LoRAs for various downstream tasks; it preserves the generative capabilities of the original pre-trained model while retaining the unique characteristics of each LoRA for its task.

  • Soft

For sparse MoE, a fundamental discrete optimization problem is deciding which experts to assign to each token. Ensuring balanced expert participation and minimizing unassigned tokens usually requires heuristic auxiliary losses. The problem is especially pronounced in scenarios involving out-of-distribution data (such as small inference batches, novel inputs, or transfer learning).

Similar to dense MoE, soft MoE methods also use all experts when processing each input, thereby maintaining full differentiability and thus avoiding the inherent problems of discrete expert selection methods. The difference between soft MoE and dense MoE is that the former alleviates computational requirements through gated and weighted fusion of input tokens or experts.
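As a hedged sketch of the token-fusion flavor of soft MoE: every expert processes a few "slots", each slot being a learned weighted mixture of the input tokens, and the slot outputs are mixed back per token. The slot count, dimensions, and class names below are assumptions for illustration, not the survey's reference design.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Soft MoE sketch via token fusion: tokens are softly dispatched to slots,
    each expert processes its slots, and slot outputs are softly combined."""
    def __init__(self, d_model: int, num_experts: int, slots_per_expert: int = 1):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        self.slots_per_expert = slots_per_expert
        self.num_slots = num_experts * slots_per_expert
        self.phi = nn.Parameter(torch.randn(d_model, self.num_slots) * d_model ** -0.5)

    def forward(self, x):                      # x: (T, d_model)
        logits = x @ self.phi                  # (T, num_slots)
        dispatch = logits.softmax(dim=0)       # mix tokens into each slot
        combine = logits.softmax(dim=1)        # mix slot outputs back per token
        slots = dispatch.t() @ x               # (num_slots, d_model)
        outs = []
        for i, expert in enumerate(self.experts):
            s = slots[i * self.slots_per_expert:(i + 1) * self.slots_per_expert]
            outs.append(expert(s))
        slot_out = torch.cat(outs, dim=0)      # (num_slots, d_model)
        return combine @ slot_out              # (T, d_model)

# usage: y = SoftMoE(d_model=256, num_experts=4)(torch.randn(16, 256))
```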

Experts

This section will introduce the architecture of the expert network within the MoE framework and discuss the gating function that coordinates the activation of these experts.

  • Network Types

Since MoE was integrated into the Transformer architecture, it typically replaces the feed-forward network (FFN) module in these models, with each expert in the MoE layer replicating the architecture of the FFN it replaces.

This paradigm of using FFN as an expert is still mainstream, but people have also made many improvements to it.

  • Hyperparameters

The scale of the sparse MoE model is controlled by several key hyperparameters, including:

  • Number of experts per MoE layer

  • Size of each expert

  • How frequently MoE layers are placed throughout the model

The choice of these hyperparameters is crucial as it profoundly affects the performance and computational efficiency of the model in various tasks. Therefore, the optimal hyperparameters are selected based on the specific application requirements and computing infrastructure. Table 2 shows some configurations of models using MoE.

[Table 2: Configurations of representative MoE models]
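To make these hyperparameter knobs concrete, a configuration object for a sparse MoE Transformer might look like the sketch below; all values are illustrative placeholders, not settings taken from Table 2.

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    """Key hyperparameters controlling the scale of a sparse MoE model.
    All default values are illustrative, not tied to any specific model."""
    num_experts: int = 8           # number of experts per MoE layer
    expert_hidden_dim: int = 4096  # hidden size of each expert's FFN
    top_k: int = 2                 # experts activated per token
    moe_layer_frequency: int = 2   # place an MoE layer every N Transformer blocks
    capacity_factor: float = 1.25  # slack on the per-expert token capacity

config = MoEConfig(num_experts=16, top_k=1)  # example override
```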

In addition, Table 3 lists the number of parameters and benchmark performance of some recent open source models.

[Table 3: Parameter counts and benchmark performance of recent open-source MoE models]

  • Activation function

Sparse MoE models built on dense Transformer architectures adopt activation functions similar to those of leading dense LLMs such as BERT, T5, GPT, and LLaMA. Activation functions have evolved from ReLU to more advanced options such as GeLU, GeGLU, and SwiGLU.

This trend also extends to other components of MoE models, which often incorporate techniques such as Root Mean Square Layer Normalization (RMSNorm), Grouped Query Attention (GQA), and Rotary Position Embedding (RoPE).

  • Shared Experts

DeepSpeed-MoE introduced the innovative Residual-MoE architecture, in which each token is processed by a fixed expert plus a gate-selected expert, so that two experts participate in the processing at each layer while the communication cost does not exceed that of top-1 gating. This approach treats the gate-selected MoE expert as an error-correcting aid to the fixed dense FFN.

The conditional MoE routing (CMR, Conditional MoE Routing) used in NLLB adopts a similar approach, combining the outputs of the dense FFN and the MoE layer.

The paradigm that integrates fixed FFN and sparse MoE is often called shared experts, as shown in Figure 5b.

[Figure 5b: The shared-expert paradigm combining a fixed FFN with sparse MoE]

Models such as DeepSeekMoE, OpenMoE, Qwen1.5-MoE and MoCLE have recently adopted this paradigm, indicating that it is becoming a mainstream configuration. However, DeepSeekMoE and Qwen1.5-MoE use multiple shared experts instead of a single one.
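Below is a hedged sketch of the shared-expert idea: every token passes through a fixed, always-active FFN plus one gate-selected routed expert, and the two outputs are summed. Names and sizes are assumptions; this illustrates the paradigm rather than the implementation of DeepSpeed-MoE or any specific model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_ffn(d_model: int, d_hidden: int) -> nn.Module:
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_model))

class SharedExpertMoE(nn.Module):
    """Shared-expert sketch: output = shared dense FFN(x) + top-1 routed expert(x)."""
    def __init__(self, d_model: int, d_hidden: int, num_routed_experts: int):
        super().__init__()
        self.shared_expert = make_ffn(d_model, d_hidden)   # always active
        self.routed_experts = nn.ModuleList(
            [make_ffn(d_model, d_hidden) for _ in range(num_routed_experts)]
        )
        self.gate = nn.Linear(d_model, num_routed_experts)

    def forward(self, x):                                  # x: (T, d_model)
        probs = F.softmax(self.gate(x), dim=-1)            # (T, N)
        top_p, top_i = probs.max(dim=-1)                   # top-1 routing
        routed = torch.zeros_like(x)
        for i, expert in enumerate(self.routed_experts):
            mask = top_i == i
            if mask.any():
                routed[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return self.shared_expert(x) + routed

# usage: y = SharedExpertMoE(512, 2048, num_routed_experts=8)(torch.randn(4, 512))
```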

Mixture of parameter-efficient experts (MoPE)

Parameter-efficient fine-tuning (PEFT) is a method to improve fine-tuning efficiency. Simply put, PEFT updates only a small part of the parameters of the base model during fine-tuning.

PEFT has been successful, but because of its limited number of trainable parameters and the risk of catastrophic forgetting, it is difficult to apply in settings that require generalization across multiple tasks.

To alleviate these limitations, the mixture of parameter-efficient experts (MoPE) was born, integrating the MoE framework with PEFT. MoPE incorporates MoE's gating mechanism and multi-expert architecture, with each expert built using PEFT techniques. This combination can greatly improve the performance of PEFT in multi-task scenarios. In addition, because the experts are built with PEFT, MoPE uses fewer parameters and is much more resource-efficient than a traditional MoE model.

MoPE combines the multi-tasking characteristics of MoE and the resource efficiency of PEFT, which is a very promising research direction. Figure 6 classifies MoPEs according to their position in the Transformer model architecture. For a more detailed introduction to research results on MoPE, please refer to the original paper.

[Figure 6: Classification of MoPE methods by their position in the Transformer architecture]
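As a concrete but hedged illustration of the MoPE idea, the sketch below attaches a gated mixture of LoRA experts to a frozen base linear layer. The rank, expert count, gating design, and class names are assumptions for illustration only, not a specific published method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """One parameter-efficient expert: a low-rank update B(A(x)) to a frozen layer."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)        # start as a no-op update

    def forward(self, x):
        return self.B(self.A(x))

class MoLoRALayer(nn.Module):
    """MoPE sketch: frozen base projection + gated mixture of LoRA experts."""
    def __init__(self, base: nn.Linear, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # only the experts and the gate are trained
        self.experts = nn.ModuleList(
            [LoRAExpert(base.in_features, base.out_features, rank)
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(base.in_features, num_experts)

    def forward(self, x):
        w = F.softmax(self.gate(x), dim=-1)                        # (..., N)
        delta = torch.stack([e(x) for e in self.experts], dim=-1)  # (..., d_out, N)
        return self.base(x) + torch.einsum("...dn,...n->...d", delta, w)

# usage: layer = MoLoRALayer(nn.Linear(512, 512)); y = layer(torch.randn(2, 512))
```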

Training and inference solutions

As mixture of experts progresses and develops, the related training and inference schemes are evolving as well.

The original training and inference scheme trains an MoE model from scratch and performs inference directly with the trained model configuration.

But now, many new paradigms have emerged in the training and inference of MoE models, including combining the advantages of dense and sparse models to complement each other.

[Figure 7: MoE training and inference schemes]

Figure 7 shows the training and inference solutions related to MoE. It can be seen that the emerging solutions can be divided into three categories:

  • Dense-to-sparse: start from dense model training and gradually transition to a sparse MoE configuration;

  • Sparse-to-dense: downgrade a sparse MoE model to a dense form, which makes inference easier to deploy on hardware;

  • Expert model fusion: integrate multiple pre-trained dense expert models into a single unified MoE model.

Technologies derived from MoE

Mixture of experts (MoE) has inspired many variant techniques. For example, Xue et al.'s paper "Go Wider Instead of Deeper" proposes WideNet, which increases model width by replacing the feed-forward network (FFN) with an MoE layer while sharing the trainable parameters across Transformer layers, except for the normalization layers.

In addition, there are SUT (Sparse Universal Transformer) proposed by Tan et al., MoT (Mixture of Tokens) proposed by Antoniak et al., SMoP (Sparse Mixture-of-Prompts) proposed by Choi et al., Lifelong-MoE proposed by Chen et al., and MoD (Mixture-of-Depths) proposed by Raposo et al.

In summary, the development of MoE-derived techniques reveals a trend: MoE is gaining more and more functionality and becoming increasingly adaptable to different domains.

System design for mixture of experts

While mixture of experts (MoE) can enhance the capabilities of large language models, its sparse and dynamic computational workload also brings new technical challenges.

GShard introduced expert parallelism, which dispatches partitioned groups of tokens subject to the load-balancing constraint of expert capacity, enabling gating and expert computation to run in parallel. This paradigm has become a fundamental strategy for scaling MoE models efficiently. It can be viewed as an enhanced version of data parallelism: each expert in the MoE layer is assigned to a different device, while all non-expert layers are replicated on every device.

As shown in Figure 8a, the expert-parallelism workflow performs the following operations in sequence: gate routing, input encode, All-to-All dispatch, expert computation, All-to-All combine, and output decode.

[Figure 8: Expert parallelism and hybrid parallelization strategies]

Generally, the GEMM input needs to be large enough to fully utilize the computing device. Input encode therefore aggregates the tokens destined for the same expert into a contiguous memory region, as determined by the token-expert mapping produced by gate routing. The All-to-All dispatch then sends the input tokens to the corresponding experts on each device, after which the experts perform their computations locally. Once the computation finishes, the results are gathered via All-to-All combine, and output decode restores the original data layout according to the gating index.
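To make the encode / dispatch / compute / combine / decode sequence concrete, the single-process sketch below mimics the data movement using plain tensor permutations in place of the All-to-All collectives. It is a conceptual illustration under a top-1 routing assumption, not a distributed implementation.

```python
import torch

def expert_parallel_step(x, expert_index, experts):
    """Mimic the expert-parallel workflow on a single device.

    x:            (T, d) token representations after gate routing.
    expert_index: (T,)   expert chosen for each token (top-1 for simplicity).
    experts:      list of callables, one per expert.
    """
    # Input encode: group tokens destined for the same expert into contiguous memory
    order = torch.argsort(expert_index)
    grouped = x[order]
    counts = torch.bincount(expert_index, minlength=len(experts)).tolist()

    # (In a real system, an All-to-All dispatch would ship each group
    #  to the device holding its expert here.)

    # Expert computation: each expert processes its contiguous chunk
    outputs, start = [], 0
    for expert, n in zip(experts, counts):
        chunk = grouped[start:start + n]
        outputs.append(expert(chunk) if n > 0 else chunk)
        start += n
    combined = torch.cat(outputs, dim=0)

    # (An All-to-All combine would gather the expert outputs back here.)

    # Output decode: restore the original token order using the gating index
    restored = torch.empty_like(combined)
    restored[order] = combined
    return restored

# usage:
# experts = [torch.nn.Linear(64, 64) for _ in range(4)]
# y = expert_parallel_step(torch.randn(10, 64), torch.randint(0, 4, (10,)), experts)
```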

In addition, some researchers are exploring the synergy between expert parallelism and other existing parallel strategies (such as tensor, pipeline, and sequence parallelism) to improve the scalability and efficiency of MoE models in large-scale distributed environments.

Some hybrid parallelization examples are given in Figure 8, including (b) data + expert + tensor parallelization, (c) data + expert + pipeline parallelization, (d) expert + tensor parallelization.

It is important to recognize the complex interplay between computational efficiency, communication load, and memory usage. The choice of distributed parallelization strategy affects this interplay and is in turn affected by the hardware configuration. Therefore, when deploying strategies in practical applications, careful trade-offs must be made and the strategy adapted to the specific scenario.

The team then describes the system design challenges faced in developing MoE models, and the research addressing them, in three major sections: computation, communication, and storage. See the original paper for details. Table 4 gives an overview of open-source MoE frameworks.

[Table 4: Overview of open-source MoE frameworks]

Applications of mixture of experts

In the field of large language models (LLMs), currently dominated by the Transformer, the mixture-of-experts (MoE) paradigm is very attractive because it can substantially improve model capabilities without introducing excessive computational requirements at the training and inference stages. This type of technology can significantly improve LLM performance across a variety of downstream tasks, and can even enable AI applications that surpass human-level performance.

There are even rumors that the very powerful GPT-4 may adopt some form of MoE architecture, consisting of 8 experts with 220 billion parameters each, trained on diverse datasets and tasks, and iterating the inference process 16 times. For details of this rumor, see this site's report "The ultimate 'leak': GPT-4's model architecture, training cost, and dataset information revealed".

It is therefore not surprising that MoE is flourishing in natural language processing, computer vision, recommender systems, and multimodal applications.

Essentially, these applications either use conditional computation to greatly increase the number of model parameters and thereby improve model performance at a fixed compute cost, or use the gating mechanism for dynamic expert selection to achieve efficient multi-task learning.

The team also presents representative MoE applications in these different domains to help readers understand how to use MoE for specific tasks. See the original paper for details.

Challenges and opportunities

Mixture of experts is powerful: it reduces cost and improves performance. The outlook is promising, but challenges remain.

In this section, the team summarizes the key challenges related to MoE and points out future research directions that can be expected to yield important results. These challenges and directions are briefly listed below; see the original paper for details.

  • Training stability and load balancing

  • Scalability and communication overhead

  • Expert specialization and collaboration

  • Sparse activation and computational efficiency

  • Generalization and robustness

  • Interpretability and transparency

  • Optimal expert architecture

  • Integration with existing frameworks

More: MoE-related reports

Basics:

  • A 30-year historical review; Jeff Dean: a review of "sparse expert models" summarizing the research

  • Why are MoE-based large models more worthy of attention?

  • What is happening with MoE, popularized by OpenAI and Mistral AI? An introduction to the mixture-of-experts architecture that is attracting attention in the machine learning community

  • Will MoE be the future of NLP and CV?

  • A step-by-step guide to implementing a sparse mixture-of-experts (MoE) language model from scratch

Frontier:
