In the current wave of large models, training and deploying state-of-the-art dense LLMs poses huge challenges in compute requirements and cost, especially at scales of tens or hundreds of billions of parameters. To address these challenges, sparse models such as Mixture of Experts (MoE) models have become increasingly important. These models offer an economically viable alternative by distributing computation across specialized sub-models, or "experts," with the potential to match or even exceed the performance of dense models at much lower resource requirements.
On June 3, important news came from the open-source large model community: Kunlun Wanwei announced the open-sourcing of Skywork-MoE, a 200-billion-parameter sparse large model that greatly reduces inference cost while maintaining strong performance.
Skywork-MoE is extended from an intermediate checkpoint of Kunlun Wanwei's previously open-sourced Skywork-13B model. It is the first open-source 100-billion-scale MoE large model to fully apply and implement MoE Upcycling technology, and also the first open-source 100-billion-scale MoE large model that supports inference on a single 4090 server.
What draws even more attention from the large model community is that Skywork-MoE's model weights and technical report are fully open source and free for commercial use, with no application required.
Model weight download address:
○ https://huggingface.co/Skywork/Skywork-MoE-base
○ https://huggingface.co/Skywork/Skywork-MoE-Base-FP8
Open-source model repository: https://github.com/SkyworkAI/Skywork-MoE
Model technical report: https://github.com/SkyworkAI/Skywork-MoE/blob/main/skywork-moe-tech-report.pdf
Model inference code (supports 8-bit quantized loading and inference on an 8x4090 server): https://github.com/SkyworkAI/vllm
Skywork-MoE is currently the largest open-source MoE model that can run inference on an 8x4090 server. An 8x4090 server has 192GB of GPU memory in total. With FP8 quantization (the weights occupy 146GB) and the non-uniform Tensor Parallel inference method pioneered by the Kunlun Wanwei team, Skywork-MoE reaches a throughput of 2200 tokens/s at a suitable batch size.
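The memory figures above can be checked with simple arithmetic. The sketch below assumes 24GB per RTX 4090 (consistent with the 192GB total) and one byte per weight under FP8; the variable names are illustrative only, and the leftover memory is a rough upper bound on what remains for the KV cache and activations.

```python
# Back-of-the-envelope check of the GPU memory figures quoted in the text.
NUM_GPUS = 8
MEM_PER_GPU_GB = 24        # assumption: 24 GB per RTX 4090, so 8 x 24 = 192 GB
TOTAL_PARAMS_BILLION = 146 # total parameter count from the article
BYTES_PER_PARAM_FP8 = 1    # FP8 stores each weight in one byte

total_mem_gb = NUM_GPUS * MEM_PER_GPU_GB
# 1 billion parameters at 1 byte each is about 1 GB (decimal).
weight_gb = TOTAL_PARAMS_BILLION * BYTES_PER_PARAM_FP8
headroom_gb = total_mem_gb - weight_gb

print(f"total GPU memory: {total_mem_gb} GB")
print(f"FP8 weights:      {weight_gb} GB")
print(f"headroom:         {headroom_gb} GB")
```

At 16 bits per weight the same model would need roughly 292GB, which exceeds the 192GB available; this is why the 8-bit path matters on this hardware.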
For the complete inference framework code and environment setup, see: https://github.com/SkyworkAI/Skywork-MoE
Skywork-MoE Introduction
The open-sourced Skywork-MoE model belongs to the Tiangong 3.0 R&D model series and is its mid-range model (Skywork-MoE-Medium). The model has 146B total parameters and 22B activated parameters, with 16 Experts in total; each Expert is 13B in size, and 2 Experts are activated at a time.
It is understood that Tiangong 3.0 has also trained two other MoE models, the 75B Skywork-MoE-Small and the 400B Skywork-MoE-Large, which are not part of this open-source release.
Kunlun Wanwei evaluated Skywork-MoE on the current mainstream model benchmarks. At the same activated parameter count of 20B (that is, the same inference compute), Skywork-MoE's capabilities are at the forefront of the industry and close to those of a 70B dense model, which reduces the model's inference cost by nearly 3 times.
It is worth noting that Skywork-MoE's total parameter count is about 1/3 smaller than that of DeepSeekV2, so it achieves similar capabilities with fewer parameters.
Technical Innovation
To address the difficulty of training MoE models and their poor generalization, Skywork-MoE introduces two training optimization algorithms:
Gating Logits Normalization operation
Skywork-MoE adds a new normalization operation to the token-routing logic of the gating layer, so that the gating layer's parameter learning leans more toward the selected top-2 experts and the MoE model becomes more confident in its top-2 choice.
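The article does not reproduce the formula, so the following is only a minimal sketch of one plausible form of gating logit normalization: standardize the gating logits to zero mean and unit variance, multiply by a scale factor, then apply softmax. The function names, the `scale` parameter, and the example logits are assumptions for illustration, not the team's code; the technical report is the authoritative reference.

```python
import math

def normalized_gate(logits, scale=1.0, eps=1e-6):
    """Standardize gating logits, rescale them, then apply softmax.

    `scale` is a hypothetical knob: larger values sharpen the distribution.
    """
    n = len(logits)
    mean = sum(logits) / n
    var = sum((z - mean) ** 2 for z in logits) / n
    std = math.sqrt(var) + eps
    z_norm = [scale * (z - mean) / std for z in logits]
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(z_norm)
    exps = [math.exp(z - m) for z in z_norm]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k=2):
    """Indices of the k experts with the highest gate probability."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

logits = [0.10, 0.20, 0.15, 0.05]   # nearly flat logits for 4 hypothetical experts
soft = normalized_gate(logits, scale=1.0)
sharp = normalized_gate(logits, scale=4.0)
top2 = top_k(sharp, k=2)
top2_mass_soft = sum(soft[i] for i in top2)
top2_mass_sharp = sum(sharp[i] for i in top2)
print(top2, round(top2_mass_soft, 3), round(top2_mass_sharp, 3))
```

With nearly flat raw logits, a plain softmax spreads probability almost evenly over the experts. Standardizing and scaling concentrates more of the probability mass on the top-2 experts, which is the effect the paragraph describes.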
Adaptive Aux Loss
Unlike the traditional aux loss with a fixed coefficient (a fixed hyperparameter), Skywork-MoE lets the model adaptively choose an appropriate aux loss coefficient at different stages of MoE training. This keeps the Drop Token Rate within an appropriate range, which both balances the load across experts and lets the experts differentiate, improving the model's overall performance and generalization. In the early stage of MoE training, the parameters are not yet well learned, so the Drop Token Rate is too high (the token distribution is too uneven), and a larger aux loss is needed to help balance the token load. In the later stage of MoE training, the Skywork-MoE team wants the Experts to remain differentiated to some degree and to avoid the gating layer drifting toward distributing tokens randomly, so a smaller aux loss is needed to weaken the correction.
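The article does not give the update rule, so the sketch below is an illustrative assumption rather than the team's algorithm. It smooths the coefficient with an exponential moving average toward a target that grows with the observed drop token rate and is capped at a maximum. The names `sensitivity`, `max_coeff`, and `beta`, and all numeric values, are hypothetical.

```python
def update_aux_coeff(coeff, drop_rate, sensitivity=0.2, max_coeff=0.05, beta=0.9):
    """One hypothetical update step for the aux loss coefficient.

    A high drop token rate pulls the coefficient up (stronger balancing);
    a low drop token rate lets it decay (weaker correction).
    """
    target = min(sensitivity * drop_rate, max_coeff)
    return beta * coeff + (1.0 - beta) * target

# Early training: many tokens dropped, so the coefficient rises.
early = update_aux_coeff(0.01, drop_rate=0.20)
# Late training: few tokens dropped, so the coefficient falls.
late = update_aux_coeff(0.01, drop_rate=0.01)

# Simulate a run in which the drop token rate falls over time.
coeff = 0.01
history = []
for drop_rate in [0.30, 0.25, 0.20, 0.10, 0.05, 0.02, 0.01, 0.01]:
    coeff = update_aux_coeff(coeff, drop_rate)
    history.append(coeff)
print([round(c, 4) for c in history])
```

The moving average keeps the coefficient from jumping on a single noisy batch, and the cap bounds how strongly load balancing can dominate the training loss.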
Training Infra
Efficient large-scale distributed training of MoE models is a difficult challenge. Skywork-MoE proposes two important parallelism optimizations that achieve a training throughput of 38% MFU (model FLOPs utilization) on a thousand-GPU cluster, where the theoretical compute used for MFU is calculated with the 22B activated parameters.
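As a rough illustration of what an MFU figure means, the sketch below uses the common approximation of about 6 FLOPs per activated parameter per training token (forward plus backward, ignoring attention-specific terms). The GPU count and the peak FLOPs per GPU are hypothetical placeholders, since the article does not name the hardware; only the 22B activated parameters and the 38% figure come from the text.

```python
def model_flops_utilization(tokens_per_sec, active_params, num_gpus, peak_flops_per_gpu):
    """MFU = achieved training FLOPs per second / peak hardware FLOPs per second."""
    flops_per_token = 6 * active_params   # approximation: forward + backward pass
    achieved = tokens_per_sec * flops_per_token
    return achieved / (num_gpus * peak_flops_per_gpu)

def tokens_per_sec_at(mfu, active_params, num_gpus, peak_flops_per_gpu):
    """Inverse of the formula above: cluster throughput implied by a given MFU."""
    return mfu * num_gpus * peak_flops_per_gpu / (6 * active_params)

ACTIVE_PARAMS = 22e9    # activated parameters, from the article
NUM_GPUS = 1000         # hypothetical: "thousand-GPU cluster"
PEAK_FLOPS = 312e12     # hypothetical per-GPU peak; the hardware is not stated

implied_tps = tokens_per_sec_at(0.38, ACTIVE_PARAMS, NUM_GPUS, PEAK_FLOPS)
round_trip = model_flops_utilization(implied_tps, ACTIVE_PARAMS, NUM_GPUS, PEAK_FLOPS)
print(f"implied throughput: {implied_tps:,.0f} tokens/s, MFU = {round_trip:.2f}")
```

Using the activated parameter count rather than the 146B total is what makes the figure meaningful for a sparse model: only the parameters that actually process a token contribute FLOPs.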
Expert Data Parallel
Unlike the existing EP (Expert Parallel) and ETP (Expert Tensor Parallel) designs in the Megatron-LM community, the Skywork-MoE team proposed a parallel design called Expert Data Parallel (EDP). With this design, the model can still be partitioned efficiently when the number of Experts is small, and the all2all communication introduced by the Experts can be optimized and overlapped as much as possible. Compared with EP's constraint on the number of GPUs and ETP's inefficiency on thousand-GPU clusters, EDP better addresses the parallelism pain points of large-scale distributed MoE training. At the same time, the EDP design is simple, robust, and easy to scale, and it can be implemented and verified relatively quickly.
The simplest EDP example: with two GPUs, TP = 2 and EP = 2, where the attention part uses Tensor Parallel and the Expert part uses Expert Parallel.
Non-uniform split pipeline parallel
Because of the Embedding computation in the first stage, the Loss computation in the last stage, and the Pipeline Buffer, splitting the layers evenly under pipeline parallelism leaves the compute load and GPU memory load of the stages clearly imbalanced. The Skywork-MoE team proposed a non-uniform pipeline-parallel split and recomputation layer allocation scheme that balances the overall compute and GPU memory load and improves end-to-end training throughput by about 10%.
Comparison of pipeline-parallel bubbles under uniform and non-uniform splitting, for a 24-layer LLM: (a) is a uniform split into 4 stages, with [6, 6, 6, 6] layers per stage; (b) is the optimized non-uniform split into 5 stages, with [5, 5, 5, 5, 4] layers per stage. In the middle phase, when the pipeline is full, the non-uniform split has fewer bubbles.
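To make the load-balancing argument concrete, here is a toy cost model for the two splits described above. The per-layer, embedding, and loss costs are hypothetical relative units chosen only for illustration, and the model ignores the pipeline buffer memory and the recomputation allocation that the team's scheme also covers.

```python
def stage_loads(layers_per_stage, layer_cost=1.0, embed_cost=0.5, loss_cost=1.0):
    """Relative compute load per pipeline stage (all costs are hypothetical)."""
    loads = [n * layer_cost for n in layers_per_stage]
    loads[0] += embed_cost    # the first stage also computes the embedding
    loads[-1] += loss_cost    # the last stage also computes the loss
    return loads

def imbalance(loads):
    """Heaviest stage relative to the average stage; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))

uniform = stage_loads([6, 6, 6, 6])          # split (a): 4 stages
non_uniform = stage_loads([5, 5, 5, 5, 4])   # split (b): 5 stages

print(uniform, round(imbalance(uniform), 3))
print(non_uniform, round(imbalance(non_uniform), 3))
```

Under these assumed costs, giving the last stage one fewer layer offsets its loss computation, so the heaviest stage sits closer to the average and the other stages spend less time waiting on it.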
In addition, Skywork-MoE ran a series of Scaling Law experiments to explore which constraints affect the quality of MoE models trained by Upcycling versus From Scratch. A rule of thumb follows: if the FLOPs budget for training the MoE model is more than 2 times that of training the Dense model, training the MoE From Scratch is the better choice; otherwise, Upcycling can significantly reduce the training cost.
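The rule of thumb can be written as a small decision helper. The function name and the example budgets are hypothetical; the only fact taken from the text is the 2x threshold between the MoE training FLOPs and the Dense training FLOPs.

```python
def choose_moe_training_strategy(moe_train_flops, dense_train_flops):
    """Apply the article's rule of thumb for how to train an MoE model."""
    if moe_train_flops > 2 * dense_train_flops:
        return "from_scratch"
    return "upcycling"

# Hypothetical budgets, in arbitrary but consistent FLOPs units.
print(choose_moe_training_strategy(moe_train_flops=5.0, dense_train_flops=2.0))
print(choose_moe_training_strategy(moe_train_flops=3.0, dense_train_flops=2.0))
```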