Table of Contents
1. The birth of Wen Xinyiyan
2. High-performance cluster design
3. Challenges of large model training
4. The process of large model training
Phase 1: Parallel Strategy and Training Optimization
Phase 2: Resource Management and Task Scheduling
5. Full-stack integration, "AI Big Base" accelerates large model training
5.1 Parallel Strategy and Training Optimization
Model Splitting
Topology awareness
Automatic Parallel
End-to-end adaptive training
Training Optimization
5.2 Resource Management and Task Scheduling
Resource Management
Elastic Fault Tolerance
6. AI inclusiveness in the era of large models

AI large base, the answer to the era of large models

May 11, 2023, 4:25 PM

1. The birth of Wen Xinyiyan

"Wen Xinyiyan completed training on the largest high-performance GPU cluster in the country's AI field."

As early as June 2021, in order to meet future large model training tasks, Baidu Intelligent Cloud began planning a new high-performance GPU cluster and, together with NVIDIA, completed the design of an IB network architecture that can accommodate more than 10,000 cards. Every GPU card in the cluster is connected across nodes through the IB network. Cluster construction was completed in April 2022, providing EFLOPS-level computing power in a single cluster.

In March 2023, Wen Xinyiyan was born on this high-performance cluster and continues to iterate new capabilities. Currently, the size of this cluster is still expanding.

Dr. Junjie Lai, General Manager of Solutions and Engineering at NVIDIA China: GPU clusters interconnected by high-speed IB networks are key infrastructure in the era of large models. NVIDIA and Baidu Intelligent Cloud have jointly built the largest high-performance GPU/IB cluster in China's cloud computing market, which will accelerate Baidu's further breakthroughs in the field of large models.

2. High-performance cluster design

A high-performance cluster is not a simple accumulation of computing power. It also requires dedicated design and optimization to bring out the cluster's full computing power.

In distributed training, GPUs communicate continuously both within and between machines. While high-performance networks such as IB and RoCE provide high-throughput, low-latency inter-machine communication, the internal network connections of the servers and the communication topology of the cluster network also need to be specially designed to meet the communication requirements of large model training.

Achieving the ultimate design optimization requires a deep understanding of what each operation in an AI task means to the infrastructure. Different parallel strategies in distributed training, that is, how models, data, and parameters are split, produce different data communication requirements. For example, data parallelism introduces a large number of inter-machine Allreduce operations, model parallelism introduces intra-machine Allreduce operations, expert parallelism produces inter-machine All2All operations, and 4D hybrid parallelism combines the communication operations generated by all of these parallel strategies.
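
To make the communication patterns above concrete, here is a minimal sketch using PyTorch's torch.distributed collectives. It only illustrates the primitives themselves; the launch setup and tensor sizes are assumptions, and this is not Baidu's training code.

```python
# Minimal sketch of the collectives mentioned above (assumed setup,
# not production code). Launch with, e.g.:
#   torchrun --nproc_per_node=8 collectives_demo.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # NCCL rides on IB/RoCE when available
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Data/model parallelism: Allreduce, e.g. summing gradients across ranks.
    grad = torch.ones(1024, device="cuda") * rank
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    # Expert parallelism: All2All, each rank exchanges one shard with every other rank.
    send = torch.ones(world, 256, device="cuda") * rank
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```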

To this end, Baidu Intelligent Cloud optimizes the design of both the individual servers and the cluster network to build high-performance GPU clusters.

In terms of individual servers, Baidu Intelligent Cloud's super AI computer X-MAN has now evolved to its fourth generation. X-MAN 4.0 establishes high-performance inter-card communication for GPUs, providing 134 GB/s of Allreduce bandwidth within a single machine. It is currently Baidu's most customized server product with the most dedicated components. In the MLCommons 1.1 list, X-MAN 4.0 ranked in the top 2 for single-machine hardware performance under the same configuration.

In terms of the cluster network, a three-layer Clos architecture optimized for large model training was specially designed to ensure the cluster's performance and acceleration during large-scale training. Compared with the traditional approach, this architecture adopts an eight-rail optimized design that minimizes the number of hops for communication between same-numbered cards in different machines, providing high-throughput, low-latency network services for the same-numbered-card Allreduce operations that account for the largest share of network traffic in AI training.

This network architecture can support an ultra-large-scale cluster of up to 16,000 cards, the largest scale achievable with box-switch IB networking at this stage. The cluster's network performance is stable and consistent at a level of 98%, close to a state of steady communication. As verified by the large model algorithm team, training jobs for hundred-billion-parameter models were submitted on this ultra-large-scale cluster, and overall training efficiency at the same machine scale was 3.87 times that of the previous generation cluster.

However, building a large-scale, high-performance heterogeneous cluster is only the first step toward successfully implementing large models. Ensuring that AI large model training tasks complete successfully requires more systematic optimization of software and hardware.

3. Challenges of large model training

In the past few years, the parameter size of large models has increased at a rate of roughly 10 times per year. Around 2020, a model with tens of billions of parameters counted as a large model; by 2022, hundreds of billions of parameters were needed to qualify as one.

Before large models, an AI model could usually be trained on a single machine with a single card, or a single machine with multiple cards, with training cycles ranging from hours to days. Now, completing the training of a large model with hundreds of billions of parameters requires large-cluster distributed training with hundreds of servers and thousands of GPU/XPU cards, and the training cycle has stretched to months.

To train GPT-3, with 175 billion parameters on 300 billion tokens of data, a single A100 would take 32 years based on its half-precision peak performance, while 1,024 A100s would take 34 days at a resource utilization of 45%. Of course, even setting time aside, a single A100 cannot train a model with hundreds of billions of parameters, because the model's parameters alone already exceed the memory capacity of a single card.

To train large models in a distributed environment and shorten the training cycle from the decades required by a single card to dozens of days, various challenges such as the computing wall, the video memory wall, and the communication wall must be overcome, so that all resources in the cluster can be fully utilized to speed up the training process and shorten the training cycle.

The computing wall refers to the huge gap between the computing power of a single card and the total compute required by the model. A single A100 card offers only 312 TFLOPS of computing power, while GPT-3 requires about 314 ZFLOPs of total compute, a difference of nine orders of magnitude between the two figures.

The video memory wall refers to the fact that a single card cannot hold all the parameters of a large model. GPT-3's 175 billion parameters alone require 700 GB of video memory (counting 4 bytes per parameter), while an NVIDIA A100 GPU has only 80 GB of video memory.
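
As a rough back-of-envelope check of these figures (our own illustrative arithmetic, assuming the common estimate of roughly 6 × parameters × tokens for training FLOPs and 4 bytes per parameter, not Baidu's calculation):

```python
# Back-of-envelope arithmetic for the computing wall and video memory wall.
# Assumptions: training FLOPs ~= 6 * parameters * tokens; fp32 parameter storage.
params = 175e9          # GPT-3 parameters
tokens = 300e9          # training tokens
a100_flops = 312e12     # A100 half-precision peak, FLOP/s

total_flops = 6 * params * tokens                              # ~3.15e23 FLOPs (~314 ZFLOPs)
one_card_years = total_flops / a100_flops / (365 * 24 * 3600)  # ~32 years at peak

param_memory_gb = params * 4 / 1e9                             # ~700 GB at 4 bytes/parameter

print(f"total compute   : {total_flops:.2e} FLOPs")
print(f"single A100     : {one_card_years:.0f} years at peak")
print(f"parameter memory: {param_memory_gb:.0f} GB vs 80 GB on one A100")
```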

The essence of the computing wall and the video memory wall is the contradiction between the limited capability of a single card and the enormous storage and computing requirements of the model. This can be solved through distributed training, but distributed training then runs into the communication wall.

The communication wall arises mainly because, under distributed training, the cluster's computing units need to synchronize parameters frequently, and communication performance affects overall computing speed. If the communication wall is not handled well, the cluster may well grow larger while training efficiency decreases. Successfully breaking through the communication wall shows up as strong cluster scalability, that is, multi-card acceleration that keeps pace with the cluster scale. The multi-card linear speedup ratio is the metric used to evaluate a cluster's multi-card acceleration capability; the higher the value, the better.
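
A minimal sketch of how such a linear speedup ratio can be computed from measured throughput (one common formulation; the exact metric definition may differ in practice):

```python
# Linear speedup ratio: measured N-card throughput divided by N times the
# single-card throughput. A value of 1.0 means perfectly linear scaling.
def linear_speedup_ratio(throughput_n_cards: float,
                         throughput_1_card: float,
                         n_cards: int) -> float:
    return throughput_n_cards / (n_cards * throughput_1_card)

# Example: 1,000 cards delivering 900x the single-card throughput -> 90%.
print(linear_speedup_ratio(900.0, 1.0, 1000))  # 0.9
```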

These walls begin to appear in multi-machine, multi-card training. As the parameters of large models grow, the corresponding cluster size grows, and these three walls grow higher. At the same time, during long-term training on large clusters, equipment failures may occur, which can affect or interrupt the training process.

4. The process of large model training

Generally speaking, from the perspective of infrastructure, the entire large model training process can be roughly divided into the following two phases:

Phase 1: Parallel Strategy and Training Optimization

After the large model to be trained is submitted, the AI framework comprehensively considers the structure of the model and other information, as well as the capabilities of the training cluster, to formulate a parallel training strategy for the task and complete AI task placement. This process is model splitting and task placement, that is, how to split the large model and how to place the split parts onto each GPU/XPU in the cluster.

For the AI tasks placed on GPUs/XPUs, the AI framework works with the training cluster to perform full-link optimization at the single-card runtime and cluster communication levels, accelerating each AI task during large model training, including data loading, operator computation, and communication strategy. For example, ordinary operators in AI tasks are replaced with optimized high-performance operators, and communication strategies adapted to the current parallel strategy and the training cluster's network capabilities are provided.

Phase 2: Resource Management and Task Scheduling

The large model training task starts running according to the parallel strategy formulated above, and the training cluster provides various high-performance resources for the AI task: for example, the environment in which the AI task runs, how resources are provided to the AI task, the storage the AI task uses to read and save data, the type of network over which the GPUs/XPUs communicate, and so on.

At the same time, during operation the training cluster works with the AI framework to provide a reliable environment for long-term large model training through elastic fault tolerance and other methods: for example, how to observe and perceive the running status of the various resources and AI tasks in the cluster, and how to reschedule resources and AI tasks when the cluster changes.

From the breakdown of the above two phases, we can see that the entire large model training process relies on close cooperation between the AI framework and the training cluster to break through the three walls and jointly ensure that large model training is efficient and stable.

5. Full-stack integration, "AI Big Base" accelerates large model training

Combining years of technology accumulation and engineering practice in AI and large models, Baidu launched its full-stack self-developed AI infrastructure, the "AI Big Base", at the end of 2022. It comprises a three-layer technology stack of "chip - framework - model", with key self-developed technologies and leading products at each level, corresponding to Kunlun Core, PaddlePaddle, and the Wenxin large models.

Based on this three-layer technology stack, Baidu Intelligent Cloud has launched two major AI engineering platforms, the "AI Middle Platform" and the "Baidu Baige AI Heterogeneous Computing Platform", which improve efficiency at the development level and the resource level respectively, break through the three walls, and accelerate the training process.

Among them, "AI middle platform" relies on the AI ​​framework to develop parallel strategies and optimized environments for the large model training process, covering the entire life cycle of training. "Baidu Baige" realizes efficient chip enablement and provides management of various AI resources and task scheduling capabilities.

(Figure: AI Big Base, the answer to the era of large models)

Baidu's "AI Big Base" has carried out full-stack integration and system optimization of each layer of the technology stack, completing the technology integration construction of cloud and intelligence. End-to-end optimization and acceleration of large model training can be achieved.

Hou Zhenyu, Vice President of Baidu Group: Large model training is a systematic project. Cluster size, training time, and cost have all increased considerably compared with the past. Without full-stack optimization, it would be difficult to ensure that large model training completes successfully. Baidu's years of technical investment and engineering practice in large models have enabled us to build a complete software stack to accelerate large model training.

Next, following the two phases of the large model training process described above, we explain how the layers of the "AI Big Base" technology stack are integrated and systematically optimized to achieve end-to-end optimization and acceleration of large model training.

5.1 Parallel Strategy and Training Optimization

Model Splitting

PaddlePaddle (Flying Paddle) provides rich parallel strategies for large model training, such as data parallelism, model parallelism, pipeline parallelism, grouped parameter sharding, and expert parallelism. These parallel strategies can meet the needs of training large models with parameters ranging from billions to hundreds of billions or even trillions, and achieve breakthroughs in the computing and video memory walls. In April 2021, PaddlePaddle was the first in the industry to propose the 4D hybrid parallel strategy, which can support hundred-billion-parameter model training to be completed at the month level.
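
As a rough sketch of how a hybrid parallel strategy of this kind is typically configured with PaddlePaddle's fleet API (the parallel degrees below are assumptions for illustration, not the settings of Wenxin or any other Baidu model):

```python
# Illustrative hybrid-parallel configuration with PaddlePaddle's collective
# training API. The degrees below are assumptions for a 32-card job.
from paddle.distributed import fleet

strategy = fleet.DistributedStrategy()
strategy.hybrid_configs = {
    "dp_degree": 4,   # data parallelism across groups of machines
    "mp_degree": 4,   # tensor (model) parallelism inside a machine
    "pp_degree": 2,   # pipeline parallelism across stages
}
strategy.sharding = True  # grouped parameter sharding to save video memory

fleet.init(is_collective=True, strategy=strategy)
# ... build the model and optimizer, then wrap them:
# model = fleet.distributed_model(model)
# optimizer = fleet.distributed_optimizer(optimizer)
```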

Topology awareness

Baidu Baige has cluster topology awareness capabilities specially prepared for large model training scenarios, including intra-node architecture awareness and inter-node architecture awareness: for example, the computing power inside each server, the link modes between CPU and GPU/XPU and between GPU/XPU cards, and the network link modes between GPU/XPU cards on different servers.

Automatic Parallel

Before a large model training task starts running, PaddlePaddle can form a unified distributed resource view of the cluster based on Baidu Baige's topology awareness capabilities. At the same time, PaddlePaddle forms a unified logical computation view of the large model to be trained.

Based on these two views, PaddlePaddle automatically searches for the optimal model partitioning and hardware placement strategy, allocates the model parameters, gradients, and optimizer states to the different GPUs/XPUs according to that strategy, and completes AI task placement to improve training performance.

For example, model-parallel AI tasks are placed on different GPUs within the same server, and these GPUs are linked through the server's internal NVSwitch; data-parallel and pipeline-parallel AI tasks are placed on same-numbered GPUs in different servers, and these GPUs are linked through IB or RoCE. Placing AI tasks according to their type in this way makes efficient use of cluster resources and accelerates large model training.
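
The placement idea can be illustrated with a minimal process-group sketch, shown here with PyTorch's torch.distributed for generality and assuming 2 servers with 8 GPUs each; it is not PaddlePaddle's internal implementation.

```python
# Minimal sketch of the placement idea: model-parallel groups stay inside
# one server (NVSwitch), data-parallel groups span same-numbered GPUs on
# different servers (IB/RoCE). Assumes 2 nodes x 8 GPUs = 16 ranks and that
# dist.init_process_group() has already been called on every rank.
import torch.distributed as dist

GPUS_PER_NODE = 8
WORLD = 16

def build_groups():
    # Model-parallel groups: ranks on the same node, e.g. [0..7], [8..15].
    mp_groups = [dist.new_group(list(range(n * GPUS_PER_NODE, (n + 1) * GPUS_PER_NODE)))
                 for n in range(WORLD // GPUS_PER_NODE)]
    # Data-parallel groups: same local GPU index across nodes, e.g. [0, 8], [1, 9], ...
    dp_groups = [dist.new_group(list(range(i, WORLD, GPUS_PER_NODE)))
                 for i in range(GPUS_PER_NODE)]
    return mp_groups, dp_groups
```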

End-to-end adaptive training

While a training task is running, if the cluster changes, for example a resource fails or the cluster scale changes, Baidu Baige performs fault-tolerant replacement or elastic scaling. Because the locations of the nodes participating in the computation have changed, the communication pattern between them may no longer be optimal. PaddlePaddle can automatically adjust the model partitioning and AI task placement strategies based on the latest cluster information, while Baidu Baige completes the corresponding task and resource scheduling.

PaddlePaddle's unified resource and computation views and automatic parallel capabilities, combined with Baidu Baige's elastic scheduling capabilities, realize end-to-end adaptive distributed training of large models, covering the entire life cycle of cluster training.

This is a deep interaction between the AI framework and the AI heterogeneous computing platform. It achieves system-level optimization of computing power, framework, and algorithm as a trinity, supports automatic and elastic training of large models, and delivers a measured end-to-end performance improvement of 2.1x, ensuring the efficiency of large-scale training.

Training Optimization

After the model has been split and the AI tasks have been placed, the Baidu Baige platform provides a built-in AI acceleration suite to ensure that, during training, operators can be computed with acceleration in mainstream AI frameworks such as PaddlePaddle and PyTorch and on various computing cards. The AI acceleration suite includes storage acceleration at the data layer and the training and inference acceleration library AIAK, optimizing the entire link along the dimensions of data loading, model computation, and distributed communication.

Among them, the optimization of data loading and model computation effectively improves single-card operating efficiency, while the optimization of distributed communication, combined with the cluster's high-performance IB or RoCE network, the specially optimized communication topology, and reasonable AI task placement strategies, jointly solves the communication wall problem.

Baidu Baige’s multi-card acceleration ratio in a kilo-card scale cluster has reached 90%, allowing the overall computing power of the cluster to be fully released.

In the MLPerf Training v2.1 results released in November 2022, the model training performance results submitted by Baidu using PaddlePaddle plus Baidu Baige ranked first in the world under the same GPU configuration, with both end-to-end training time and training throughput exceeding the NGC PyTorch framework.

5.2 Resource Management and Task Scheduling

Baidu Baige runs all AI tasks on the CCE container engine and provides AI resource management, architecture awareness, elastic fault tolerance, and other capabilities through related container plug-ins, breaking through the computing, video memory, and communication walls at the resource efficiency level.

Resource Management

Baidu Baige can provide various computing, network, and storage AI resources, including Baidu Taihang elastic bare-metal servers (BBC), IB networks, RoCE networks, parallel file storage PFS, object storage BOS, data lake storage acceleration RapidFS, and other cloud computing resources suitable for large model training.

When a task runs, these high-performance resources can be combined appropriately to further improve the efficiency of AI operations and accelerate AI task computation throughout the process. Before an AI task starts, the training data in object storage BOS can be warmed up and loaded into the data lake storage acceleration RapidFS through the elastic RDMA network. The elastic RDMA network reduces communication latency by a factor of 2 to 3 compared with traditional networks and, on top of high-performance storage, accelerates the reading of AI task data. Finally, AI task computation is performed on the high-performance Baidu Taihang elastic bare-metal servers (BBC) or cloud servers (BCC).

Elastic Fault Tolerance

Running an AI task requires not only high-performance resources but also cluster stability, minimizing resource failures so that training is not interrupted. However, resource failures cannot be absolutely avoided, so the AI framework and the training cluster need to jointly ensure that a training task can recover from its most recent state after being interrupted, thereby providing a reliable environment for long-term large model training.
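
A minimal sketch of the recover-from-latest-state idea, using generic PyTorch-style periodic checkpointing (an illustration only, not Baidu Baige's elastic fault-tolerance mechanism; the checkpoint path and interval are hypothetical):

```python
# Illustrative periodic checkpoint/resume loop. On restart, training resumes
# from the most recent checkpoint instead of step 0.
import os
import torch

CKPT = "checkpoint.pt"  # hypothetical path, e.g. on shared PFS/RapidFS storage

def save_state(step, model, optimizer):
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, CKPT)

def load_state(model, optimizer):
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1

def train(model, optimizer, total_steps, ckpt_every=1000):
    start = load_state(model, optimizer)        # resume from the latest state
    for step in range(start, total_steps):
        # ... forward / backward / optimizer.step() ...
        if step % ckpt_every == 0:
            save_state(step, model, optimizer)  # periodic checkpoint
```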

Baidu's self-developed heterogeneous collective communication library ECCL supports communication between Kunlun cores and other heterogeneous chips, as well as detection of slow nodes and faulty nodes. Through Baidu Baige's resource elasticity and fault-tolerance strategies, slow and faulty nodes are removed, and the latest architecture topology is fed back to PaddlePaddle, which rearranges the tasks and allocates the corresponding training tasks to other XPUs/GPUs, ensuring that training runs smoothly and efficiently.

6. AI inclusiveness in the era of large models

Large models are a milestone technology on artificial intelligence's path toward general intelligence, and mastering them well is a question that must be answered on the road to intelligent upgrading. Ultra-large-scale computing power and full-stack integrated software optimization are the best answer to this question.

To help society and industry quickly train their own large models and seize the opportunity of the times, Baidu Intelligent Cloud released the Yangquan Intelligent Computing Center at the end of 2022. Equipped with the full-stack capabilities of Baidu's "AI Big Base", it can provide 4 EFLOPS of heterogeneous computing power and is currently the largest and most technologically advanced data center in Asia.

Currently, Baidu Intelligent Cloud has opened all the capabilities of the "AI Big Base" to the outside world, realizing AI inclusiveness in the era of large models. These capabilities are delivered in various forms, including central clouds in various regions, the edge cloud BEC, local computing clusters LCC, and the private cloud ABC Stack, making it easy for society and industry to obtain intelligent services.
