Recently, NextEvo, the AI innovation R&D department of Ant Group, announced that it has fully open-sourced its AI Infra technology, which can substantially improve the efficiency of large-model training. According to the reported figures, the technology raises the share of training time spent on effective training to more than 95% and automates the training process, significantly improving the efficiency of AI research and development.
Picture: Ant Group’s automated distributed deep learning system DLRover is now fully open source
DLRover is a technical framework designed for large-scale distributed training. In many enterprises today, training jobs run in complex and varied hybrid deployment clusters. No matter how rough the "terrain", DLRover is designed to drive through it with ease.
The rapid development of large model technology in 2023 has driven explosive growth in engineering practice. How to manage data efficiently, optimize training and inference efficiency, and make full use of existing computing power have become key issues.
Training a large model at the 100-billion-parameter scale, such as GPT-3, once on a single GPU would take 32 years. Making full use of computing power during training is therefore very important. There are two ways to do this. First, squeeze more performance out of the GPUs already purchased so that they reach their full potential. Second, put to use computing resources that previously went unused, such as CPU and memory, which can be done through heterogeneous computing platforms.
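To put the 32-year figure in perspective, here is a back-of-the-envelope sketch. It assumes idealized linear scaling, which real clusters do not achieve. The function name, the GPU counts, and the `efficiency` parameter are illustrative assumptions and do not come from DLRover.

```python
def wall_clock_days(single_gpu_years, num_gpus, efficiency=1.0):
    """Idealized wall-clock estimate for one training run.

    Assumes the work divides evenly across GPUs; `efficiency` is a
    hypothetical factor in (0, 1] for the fraction of compute actually used.
    """
    if num_gpus <= 0 or not 0 < efficiency <= 1:
        raise ValueError("num_gpus must be positive and efficiency in (0, 1]")
    return single_gpu_years * 365 / (num_gpus * efficiency)


# The 32 single-GPU years quoted above, spread over a few example cluster sizes.
for gpus in (1, 256, 1024):
    print(gpus, "GPUs:", round(wall_clock_days(32, gpus), 1), "days")
```

Under these assumptions, halving the efficiency doubles the wall-clock time, which is why utilization matters as much as raw GPU count.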
DLRover has recently integrated the Flash Checkpoint (FCP) solution for checkpoint management during model training. Traditional checkpointing has several problems: saving takes a long time, frequent checkpoints eat into the time available for training, and infrequent checkpoints mean that a lot of progress is lost on recovery. With FCP, when training a 100-billion-parameter model, the training time wasted on checkpoints is reduced by a factor of about 5, and the persistence time by a factor of about 70. This raises the effective training time from 90% to 95%, a significant improvement in training efficiency with DLRover.
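The article does not describe how FCP works internally. One common way to shorten the checkpoint stall is to split a save into a fast in-memory snapshot followed by a slow write to storage in the background. The sketch below illustrates only that general idea using the Python standard library. `AsyncCheckpointer` and its methods are hypothetical names, not DLRover's API.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical two-phase checkpointer: snapshot in memory, persist in background."""

    def __init__(self, directory):
        self.directory = directory
        self._worker = None

    def save(self, step, state):
        # Phase 1 (blocks training briefly): copy the state into host memory.
        snapshot = copy.deepcopy(state)
        # Make sure the previous persist has finished so writes never overlap.
        self.wait()
        # Phase 2 (background): write the snapshot to storage while training continues.
        self._worker = threading.Thread(target=self._persist, args=(step, snapshot))
        self._worker.start()

    def _persist(self, step, snapshot):
        path = os.path.join(self.directory, f"step_{step}.pkl")
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp_path, path)  # atomic rename: readers never see a partial file

    def wait(self):
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    def load(self, step):
        with open(os.path.join(self.directory, f"step_{step}.pkl"), "rb") as f:
            return pickle.load(f)


# Usage: training keeps mutating `state` while step 10 is being persisted.
checkpointer = AsyncCheckpointer(tempfile.mkdtemp())
state = {"weights": [1.0, 2.0], "step": 10}
checkpointer.save(10, state)
state["weights"][0] = 99.0   # later training update; must not leak into the checkpoint
checkpointer.wait()
restored = checkpointer.load(10)
```

Because training only waits for the in-memory copy, checkpoints can be taken more often without the storage write time counting against training time.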
DLRover has also integrated three new optimizers. The optimizer is a core component of machine learning: it updates the neural network's parameters to minimize the loss function. Among them, Ant Group's AGD (Auto-switchable optimizer with Gradient Difference of adjacent steps) optimizer is 1.5 times faster than the traditional AdamW optimizer in large model pre-training tasks. AGD has been used in multiple scenarios within Ant Group with remarkable results, and the related paper was accepted at NeurIPS '23.
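For readers unfamiliar with what an optimizer step looks like, below is a minimal sketch of the standard AdamW update, the baseline mentioned above, applied to a single scalar parameter. It is not AGD, whose update rule the article does not describe. The toy loss, the hyperparameter values, and the function name `adamw_step` are assumptions chosen for illustration.

```python
import math


def adamw_step(param, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; returns (param, m, v)."""
    # Decoupled weight decay: shrink the parameter directly, not via the gradient.
    param = param - lr * weight_decay * param
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy problem: minimize loss(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 201):
    grad = 2 * (x - 3.0)
    x, m, v = adamw_step(x, grad, m, v, t)
```

After the loop, `x` has moved from its starting point toward the minimum at 3. A faster optimizer in this sense is one that reaches a given loss in fewer steps.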
Figure: In large model pre-training tasks, AGD achieves a 1.5x speedup over AdamW
As an automated distributed deep learning system, DLRover's "autonomous driving" function modules also include ATorch, a PyTorch distributed training extension library. For models with hundreds of billions of parameters trained on thousands of GPUs, it can bring training compute utilization to 60%, helping developers squeeze even more out of their hardware.
DLRover applies the concept of "ML for System" to make distributed training more intelligent. Its goal is to free developers entirely from the burden of resource allocation so that they can focus on model training itself. Even when no resource configuration is supplied, DLRover can still provide the optimal resource configuration for each training job.
It is understood that Ant Group continues to invest in artificial intelligence technology. It recently established NextEvo, an internal AI innovation R&D department responsible for all core technology R&D for Ant AI, including all R&D work on the Bailing model. This covers core technologies such as AI algorithms, AI engineering, NLP, and AIGC, as well as technology R&D and product innovation in areas such as multi-modal large models and digital humans.
At the same time, Ant Group has accelerated its pace of open-sourcing, filling relevant domestic technology gaps and promoting the rapid development of the artificial intelligence industry.
DLRover open source address: https://www.php.cn/link/cf372cbe6eae54c6a6dfb3ebbcdc3404