Ant Group recently announced the open-source release of ATorch, an extension library for accelerating distributed training of large models. ATorch aims to make deep learning training more efficient and intelligent through automatic, dynamic resource optimization and improved distributed training stability. In large model training, ATorch can raise the compute utilization of kilo-card (thousand-GPU) training for 100-billion-parameter models to 60%, which is like adding a powerful engine to a sports car. It will be an important tool for deep learning researchers and developers, helping them train and optimize large models more efficiently.
With the explosion of large generative models, the size of training datasets and model parameters has grown exponentially. To meet the training demands of these behemoths and to iterate models quickly, distributed training has become one of the standard solutions. In this field, deep learning frameworks such as PyTorch and TensorFlow are widely adopted for model construction and training. To better support large model training, a number of efforts have been made in the industry, one of which is Ant's open-source ATorch toolkit. ATorch provides deep learning frameworks such as PyTorch with functions and tools better suited to large model training, helping developers and researchers complete training tasks more efficiently. Open-sourcing this toolkit will further advance large model training and bring more opportunities and challenges to research and application fields.
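For context on what frameworks like PyTorch already provide and what acceleration libraries layer on top of, the following is a minimal data-parallel training loop using PyTorch's standard DistributedDataParallel. The model, data, and hyperparameters are placeholders for illustration only.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Expects the usual torchrun environment variables (RANK, WORLD_SIZE, LOCAL_RANK).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real large model would be sharded / wrapped differently.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()   # dummy loss for illustration
        optimizer.zero_grad()
        loss.backward()                  # gradients are all-reduced across ranks
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this is typically launched with `torchrun --nproc_per_node=<gpus> train.py`; libraries such as ATorch aim to add strategy search, fault tolerance, and memory optimizations on top of this kind of loop with minimal changes to user code.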
It is understood that ATorch adopts a layered architecture with clearly separated responsibilities and comprehensive functionality, offering developers a streamlined development experience and leading stability guarantees. Its core capabilities include a unified configuration interface for distributed optimization strategies, automatic distributed strategy search, automatic elastic fault tolerance, an efficient dynamic memory management library, and in-house optimizers for accelerated convergence. As a high-performance acceleration extension to the PyTorch framework, ATorch minimizes intrusion into user code and provides an easy-to-use, high-performance solution for kilo-card (thousand-GPU) training of large models with hundreds of billions of parameters.
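The "unified configuration interface for distributed optimization strategies" and "minimal code intrusion" described above suggest a single entry point that wraps an otherwise unmodified PyTorch model. The sketch below illustrates what such usage could look like; the `atorch.auto.auto_accelerate` import, its arguments, and the returned fields are assumptions inferred from this description rather than a verified API, so consult the ATorch repository for the actual interface.

```python
# Hypothetical sketch of a "minimal code intrusion" acceleration interface.
# Module path, function name, arguments, and return fields are assumptions,
# not the verified ATorch API.
import torch
from atorch.auto import auto_accelerate  # assumed entry point

model = torch.nn.Linear(1024, 1024)      # placeholder model

def optim_func(params):
    return torch.optim.AdamW(params, lr=1e-4)

def loss_func(batch, outputs):
    return outputs.pow(2).mean()          # dummy loss for illustration

# A single call is assumed to search for and apply a distributed strategy
# (parallelism, mixed precision, memory optimizations) and return wrapped
# objects that drop into an ordinary training loop.
status, result, strategy = auto_accelerate(
    model,
    optim_func=optim_func,
    loss_func=loss_func,
    load_strategy=None,   # None -> rely on automatic strategy search (assumed)
)
model, optimizer = result.model, result.optim
```

The point of such an interface is that the training loop itself stays ordinary PyTorch code, while the library decides how to distribute and optimize it behind the single configuration call.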
Recently, ATorch has achieved excellent results in training optimization for open-source large models. For example, it raised the kilo-card pre-training compute utilization of the open-source GLM-65b large model from Tsinghua University from 28.8% to 62%, raised the pre-training compute utilization of Meta's LLama2-70b large model from 42% to 60%, and raised the training compute utilization of Stable Diffusion, the multi-modal large model from the British AI company Stability AI, from 21.8% to 58.7%. In addition, ATorch performs well in kilo-card training stability: the proportion of daily time spent on pure training has increased to 95%, checkpoint (ckpt) saving is completed within 1 minute, and training restart takes as little as 5 minutes, reaching an industry-leading level.
Currently, ATorch has been integrated into Ant Group’s open source product DLRover, which is an intelligent distributed deep learning system built on cloud native technology. The addition of ATorch allows large model developers to focus more on the design of model architecture without having to deal with tedious engineering details, thereby improving training efficiency and intelligence.