


Ant Group NextEvo fully open-sources its AI Infra technology, putting large model training on "autonomous driving"
Recently, NextEvo, Ant Group's AI innovation R&D department, announced that it is fully open-sourcing its AI Infra technology, which can greatly improve the efficiency of large model training. According to the published figures, the technology raises the share of effective training time to more than 95% and automates the training process, significantly improving the efficiency of AI research and development.
Picture: Ant Group’s automated distributed deep learning system DLRover is now fully open source
DLRover is a technical framework designed for large-scale distributed training. In many enterprises today, training jobs run in complex and varied hybrid deployment clusters. No matter how complex the environment, DLRover can handle it with ease, like a vehicle driving over rough terrain.
The rapid development of large model technology in 2023 brought explosive growth in engineering practice. Managing data efficiently, optimizing training and inference, and making full use of existing computing power have become key issues.
Training a model at the 100-billion-parameter scale, such as GPT-3, once on a single card would take about 32 years, so making full use of computing power during training is very important. There are two ways to do this. First, squeeze more performance out of the GPUs already purchased so that they reach their full potential. Second, put previously unused computing resources, such as CPU and memory, to work, which requires a heterogeneous computing platform.
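The 32-year figure is consistent with a simple back-of-the-envelope estimate. The sketch below is only an illustration: the article does not say which card or which formula was used, so the parameter and token counts (the publicly reported GPT-3 values), the assumed card (an NVIDIA A100 at its peak dense BF16 rate), and the common rule of thumb of about 6 floating-point operations per parameter per token are all assumptions.

```python
# Back-of-the-envelope training time for a GPT-3-scale model on one GPU.
# Every constant here is an assumption for illustration; the article states none of them.

PARAMS = 175e9        # GPT-3 parameter count
TOKENS = 300e9        # approximate number of training tokens reported for GPT-3
PEAK_FLOPS = 312e12   # assumed card: NVIDIA A100, peak dense BF16 rate in FLOP/s
SECONDS_PER_YEAR = 365 * 24 * 3600


def training_years(params, tokens, flops_per_second):
    """Estimate training time with the ~6 FLOPs per parameter per token rule."""
    total_flops = 6 * params * tokens
    return total_flops / flops_per_second / SECONDS_PER_YEAR


# Assumes the card runs at 100% of its peak rate, which real jobs do not reach.
single_card_years = training_years(PARAMS, TOKENS, PEAK_FLOPS)
print(f"Estimated single-card training time: {single_card_years:.1f} years")
```

Under these assumptions the estimate comes out at roughly 32 years, and any utilization below 100% makes it longer, which is why both the number of cards and how well each one is used matter.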
DLRover has recently integrated the Flash Checkpoint (FCP) solution for managing checkpoints during model training. Traditional checkpointing is slow: saving checkpoints frequently eats into the available training time, while saving them rarely means losing a large amount of work on recovery. With FCP, in training a 100-billion-parameter model, the training time wasted on checkpoints is reduced by a factor of about 5, and the persistence time by a factor of about 70. This raises the effective training time from 90% to 95%, a significant improvement in DLRover's training efficiency.
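The trade-off between frequent and infrequent checkpoints can be made concrete with a rough model. The sketch below is not DLRover's accounting. It assumes that each save blocks training for a fixed time, that failures arrive at a steady rate, and that a failure loses on average half a checkpoint interval plus a fixed restart time. The function name and all numbers are hypothetical.

```python
SECONDS_PER_DAY = 24 * 3600


def effective_fraction(interval_s, save_block_s, failures_per_day, restart_s):
    """Rough share of wall-clock time spent on useful training.

    interval_s:       training time between two checkpoints
    save_block_s:     time training is paused for each checkpoint save
    failures_per_day: assumed average number of job failures per day
    restart_s:        assumed fixed time to restart a job after a failure
    """
    # Share of time lost to pausing for saves.
    save_overhead = save_block_s / (interval_s + save_block_s)
    # On average a failure lands mid-interval, so half an interval is redone.
    loss_per_failure = interval_s / 2 + restart_s
    failure_overhead = failures_per_day * loss_per_failure / SECONDS_PER_DAY
    return 1.0 - save_overhead - failure_overhead


# Hypothetical numbers: slow saves force a long interval, fast saves allow a short one.
slow = effective_fraction(interval_s=3 * 3600, save_block_s=300,
                          failures_per_day=1, restart_s=600)
fast = effective_fraction(interval_s=600, save_block_s=5,
                          failures_per_day=1, restart_s=600)
print(f"slow checkpoints: {slow:.1%}, fast checkpoints: {fast:.1%}")
```

In this model, making each save cheaper lets a job checkpoint more often, which shrinks both terms at once: less time paused for saves and less work lost per failure.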
DLRover has also integrated three new optimizers. The optimizer is a core component of machine learning: it updates the parameters of a neural network to minimize the loss function. Among them, Ant's AGD (Auto-switchable optimizer with Gradient Difference of adjacent steps) is reported to be 1.5 times faster than the conventional AdamW optimizer in large model pre-training tasks. AGD has been used in multiple scenarios within Ant Group with notable results, and the related paper was accepted at NeurIPS '23.
Figure: In large model pre-training tasks, AGD achieves a 1.5x speedup over AdamW
DLRover is an automated distributed deep learning system, and its "autonomous driving" modules also include Atorch, a PyTorch extension library for distributed training. For models with hundreds of billions of parameters trained on thousands of cards, the computing power utilization rate of training can reach 60%, helping developers squeeze more out of their hardware.
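The article does not define how the computing power utilization rate is measured. One common definition is the ratio of the floating-point operations a job actually performs per second to the combined peak rate of its cards. The sketch below uses that definition with the rule of thumb of about 6 operations per parameter per token. The function name, the throughput, the card count, and the per-card peak rate are hypothetical values chosen for illustration, not figures from Atorch.

```python
def compute_utilization(params, tokens_per_second, num_cards, peak_flops_per_card):
    """Achieved training FLOP/s divided by the cluster's peak FLOP/s."""
    achieved = 6 * params * tokens_per_second      # ~6 FLOPs per parameter per token
    available = num_cards * peak_flops_per_card
    return achieved / available


# Hypothetical job: a 100-billion-parameter model on 1024 cards,
# each with an assumed peak rate of 312e12 FLOP/s.
example = compute_utilization(
    params=100e9,
    tokens_per_second=319_488,   # hypothetical measured throughput
    num_cards=1024,
    peak_flops_per_card=312e12,
)
print(f"utilization: {example:.0%}")
```

With this definition, the same model and cluster reach a higher utilization only by processing more tokens per second, so the rate is a direct measure of how much of the purchased hardware is doing useful work.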
DLRover applies the idea of “ML for System” to make distributed training more intelligent. Its goal is to free developers entirely from resource configuration so that they can focus on model training itself: even when a job specifies no resource configuration, DLRover can still provide an optimal configuration for it.
It is understood that Ant Group continues to invest in artificial intelligence technology. It recently established NextEvo, an internal AI innovation R&D department responsible for all core technology R&D for Ant AI, including all R&D work on the Bailing large model. Its scope covers core technologies such as AI algorithms, AI engineering, NLP, and AIGC, as well as technology R&D and product innovation in areas such as multi-modal large models and digital humans.
At the same time, Ant Group has accelerated its pace of open-sourcing, filling related technology gaps in China and promoting the rapid development of the artificial intelligence industry.
DLRover open source address: https://www.php.cn/link/cf372cbe6eae54c6a6dfb3ebbcdc3404