Table of Contents
What makes automatic parallelization of large models difficult
Galvatron: efficient automatic parallelization of large models in one click
Key technologies
Experimental results
About the PKU Hetu Team

PKU Hetu team releases Galvatron: efficient automatic parallelization of large models in one click

Apr 11, 2023, 09:10 PM
Tags: model, paper

Recently, "large models" have been shining across a wide range of AI application scenarios. Among them, large-scale pre-trained models based on the Transformer are the most typical, and the Transformer has become the core architecture of today's foundation models: the BERT and GPT series in NLP, the ViT and Swin Transformer series in CV, as well as the recently popular mixture-of-experts model MoE and the multi-modal model CLIP, all use the Transformer as their core backbone. Such dense large models often have billions, tens of billions, or even trillions of parameters; they incur high computation, storage, and communication overhead and pose huge challenges to AI infrastructure.

To support the training of large models, many tools have been developed (such as NVIDIA's Megatron, Microsoft's DeepSpeed, and Meta's FairSeq) that implement various parallel methods: data parallelism, tensor model parallelism, pipeline parallelism, sharded data parallelism, and so on. These systems encapsulate the above parallel methods well, hide the corresponding implementation details, and allow users to build hybrid parallel strategies simply by adding configuration.

Building on these ideas, much work has focused on how to express the various parallel methods at the tensor or operator level. The "automation" in this line of work is mainly reflected in the translation from a parallel API to the execution layer. However, merely designing parallel APIs or intermediate representations does not fundamentally solve the distributed-training problem: the most visible consequence is that users are still not freed from the burden of distributed deployment. In fact, the distributed deployment of large models is a very complex problem. Most current distributed training systems rely on users' repeated manual trials and on the experience of system experts, which leads to seriously low resource utilization and leaves a considerable gap to true "automatic parallelism".

Against this background, the PKU Hetu team proposed Galvatron, a distributed training tool that achieves efficient automatic parallelization of large models. The research paper was accepted at VLDB 2023, a top international conference.


  • Paper address: https://arxiv.org/abs/2211.13878
  • Project code link: https://github.com/PKU-DAIR/Hetu/tree/main/tools/Galvatron

What makes automatic parallelization of large models difficult

Researchers believe that the difficulties in automatic parallelization of large models are mainly reflected in the following three aspects:

(1) Diversity: First, in terms of parallel methods, current approaches to parallelizing large models are blooming. Even for the same operator, and even without considering hybrid combinations, different basic parallel methods differ significantly in memory overhead, communication cost, and computational efficiency. The figure below shows the distributed execution of a simple matrix multiplication operator on two GPUs under the four most important basic parallel methods: data parallelism, tensor parallelism, pipeline parallelism, and sharded data parallelism.


Comparison diagram of parallel methods
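The contrast between these basic methods can be made concrete with a toy simulation in plain Python (an illustration on two simulated "devices", not Galvatron code): under data parallelism each device holds a full replica of the weights and a slice of the input batch, while under tensor parallelism each device holds a slice of the weights and sees the full batch; both reconstruct the same result.

```python
# Toy illustration: one matmul Y = X @ W executed under data parallelism
# (DP) vs. tensor parallelism (TP) on two simulated devices.

def matmul(a, b):
    """Plain dense matmul on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

X = [[1, 2], [3, 4], [5, 6], [7, 8]]   # activations: 4 samples x 2 features
W = [[1, 0, 1], [0, 1, 1]]             # weights: 2 x 3

# Data parallelism: split X by rows (samples); each device keeps a full
# replica of W and computes its own shard of the output.
dp_dev0 = matmul(X[:2], W)
dp_dev1 = matmul(X[2:], W)
Y_dp = dp_dev0 + dp_dev1               # concatenate along the sample dim

# Tensor parallelism: split W by columns; each device sees all samples
# but produces only a slice of the output features.
W_cols0 = [[row[0]] for row in W]              # first output column
W_cols1 = [[row[1], row[2]] for row in W]      # remaining columns
tp_dev0 = matmul(X, W_cols0)
tp_dev1 = matmul(X, W_cols1)
Y_tp = [r0 + r1 for r0, r1 in zip(tp_dev0, tp_dev1)]  # concat features

assert Y_dp == Y_tp == matmul(X, W)
```

The two executions produce identical results but shard different tensors, which is exactly why their memory and communication profiles differ.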

Second, in terms of models, a variety of model architectures have emerged recently, often with different configurations (input sequence length, number of layers, hidden width, etc.), resulting in different computational loads. In addition, on the hardware side, users often face highly differentiated cluster environments with different memory capacities, communication bandwidths, and compute capabilities. Because of all this diversity, no single parallel technique can always achieve the best training efficiency, and "automatic parallelism" has become the core challenge of distributed training.

(2) Complexity: The above analysis is already a simplification. In reality, even a single operator can apply several different basic parallel methods simultaneously, and combining these basic methods into hybrid parallel strategies makes the problem far more complicated. More importantly, the computation graph of a large model usually has a very large structure and requires a correspondingly large cluster; exploring options for every operator (including selecting appropriate compute resources in the cluster and designing the corresponding hybrid parallel method) leads to a combinatorial explosion of the search space, making it very hard to find the optimal distributed execution plan for the whole model.

(3) Practicality: Practicality is also crucial. On the one hand, during automatic parallel search, the memory, communication, and computation overheads of the various candidate distributed execution plans must be estimated fairly accurately; otherwise the results will deviate too much from actual execution, yielding suboptimal or even unusable plans. This requires a sufficiently accurate cost model for different model structures and hardware conditions. On the other hand, the extra time introduced by the system's automatic parallel capability must stay within an acceptable range; an excessively high search cost is also unacceptable.

Galvatron: efficient automatic parallelization of large models in one click

System features:

To address these problems, researchers have proposed a series of works exploring automatic search of hybrid parallelism. One line of work focuses on search spaces that combine data parallelism and model parallelism, with representative works FlexFlow and Tofu; another line originates in pipeline parallel scenarios and combines pipeline parallelism with data parallelism, with representative works PipeDream and DAPPLE. Building on these, derivative works such as Unity and Alpa further expand the scope of automatic parallel exploration. Galvatron, proposed by the PKU Hetu team, also belongs to this field of automatic parallel search, but compared with existing work it has the following three main advantages:

(1) In terms of diversity, the parallel dimensions supported by existing work are still relatively limited. Galvatron not only supports more parallel dimensions, but also accurately models a wider range of Transformer model structures, and its adaptive tuning capability has been verified under different cluster hardware conditions.


Comparison diagram of large model distributed training system

(2) In terms of complexity, precisely because of its advantage in diversity, Galvatron faces an unprecedentedly large search space. To handle it, the researchers distilled several important observations about current large-model distributed training, verified them experimentally or theoretically, and used them as pruning criteria for the search space, enabling efficient optimization of the distributed execution plan.

(3) In terms of practicality, this work combines the strengths of theoretical modeling and experimental measurement to achieve accurate estimates of memory, communication, and computation overhead. It even accounts for the reduced GPU execution efficiency caused by overlapping computation and communication, ensuring sufficiently accurate automatic parallel optimization results.

In addition, Galvatron uses PyTorch as its underlying execution engine and is compatible with mainstream Transformer implementations such as Huggingface, so it imposes no extra burden on PyTorch users. It also requires no additional installation or debugging cost: users only need to add a few lines of code, and the entire automatic parallelization process can be completed with ease.

Galvatron workflow and user interface display

Key technologies

1. Search space decomposition based on decision tree

The design goal of Galvatron is to efficiently and automatically search a complex, huge space of parallel strategies and generate the optimal parallel execution plan for a given Transformer model and distributed environment. In terms of search space, Galvatron is the industry's first automatic parallel training system to consider the four mainstream parallel methods: data parallelism (DP), sharded data parallelism (SDP), tensor parallelism (TP), and pipeline parallelism (PP). Since a hybrid strategy may combine any of these four methods, the resulting search space is very large in multi-GPU scenarios. For example, in a scenario with two machines and four GPUs in total, one feasible strategy is 2-way TP within each machine and 2-way PP across machines; another is 2-way PP within each machine and 2-way DP across machines. When a node is expanded to 8 GPUs, each layer of the model already has hundreds of candidate strategies, and as the number of layers grows, the search space expands exponentially, making effective exploration difficult.

To efficiently search such a huge space, the study first makes the following guiding observations:

  • Takeaway #1: PP tends to be placed across device islands. Here a "device island" is a group of devices with high internal bandwidth. In most Transformer models, the communication volume of PP is significantly lower than that of the other parallel methods, so it is usually preferable to slice the model with PP and place the stages across device islands.
  • Takeaway #2: Given homogeneous devices, parallel strategies tend to divide the devices evenly. For example, 2-way DP over 4 GPUs tends to split them into two groups of two, rather than one group of one and one group of three. In this case, the optimal hybrid parallel strategy within one device group is identical to the optimal strategy within the other groups.
  • Takeaway #3: In general, when DP and SDP can be mixed, using only SDP is theoretically better. Analysis shows that the communication and memory overhead of N-way SDP is lower than that of any combination of N1-way DP with N2-way SDP, where N1 × N2 = N and N1, N2 > 1.
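The memory side of Takeaway #3 can be sanity-checked with a back-of-the-envelope model (a sketch under simplified ZeRO-style assumptions; the byte counts, model size, and device count below are illustrative assumptions, not figures from the paper): model states are sharded only along the SDP dimension and replicated along the DP dimension, so pure N-way SDP minimizes per-device state memory over any N1-way DP × N2-way SDP mixture.

```python
# Assumed constants (illustrative): mixed-precision Adam training keeps
# roughly 16 bytes of model state per parameter; 1B-param model; 8 GPUs.
BYTES_PER_PARAM = 16
PARAMS = 1_000_000_000
N = 8

def per_device_state_bytes(dp_degree, sdp_degree):
    """DP replicates model states across its replicas; SDP shards them."""
    assert dp_degree * sdp_degree == N
    return PARAMS * BYTES_PER_PARAM / sdp_degree

pure_sdp = per_device_state_bytes(1, N)                      # 8-way SDP
mixed = [per_device_state_bytes(n1, N // n1) for n1 in (2, 4)]
assert all(pure_sdp <= m for m in mixed)
print(f"pure 8-way SDP: {pure_sdp / 2**30:.2f} GiB of states per device")
```

Under these assumptions, any DP factor shrinks the SDP sharding degree and therefore raises per-device memory, matching the takeaway.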

Based on the above important observations, this study proposes a search space construction method based on decision trees:

(1) Given a Transformer model, based on Takeaway #1 and Takeaway #2, Galvatron first uses PP to divide the model into multiple stages, while dividing the devices evenly and contiguously into multiple device groups. For example, in the 8-GPU scenario, the model can be split with 1/2/4/8-way PP, corresponding to device groups of size 8/4/2/1 respectively.

(2) Each PP split corresponds to one decision tree and one sub-search space. The total number of leaf nodes of the decision tree equals the device group size, and the height of the tree equals the number of available parallel methods; that is, each level of the decision tree can apply one parallel method.

(3) Parallel strategies cannot be reused between different levels of the decision tree.

(4) The degrees of non-leaf nodes are by default chosen from the powers of two {2, 4, 8, …}.

With these construction rules, the decision trees built by Galvatron can represent any combination of the above parallel methods. Takeaway #1 and Takeaway #2 help Galvatron avoid inefficient parallel combinations and shrink the search space. For training a single-layer model on 8 GPUs, the rules above produce 34 candidate hybrid parallel strategies. After further applying Takeaway #3 to prune cases where DP and SDP appear in the same decision tree, the number of 8-GPU candidate strategies drops to 22.
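The counts of 34 and 22 can be reproduced directly from the rules above (a sketch of the construction for one model layer on 8 GPUs): each PP choice leaves a device group of size 8/4/2/1; within a group, tree levels take degrees from {2, 4, 8} whose product equals the group size, and each level is assigned a distinct method from {DP, SDP, TP}.

```python
from itertools import permutations

METHODS = ("DP", "SDP", "TP")

def degree_sequences(group_size):
    """All ordered factorizations of group_size into factors from {2,4,8}."""
    if group_size == 1:
        return [()]
    seqs = []
    for d in (2, 4, 8):
        if d <= group_size and group_size % d == 0:
            seqs += [(d,) + rest for rest in degree_sequences(group_size // d)]
    return seqs

def strategies(group_size, prune_dp_sdp=False):
    """Enumerate decision trees: distinct methods, one per level."""
    out = []
    for degs in degree_sequences(group_size):
        for meths in permutations(METHODS, len(degs)):
            if prune_dp_sdp and "DP" in meths and "SDP" in meths:
                continue  # Takeaway #3: never mix DP and SDP in one tree
            out.append(tuple(zip(meths, degs)))
    return out

total = sum(len(strategies(g)) for g in (8, 4, 2, 1))
pruned = sum(len(strategies(g, prune_dp_sdp=True)) for g in (8, 4, 2, 1))
print(total, pruned)   # 34 22
```

Running the enumeration yields exactly the 34 and 22 candidates stated above, confirming that the pruning rules, not ad-hoc trimming, produce the reduced space.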

The following figure shows schematic diagrams of the decision trees under different PP degrees (8/4/2/1) in the 8-GPU scenario.


Schematic diagrams of the decision trees under different PP degrees (8/4/2/1) in the 8-GPU scenario

2. Parallel optimization algorithm based on dynamic programming

Existing systems such as Megatron and DeepSpeed usually require the user to specify a global parallel scheme and its corresponding parallel degrees, which severely limits their ability to express distributed execution plans. The optimization goal of Galvatron is to automatically generate the optimal distributed execution plan, given only the model definition and the distributed environment, without any parallel configuration from the user. Concretely, given an L-layer model M and N GPU devices with memory capacity E, Galvatron searches for the parallel plan with the highest system throughput T_pt and returns it; here a parallel plan is a fine-grained hybrid parallel strategy with the layer (or operator) as its basic unit.


Algorithm 1: Galvatron optimization process

Optimization process: Galvatron's optimization process is shown in Algorithm 1. The outermost loop gradually increases the searched batch size until it exceeds device memory. For each candidate batch size B, Galvatron first splits the model with PP according to Takeaway #1, searching over different degrees of parallelism P (line 4). After selecting P-way PP, the model is divided into P stages (line 6) and all devices are divided into P groups of N/P devices each. Galvatron then builds the corresponding decision trees, which represent every combination of DP, SDP, and TP without duplication or omission, yielding the strategy set S. Next, for each model stage M_i, under the device memory limit E, Galvatron uses dynamic programming to search for the optimal hybrid parallel strategy of each layer and returns the minimal time cost (line 9). Finally, Galvatron selects the strategy with the highest throughput over all PP degrees and batch sizes and returns it (line 15).

Dynamic programming search: We now introduce the dynamic programming search algorithm in Galvatron's parallel optimization workflow. For a given model stage containing L layers, the cost function C(L, E) denotes the total execution time of the L-layer model under device memory limit E, and c(L, S_j) denotes the execution time of the L-th layer using strategy S_j, where S_j is drawn from the parallel strategy candidate set S. Setting the initial value C(0, E) = 0, Galvatron's dynamic programming search follows the state transition equation (Formula 1):

C(L, E) = min_{S_j ∈ S} { C(L − 1, E − O(L, S_j)) + c(L, S_j) + R(L, S_i, S_j) }

Here O(L, S_j) is the memory overhead of the L-th layer under strategy S_j, and R(L, S_i, S_j) is the transformation cost incurred when the L-th layer uses strategy S_j while the previous layer uses strategy S_i. During the state transition, the cost function C returns infinity whenever the memory overhead exceeds the device memory limit E.

Complexity analysis: The computational complexity of the dynamic programming search (Formula 1) used by Galvatron is O(LE|S|). It can be seen that the size of the search space S of each layer is very important to the overall search complexity. The search space decomposition based on the decision tree proposed by Galvatron can significantly reduce the search space and control the search overhead within a reasonable range.
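A minimal sketch of this dynamic program on toy costs can make the recurrence concrete (the two-strategy set and all numbers below are hypothetical, not Galvatron's profiled values; the state tracks memory consumed so far and the current layer's strategy, so the switch cost R can be charged). A brute-force enumeration over all strategy sequences verifies the recurrence:

```python
from itertools import product

S = ["DP", "TP"]                       # hypothetical per-layer strategies
O = {"DP": 2, "TP": 1}                 # toy memory overhead per layer
c = {"DP": 1.0, "TP": 3.0}             # toy execution time per layer
R = lambda si, sj: 0.0 if si == sj else 0.5   # strategy-switch cost

def dp_search(L, E):
    """Min total time for L layers under memory budget E (inf if infeasible)."""
    INF = float("inf")
    best = {(0, None): 0.0}            # (memory used, last strategy) -> cost
    for _ in range(L):
        nxt = {}
        for (e, si), cost in best.items():
            for sj in S:
                e2 = e + O[sj]
                if e2 > E:
                    continue           # out of memory -> prune (cost = inf)
                r = 0.0 if si is None else R(si, sj)
                key = (e2, sj)
                nxt[key] = min(nxt.get(key, INF), cost + c[sj] + r)
        best = nxt
    return min(best.values(), default=INF)

def brute_force(L, E):
    """Check: enumerate every strategy sequence of length L."""
    bst = float("inf")
    for plan in product(S, repeat=L):
        if sum(O[s] for s in plan) > E:
            continue
        t = sum(c[s] for s in plan)
        t += sum(R(a, b) for a, b in zip(plan, plan[1:]))
        bst = min(bst, t)
    return bst

for L, E in [(1, 2), (3, 5), (4, 6), (4, 3)]:
    assert dp_search(L, E) == brute_force(L, E)
```

The dynamic program visits each (layer, memory, strategy) state once, while the brute force is exponential in L, which is exactly the saving the complexity analysis above describes.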

3. Execution cost estimation method based on hybrid modeling

Galvatron uses a strategy cost estimation module to estimate the computation, communication, and memory cost of each hybrid parallel strategy. Existing cost estimation methods are mainly profiling (measurement) and simulating (simulation); Galvatron draws on the strengths of both and designs an inexpensive, efficient, and accurate cost estimation approach. Specifically, for memory overhead, Galvatron estimates from the shape and data type of each tensor; for computation time, it profiles the per-sample computation time on a single device and combines it with the batch size and a fitting function to estimate the overall computation time; for communication time, it divides the communication volume by the device communication bandwidth, where the volume is derived theoretically and the bandwidth is measured by profiling.
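The three estimators can be sketched as simple functions (all shapes, profiled coefficients, and bandwidths below are illustrative assumptions, not measurements from the paper):

```python
def memory_bytes(shape, dtype_bytes=2):
    """Analytic memory estimate from tensor shape and dtype (fp16 assumed)."""
    n = 1
    for d in shape:
        n *= d
    return n * dtype_bytes

def compute_time_ms(batch_size, per_sample_ms=0.8, overhead_ms=1.5):
    """Linear fit: profiled per-sample time scaled by batch size."""
    return overhead_ms + per_sample_ms * batch_size

def comm_time_s(volume_bytes, bandwidth_gbps=100):
    """Theoretical communication volume / profiled bandwidth."""
    return volume_bytes * 8 / (bandwidth_gbps * 1e9)

act_mem = memory_bytes((16, 512, 1024))          # fp16 activation tensor
step_ms = compute_time_ms(batch_size=16)
grad_sync_s = comm_time_s(volume_bytes=2 * 100e6)  # fp16 grads, 100M params

assert act_mem == 16 * 512 * 1024 * 2            # 16 MiB
assert step_ms == 1.5 + 0.8 * 16
```

In Galvatron these per-layer estimates additionally feed a simulation step that accounts for compute/communication overlap, as described next.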

Based on these estimates, Galvatron computes the cost c(l, s) of a given layer under a given strategy by simulating the execution process. Unlike the cost models of existing distributed training systems, Galvatron is the first to model the GPU performance degradation caused by overlapping computation and communication. The study found experimentally that this overlap-induced slowdown can noticeably affect system execution efficiency, something ignored in previous work; as a result, Galvatron's cost estimates are more accurate and its parallel optimization is better.

Experimental results

Experimental settings: In the experiments, the researchers compared Galvatron with four baseline systems, each using a single parallel strategy (DP, SDP, TP, PP), and with expert-configured DeepSpeed 3D Parallelism. Two additional weakened versions of Galvatron were set up as auxiliary baselines, performing automatic parallel search within restricted strategy combination spaces (TP+DP and PP+DP). The study selected the NLP Transformer models BERT and T5 and the CV Transformer models ViT and Swin Transformer as experimental subjects.


Throughput comparison between Galvatron and the baseline systems under 8 GPUs with a 20 GB memory limit

Experimental comparison: The study first ran experiments in an environment with eight NVIDIA RTX TITAN 24 GB GPUs. The experiments show that, under different model sizes and different memory constraints, Galvatron always achieves the optimal throughput, improving significantly over both the state-of-the-art single parallel methods and hybrid parallel methods. Specifically, on the ViT model, Galvatron's throughput speedup reaches up to 338% over single strategies and up to 55% over other hybrid parallel strategies; on the other three models, it achieves speedups of up to 200%-334% over single strategies and 28%-52% over existing hybrid strategies.


Illustration of some optimal parallel strategies found by Galvatron's search

Interpretability experiment: The study showcases some of the optimal parallel strategies found by Galvatron. For the BERT model with 8 GB of memory (Case A), Galvatron chose two hybrid parallel strategies, PP-TP-DP and PP-TP-SDP; when the available memory increased to 12 GB, Galvatron abandoned PP in favor of more DP, and introduced SDP to save memory. The situation differs slightly on Swin Transformer, where different layers exhibit obvious heterogeneity: when memory is relatively scarce (Case C), shallow layers use a higher degree of SDP; as the depth increases, the activations become smaller and the parameters larger, so TP gradually replaces SDP. When memory increases (Case D), not only is PP re-enabled to replace part of the inefficient SDP, but shallow layers also show a clearer preference for DP.

Scalability experiment: The study further tested Galvatron on larger clusters, including an environment with 16 NVIDIA RTX TITAN GPUs and one with 64 NVIDIA A100 GPUs. In the 16-GPU environment, Galvatron again achieved the highest throughput among all strategies; compared with the 8-GPU results under the same memory limit, the richer set of hybrid parallel strategies allows Galvatron to obtain more than a 2x speedup on 16 GPUs. In the 64-GPU experiment, Galvatron also achieved the highest throughput among the compared strategies. This demonstrates Galvatron's good scalability; detailed results can be found in the original paper.

About the PKU Hetu Team

The Hetu development team comes from the Data and Intelligence Research Lab at Peking University, led by Professor Cui Bin of the School of Computer Science. Over the years, the lab has conducted cutting-edge research in artificial intelligence and big data, achieving many results in theoretical and technological innovation as well as system development, and publishing more than 100 academic papers at top international conferences and journals.

Hetu is a distributed deep learning system for very large models. Compared with existing mainstream distributed deep learning frameworks, it offers advantages in system functionality, system complexity, and ease of use, with many innovative contributions such as automatic distributed parallel strategies, consistency protocols and communication architectures, and GPU operator optimization. The Hetu team has pursued academic innovation across a variety of distributed machine learning and deep learning scenarios, with results accepted at top international conferences including SIGMOD, VLDB, ICML, and KDD; among them, the sparse large-model distributed training system HET won the VLDB 2022 Best Scalable Data Science Paper Award. Galvatron, the paper accepted at VLDB 2023, is the Hetu team's latest breakthrough in dense large-model distributed training scenarios; it has been integrated into the Hetu system and open-sourced. The Hetu team has also carried out research collaborations and deployments with well-known companies such as Tencent, Alibaba, Kuaishou, and ByteDance.


