
DualBEV: significantly surpassing BEVFormer and BEVDet4D


This paper addresses the problem of accurately detecting objects in autonomous driving, and in particular how to transform features from the perspective view (PV) into the bird's-eye view (BEV), a transformation implemented by the view transformation (VT) module. Existing methods broadly follow two strategies: 2D-to-3D and 3D-to-2D conversion. 2D-to-3D methods lift dense 2D features by predicting depth probabilities, but the inherent uncertainty of depth prediction, especially in distant regions, can introduce inaccuracies. 3D-to-2D methods usually use 3D queries to sample 2D features and learn the attention weights for the correspondence between 3D and 2D features with a Transformer, which increases computational and deployment complexity.
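
In schematic form (the notation below is ours, not the paper's), the two strategies can be summarized as follows: 2D-to-3D lifting spreads each pixel feature along predicted depth bins, while 3D-to-2D querying gathers image features at the projections of 3D reference points.

```latex
% 2D-to-3D (LSS-style): lift each pixel feature along predicted depth bins
F_{3D}(u, v, d) = P_{depth}(d \mid u, v)\, F_{2D}(u, v)

% 3D-to-2D (query-based): a 3D query q attends to image features sampled
% at the projections \pi(p_j) of its reference points
F_{BEV}(q) = \sum_{j} A(q, p_j)\, F_{2D}(\pi(p_j))
```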


The paper points out that existing methods such as HeightFormer and FB-BEV try to combine these two VT strategies, but because the two transformations behave differently, such methods usually adopt a two-stage scheme that is limited by the quality of the initial features, hindering seamless fusion between the dual VTs. Moreover, these methods still struggle to meet the real-time deployment requirements of autonomous driving.

In response, the paper proposes a unified feature transformation that applies to both 2D-to-3D and 3D-to-2D view conversion, using three probability measurements to evaluate the correspondence between 3D and 2D features: BEV probability, projection probability, and image probability. The new method aims to alleviate the effect of empty regions in the BEV grid on feature construction, distinguish among multiple correspondences, and exclude background features during the transformation.
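
Schematically, and again in our own notation rather than the paper's exact formulation, the unified transformation weights each sampled image feature by the product of the three probabilities:

```latex
% Contribution of image features to BEV cell p (schematic): p'_j are the
% projected 2D sampling positions associated with p
F_{BEV}(p) = P_{bev}(p) \sum_{j} P_{proj}(p'_j)\, P_{img}(p'_j)\, F_{2D}(p'_j)
```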

Applying this unified feature transformation, the paper explores a new CNN-based method of 3D-to-2D view transformation called HeightTrans. Besides demonstrating superior performance, it shows the potential for acceleration through precomputation, making it suitable for real-time autonomous driving applications. At the same time, integrating the same feature transformation into the traditional LSS pipeline enhances it, demonstrating the transformation's generality for current detectors.

Combining HeightTrans and Prob-LSS, the paper introduces DualBEV, an innovative method that considers and fuses the correspondences from both the BEV and perspective views in a single stage, eliminating the dependence on initial features. In addition, a powerful BEV feature fusion module, the Dual Feature Fusion (DFF) module, is proposed; it further refines the BEV probability prediction using a channel attention module and a spatial attention module. DualBEV follows the principle of "extensive input, strict output", understanding and representing the probability distribution of the scene through precise dual-view probabilistic correspondences.

The main contributions of the paper are as follows:

  1. It reveals the inherent similarity between 3D-to-2D and 2D-to-3D view transformation and proposes a unified feature transformation that accurately establishes correspondences from both the BEV and perspective views, narrowing the gap between the two strategies.
  2. A new CNN-based 3D to 2D visual conversion method HeightTrans is proposed, which effectively and efficiently establishes accurate 3D-2D correspondence through probability sampling and pre-calculation of lookup tables.
  3. DFF is introduced for dual-view feature fusion. This fusion strategy captures the information of near and far regions in one stage, thereby generating comprehensive BEV features.
  4. Their efficient framework DualBEV achieves 55.2% mAP and 63.4% NDS on the nuScenes test set, even without using a Transformer, highlighting the importance of capturing accurate dual-view correspondences for view transformation.

Through these innovations, the paper proposes a new strategy to overcome the limitations of existing methods and achieve more efficient and accurate object detection in real-time application scenarios such as autonomous driving.

Detailed explanation of DualBEV


The method proposed in this paper addresses the BEV (bird's-eye view) object detection problem in autonomous driving through a unified feature transformation framework, DualBEV. Below is the main content of the methods section, outlining its sub-sections and key innovations.

DualBEV Overview

DualBEV's processing flow starts from image features obtained from multiple cameras, then uses SceneNet to generate instance masks and depth maps. Next, features are extracted and transformed through the HeightTrans module and the Prob-LSS pipeline; finally these features are fused and used to predict the probability distribution of the BEV space, producing the final BEV features for subsequent tasks.
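
The flow can be summarized in a few lines of pseudo-PyTorch. This is a minimal sketch of the description above; the module names, interfaces, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualBEVSketch(nn.Module):
    """Minimal sketch of the DualBEV processing flow (shapes illustrative)."""
    def __init__(self, scene_net, height_trans, prob_lss, dff, head):
        super().__init__()
        self.scene_net = scene_net        # predicts instance masks and depth maps
        self.height_trans = height_trans  # 3D-to-2D stream (HeightTrans)
        self.prob_lss = prob_lss          # 2D-to-3D stream (Prob-LSS)
        self.dff = dff                    # dual feature fusion + BEV probability
        self.head = head                  # downstream detection head

    def forward(self, img_feats):
        # img_feats: per-camera image features, e.g. (B, N_cam, C, H, W)
        inst_mask, depth = self.scene_net(img_feats)
        f_ht = self.height_trans(img_feats, inst_mask, depth)
        f_lss = self.prob_lss(img_feats, inst_mask, depth)
        bev_feat, bev_prob = self.dff(f_ht, f_lss)
        # weight the fused features by the predicted BEV probability
        return self.head(bev_feat * bev_prob)
```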

HeightTrans

HeightTrans is based on the principle of 3D-to-2D view transformation: it selects 3D positions from a predefined bird's-eye-view (BEV) map, projects them into image space, and evaluates the resulting 3D-2D correspondences. The method first samples a set of 3D points over the BEV map, then carefully weighs and filters these correspondences to generate BEV features. By adopting a multi-resolution sampling strategy and a probabilistic sampling scheme, HeightTrans strengthens attention on small objects and suppresses the misleading influence of background pixels; the problem of empty BEV grid cells is handled by introducing a BEV probability. The HeightTrans module is one of the key techniques proposed in the paper, and the following details how it works.

BEV Height

When processing height, HeightTrans adopts a multi-resolution sampling strategy covering the entire height range (from -5 meters to 3 meters): a resolution of 0.5 meters within the region of interest (ROI, defined as -2 meters to 2 meters) and 1.0 meters outside it. This strategy increases the focus on small objects that coarser sampling might miss. A sketch of the resulting sample points follows.
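
The snippet below generates the height anchors implied by this description; the exact bin edges are our reading of the text, not the authors' code.

```python
import numpy as np

def height_samples(h_min=-5.0, h_max=3.0, roi=(-2.0, 2.0),
                   fine=0.5, coarse=1.0):
    """Multi-resolution height sampling: fine steps inside the ROI,
    coarse steps over the rest of [h_min, h_max]."""
    below = np.arange(h_min, roi[0], coarse)         # coarse below the ROI
    inside = np.arange(roi[0], roi[1], fine)         # fine inside the ROI
    above = np.arange(roi[1], h_max + 1e-6, coarse)  # coarse above the ROI
    return np.concatenate([below, inside, above])

print(height_samples())
# thirteen anchors: -5, -4, -3, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 3
```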

Prob-Sampling

HeightTrans performs probability sampling in the following steps (a condensed code sketch follows the list):

  1. Define 3D sampling points: predefine a set of 3D sampling points, each defined by its position in 3D space.
  2. Project to 2D space: use the camera's extrinsic and intrinsic matrices to project each 3D point to a point in 2D image space, together with its depth.
  3. Feature sampling: use a bilinear grid sampler to sample the image features at each projected position.
  4. Apply the instance mask: to avoid projected positions falling on background pixels, use the instance mask generated by SceneNet as the image probability and apply it to the image features, reducing the influence of misleading information.
  5. Handle multiple correspondences: use a trilinear grid sampler on the depth map to evaluate cases where multiple 3D points map to the same 2D position, yielding the projection probability.
  6. Introduce the BEV probability: because empty BEV grid cells provide no useful information, introduce the BEV probability to represent the occupancy probability of each position in BEV space.
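
A condensed sketch of steps 1-6 for a single camera, assuming PyTorch; the tensor layouts, the depth normalization, and all names are our own illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def prob_sampling(feat, img_prob, depth_prob, bev_prob, pts3d, K, T):
    # feat:       (1, C, H, W) image features
    # img_prob:   (1, 1, H, W) instance-mask probability P_img from SceneNet
    # depth_prob: (1, 1, D, H, W) depth distribution used for P_proj
    # bev_prob:   (N,) occupancy probability P_bev (one 3D point per cell here)
    # pts3d:      (N, 3) predefined 3D sampling points
    _, C, H, W = feat.shape
    D = depth_prob.shape[2]

    # steps 1-2: project the predefined 3D points into the image plane
    cam = (T[:3, :3] @ pts3d.T + T[:3, 3:]).T        # world -> camera
    uvd = (K @ cam.T).T                              # camera -> image
    d = uvd[:, 2].clamp(min=1e-5)
    u, v = uvd[:, 0] / d, uvd[:, 1] / d              # pixel coordinates

    # normalize coordinates to [-1, 1] for grid_sample
    gu, gv = 2 * u / (W - 1) - 1, 2 * v / (H - 1) - 1
    grid2d = torch.stack([gu, gv], -1).view(1, 1, -1, 2)

    # steps 3-4: bilinear sampling of mask-weighted image features
    weighted = feat * img_prob
    f = F.grid_sample(weighted, grid2d, align_corners=True)  # (1, C, 1, N)

    # step 5: trilinear sampling of the depth volume gives P_proj
    gd = 2 * d / D - 1                               # crude depth normalization
    grid3d = torch.stack([gu, gv, gd], -1).view(1, 1, 1, -1, 3)
    p_proj = F.grid_sample(depth_prob, grid3d, align_corners=True).view(-1)

    # step 6: weight by the BEV occupancy probability
    return f.view(C, -1) * p_proj * bev_prob         # (C, N)
```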

Acceleration

By precomputing the indices of the 3D points in BEV space and fixing the image-feature and depth-map indices during inference, HeightTrans can accelerate the view-transformation process, yielding the final HeightTrans feature. A sketch of this lookup-table idea follows.
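
Because the 3D sampling points and camera parameters are fixed, the pixel and depth-bin indices can be built once offline and turned into pure lookups at inference time. The class below is our own illustration of that idea (nearest-neighbor indexing for simplicity), not the authors' code.

```python
import torch

class HeightTransLUT:
    """Precomputed lookup table for fixed 3D sampling points (illustrative)."""
    def __init__(self, pts3d, K, T, feat_hw, depth_bins):
        # offline: project once and store flat indices
        cam = (T[:3, :3] @ pts3d.T + T[:3, 3:]).T    # world -> camera
        uvd = (K @ cam.T).T                          # camera -> image
        d = uvd[:, 2].clamp(min=1e-5)
        u = (uvd[:, 0] / d).round().long().clamp(0, feat_hw[1] - 1)
        v = (uvd[:, 1] / d).round().long().clamp(0, feat_hw[0] - 1)
        self.pix_idx = v * feat_hw[1] + u            # flat pixel index per point
        self.depth_idx = d.round().long().clamp(0, depth_bins - 1)

    def gather(self, feat, depth_prob):
        # inference: pure index lookups, no projection math
        f = feat.flatten(2)[0, :, self.pix_idx]                     # (C, N)
        p = depth_prob.flatten(2)[0, self.depth_idx, self.pix_idx]  # (N,)
        return f * p
```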

Prob-LSS

Prob-LSS extends the traditional LSS (Lift, Splat, Shoot) pipeline, which projects each pixel into BEV space by predicting its depth probability. The method further integrates the BEV probability into the construction of the LSS features; the paper gives a formula for this, of which a schematic version appears below.
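
The original formula is not reproduced in this article; schematically, and in notation of our own choosing rather than the paper's, the Prob-LSS feature of a BEV cell can be written as:

```latex
% Schematic Prob-LSS feature for BEV cell p (notation ours): pixel features
% are splatted with their image and depth probabilities, and the cell is
% then weighted by its occupancy probability
F_{LSS}(p) = P_{bev}(p) \sum_{(u,v,d) \to p} P_{img}(u,v)\, P_{depth}(d \mid u,v)\, F_{2D}(u,v)
```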

Doing so can better handle the uncertainty in depth estimation, thereby reducing redundant information in the BEV space.

Dual Feature Fusion (DFF)

The DFF module is designed to fuse the features from HeightTrans and Prob-LSS while effectively predicting the BEV probability. By combining a channel attention module with a spatial-attention-enhanced ProbNet, DFF optimizes feature selection and BEV probability prediction, strengthening the representation of both near and distant objects. The fusion strategy exploits the complementarity of the two streams and improves the accuracy of the BEV probability by computing local and global attention. A minimal sketch follows.
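
Below is a minimal PyTorch sketch of this fusion; the layer sizes and the exact attention design are illustrative guesses, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DFFSketch(nn.Module):
    """Dual feature fusion sketch: channel attention over both streams,
    plus a spatial-attention branch that predicts the BEV probability."""
    def __init__(self, c):
        super().__init__()
        # channel attention over the concatenated streams (CAF-style)
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * c, 2 * c, 1),
            nn.Sigmoid(),
        )
        self.reduce = nn.Conv2d(2 * c, c, 3, padding=1)
        # spatial attention predicting a per-cell BEV probability (SAE-style)
        self.spatial_att = nn.Sequential(
            nn.Conv2d(c, 1, 7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, f_ht, f_lss):
        # f_ht, f_lss: (B, C, H_bev, W_bev) features from the two streams
        x = torch.cat([f_ht, f_lss], dim=1)
        x = x * self.channel_att(x)            # re-weight channels of both streams
        bev_feat = self.reduce(x)
        bev_prob = self.spatial_att(bev_feat)  # (B, 1, H_bev, W_bev)
        return bev_feat * bev_prob, bev_prob
```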

In short, the DualBEV framework proposed in this paper achieves efficient evaluation and conversion of the correspondence between 3D and 2D features by combining HeightTrans and Prob-LSS, as well as an innovative dual feature fusion module. This not only bridges the gap between 2D to 3D and 3D to 2D conversion strategies, but also accelerates the feature conversion process through pre-computation and probability measurement, making it suitable for real-time autonomous driving applications.

The key to this method is the precise correspondence and efficient fusion of features from different viewing angles, thereby achieving excellent performance in BEV object detection.

Experiment


The variant of the DualBEV method (DualBEV*, with an asterisk) performs best under single-frame input, achieving 35.2% mAP and 42.5% NDS and surpassing the other methods in both accuracy and overall performance. On mAOE in particular, DualBEV* scores 0.542, the best among single-frame methods; however, its mATE and mASE are not significantly better than those of other methods.

When the input is increased to two frames, DualBEV's performance improves further, with mAP reaching 38.0% and NDS reaching 50.4%, the highest NDS among all listed methods, indicating that DualBEV exploits the additional temporal information to understand the scene more fully. Among multi-frame methods it is also strong on mATE, mASE, and mAAE, with an especially significant improvement in mAOE, showing its advantage in estimating object orientation.

These results show that DualBEV and its variants perform well on several important metrics, especially in the multi-frame setting, indicating better accuracy and robustness for the BEV object detection task. They also highlight the value of multi-frame data for improving the model's overall performance and estimation accuracy.


The following is an analysis of the results of each ablation experiment:

  • Adding ProbNet, HeightTrans, CAF (Channel Attention Fusion), SAE (Spatial Attention Enhanced), and other components progressively improves the baseline's performance.
  • The addition of HeightTrans significantly improves mAP and NDS, which shows that introducing height information into visual transformation is effective.
  • CAF further improves mAP, but slightly increases latency.
  • The introduction of SAE increased NDS to a maximum of 42.5%, and also improved mAP, indicating that the spatial attention mechanism effectively enhanced model performance.
  • The different probability measures (projection probability, image probability, and BEV probability) are added one by one in a comparative test.
  • The model achieved the highest mAP and NDS when all three probabilities were used simultaneously, indicating that the combination of these probabilities is critical to model performance.
  • Prob-Sampling attains a higher NDS (39.0%) than other VT operations at similar latency (0.32 ms), underscoring the performance advantage of probabilistic sampling.
  • The multi-resolution (MR) sampling strategy achieves similar or better performance than uniform sampling with the same number of sampling points.
  • By adding projection probability, image probability and BEV probability to the LSS process, Prob-LSS outperforms other LSS variants, improving mAP and NDS, showing the effectiveness of combining these probabilities.
  • Compared with the multi-stage Refine strategy, both the single-stage Add strategy and the DFF module achieve higher NDS, and DFF also slightly improves mAP, showing that as a single-stage fusion strategy DFF is beneficial for both efficiency and performance.

The ablation experiments show that components and strategies such as HeightTrans, the probabilistic measures, Prob-Sampling, and DFF are crucial to improving model performance. In addition, the multi-resolution sampling strategy over height also proves effective. These findings support the authors' claim that each technique presented in the methods section contributes positively to performance.

Discussion


This paper demonstrates the performance of its method through a series of ablation experiments. The results show that the proposed DualBEV framework and its components positively affect the accuracy of bird's-eye-view (BEV) object detection.

The paper gradually introduces the ProbNet, HeightTrans, CAF (Channel Attention Fusion), and SAE (Spatial Attention Enhanced) modules into the baseline model, showing significant improvements in both mAP and NDS. This demonstrates that each component plays an important role in the overall architecture. After introducing SAE in particular, the NDS score rises to its highest point of 42.5% while latency increases only slightly, showing that the method strikes a good balance between accuracy and latency.

The probabilistic ablation experimental results further confirm the importance of projection probability, image probability and BEV probability in improving detection performance. When these probabilities are introduced one by one, the mAP and NDS scores of the system improve steadily, demonstrating the importance of integrating these probabilistic measures into the BEV object detection task.

In the comparison of view transformation (VT) operations, the proposed Prob-Sampling shows lower latency and a higher NDS score than alternatives such as SCAda and Bilinear-Sampling, emphasizing its advantages in efficiency and performance. Furthermore, for height sampling, adopting the multi-resolution (MR) strategy instead of uniform sampling further improves the NDS score, demonstrating the value of considering information at different heights in the scene.

In addition, for feature fusion, the paper shows that the DFF method maintains a high NDS score while simplifying the model, meaning that fusing the dual-stream features in a single-stage process is effective.

However, although the proposed method performs well in many respects, each improvement also increases system complexity and computational cost. For example, each newly introduced component (such as ProbNet or HeightTrans) adds latency; while the increase is small, it may become a consideration in applications with real-time or low-latency requirements. Likewise, although the probabilistic measures improve performance, estimating them requires additional computation, potentially resulting in higher resource consumption.

The DualBEV method achieves remarkable results in improving the accuracy and overall performance of BEV object detection, particularly by combining recent advances in deep learning with view transformation techniques. These gains, however, come at the cost of slightly increased latency and resource consumption, and practical deployments need to weigh these factors case by case.

Conclusion

DualBEV performs well on the BEV object detection task, significantly improving accuracy and overall performance. By introducing probabilistic sampling, the height transformation, attention mechanisms, and a spatial-attention-augmented network, it improves multiple key metrics, especially bird's-eye-view (BEV) accuracy and scene understanding. The experiments show that the method is particularly effective for complex scenes and data from different viewpoints, which is crucial for autonomous driving and other real-time applications.
