How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?-AI-php.cn

One of the basic tasks of autonomous driving is three-dimensional target detection, and many methods are now based on multi-sensor fusion. So why is multi-sensor fusion needed? Whether it is lidar and camera fusion, or millimeter wave radar and camera fusion, the main purpose is to use the complementary connection between point clouds and images to improve the accuracy of target detection. . With the continuous application of Transformer architecture in the field of computer vision, attention mechanism-based methods have improved the accuracy of fusion between multiple sensors. The two papers shared are based on this architecture and propose novel fusion methods to make greater use of the useful information of their respective modalities and achieve better fusion.

TransFusion：

Main Contributions

Lidar and camera are two important three-dimensional target detection sensors in autonomous driving , but in sensor fusion, it mainly faces the problem of low detection accuracy caused by poor image stripe conditions. The point-based fusion method is to fuse lidar and cameras through hard association, which will lead to some problems: a) simply splicing point cloud and image features, in the presence of low-quality image features, the detection performance will be seriously degraded ;b) Finding hard correlations between sparse point clouds and images wastes high-quality image features and is difficult to align. To solve this problem, a soft association method is proposed. This method treats lidar and camera as two independent detectors, cooperating with each other and taking full advantage of the advantages of the two detectors. First, a traditional object detector is used to detect objects and generate bounding boxes, and then the bounding boxes and point clouds are matched to obtain a score for which bounding box each point is associated with. Finally, the image features corresponding to the edge boxes are fused with the features generated by the point cloud. This method can effectively avoid the decline in detection accuracy caused by poor image stripe conditions. At the same time,

This paper introduces TransFusion, a fusion framework for lidar and cameras, to solve the correlation problem between the two sensors. The main contributions are as follows:

Proposed a 3D detection fusion model based on transformer-based lidar and camera, which shows excellent robustness to poor image quality and sensor misalignment;
Introduced several simple yet effective adjustments for object queries to improve the quality of initial bounding box predictions for image fusion, and also designed an image-guided query initialization module to handle objects that are difficult to detect in point clouds;
Not only achieves advanced 3D detection performance in nuScenes, but also extends the model to 3D tracking tasks and achieves good results.

Detailed module explanation

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Figure 1 The overall framework of TransFusion

In order to solve the above image entries To solve the problem of differences and correlation between different sensors, a Transformer-based fusion framework - TransFusion is proposed. The model relies on standard 3D and 2D backbone networks to extract LiDAR BEV features and image features, and then consists of two layers of Transformer decoders: the first layer decoder uses sparse point clouds to generate initial bounding boxes; the second layer decoder converts the first layer The object query is combined with the image feature query to obtain better detection results. The spatial modulation attention mechanism (SMCA) and image-guided query strategy are also introduced to improve detection accuracy. Through the detection of this model, better image features and detection accuracy can be obtained.

Query Initialization

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

##LiDAR-Camera Fusion

If When an object contains only a small number of lidar points, only the same number of image features can be obtained, wasting high-quality image semantic information. Therefore, this paper retains all image features and uses the cross-attention mechanism and adaptive method in Transformer to perform feature fusion, so that the network can adaptively extract location and information from the image. In order to alleviate the spatial misalignment problem of LiDAR BEV features and image features coming from different sensors, a

spatial modulation cross-attention module (SMCA) is designed, which passes two dimensional coordinates around the two-dimensional center of each query projection. A dimensional circular Gaussian mask weights cross-attention.

Image-Guided Query Initialization

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Figure 2 Image-Guided Query Module

This module uses lidar and image information as object queries at the same time, by sending image features and lidar BEV features into the cross-attention mechanism network, projecting them onto the BEV plane, and generating fused BEV features. As shown in Figure 2, multi-view image features are first folded along the height axis as the key value of the cross attention mechanism network, and the lidar BEV features are sent to the attention network as queries to obtain the fused BEV features, which are used for heat map prediction. , and averaged with the lidar-only heat map Ŝ to get the final heat map Ŝ to select and initialize the target query. Such operations enable the model to detect targets that are difficult to detect in lidar point clouds.

Experiments

Datasets and Metrics

The nuScenes dataset is a large-scale automated system for 3D detection and tracking. Driving dataset, containing 700, 150 and 150 scenes, used for training, validation and testing respectively. Each frame contains a lidar point cloud and six calibration images covering a 360-degree horizontal field of view. For 3D detection, the main metrics are mean average precision (mAP) and nuScenes detection score (NDS). mAP is defined by BEV center distance rather than 3D IoU, and the final mAP is calculated by averaging distance thresholds of 0.5m, 1m, 2m, 4m for 10 categories. NDS is a comprehensive measure of mAP and other attribute measures, including translation, scale, orientation, velocity, and other box attributes. .

The Waymo dataset includes 798 scenes for training and 202 scenes for validation. The official indicators are mAP and mAPH (mAP weighted by heading accuracy). mAP and mAPH are defined based on 3D IoU thresholds, which are 0.7 for vehicles and 0.5 for pedestrians and cyclists. These metrics are further broken down into two difficulty levels: LEVEL1 for bounding boxes with more than 5 lidar points, and LEVEL2 for bounding boxes with at least one lidar point. Unlike nuScenes' 360-degree cameras, Waymo's cameras only cover about 250 degrees horizontally.

Training On the nuScenes data set, use DLA34 as the 2D backbone network of the image and freeze its weights, set the image size to 448×800; select VoxelNet as the 3D backbone network of the lidar . The training process is divided into two stages: the first stage only uses LiDAR data as input, and trains the 3D backbone 20 times with the first-layer decoder and FFN feedforward network to generate initial 3D bounding box predictions; the second stage trains the LiDAR-Camera The fusion and image-guided query initialization modules are trained for 6 times. The left image is the transformer decoder layer architecture used for initial bounding box prediction; the right image is the transformer decoder layer architecture used for LiDAR-Camera fusion.

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Figure 3 Decoder layer design

Comparison with state-of-the-art methods

First compare TransFusion and other SOTA The performance of the method on the 3D target detection task is shown in Table 1 below, which is the result in the nuScenes test set. It can be seen that the method has reached the best performance at the time (mAP is 68.9%, NDS is 71.7%). TransFusion-L only uses lidar for detection, and its detection performance is significantly better than previous single-modal detection methods, and even exceeds some multi-modal methods. This is mainly due to the new association mechanism and query initialization. Strategy. Table 2 shows the results of LEVEL 2 mAPH on the Waymo validation set.

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Table 1 Comparison with SOTA method in nuScenes test

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Table 2 LEVEL 2 on Waymo validation set mAPH

Robustness against harsh image conditions

Using TransFusion-L as the benchmark, different fusion frameworks are designed to verify the robustness. The three fusion frameworks are point-by-point splicing and fusion of lidar and image features (CC), point enhancement fusion strategy (PA) and TransFusion. As shown in Table 3, by dividing the nuScenes data set into day and night, the TransFusion method will bring greater performance improvement at night. During the inference process, the features of the image are set to zero to achieve the effect of randomly discarding several images in each frame. As can be seen in Table 4, when some images are unavailable during the inference process, the detection performance will decrease significantly. , where the mAP of CC and PA dropped by 23.8% and 17.2% respectively, while TransFusion remained at 61.7%. The uncalibrated sensor will also greatly affect the performance of 3D target detection. The experimental setting randomly adds a translation offset to the transformation matrix from the camera to the lidar, as shown in Figure 4. When the two sensors are offset by 1m, the mAP of TransFusion It only decreased by 0.49%, while the mAP of PA and CC decreased by 2.33% and 2.85% respectively.

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Table 3 mAP during the day and night

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Table 4 mAP under different numbers of images

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Figure 4 mAP

Ablation experiment with sensor misalignment

It can be seen from the results of Table 5 d)-f) that when there is no query initialization In this case, the detection performance drops a lot. Although increasing the number of training rounds and the number of decoder layers can improve the performance, it still cannot achieve the ideal effect. This also proves from the side that the proposed initialization query strategy can reduce the number of network layers. . As shown in Table 6, image feature fusion and image-guided query initialization bring mAP gains of 4.8% and 1.6% respectively. In Table 7, through the comparison of accuracy in different ranges, TransFusion's detection performance in difficult-to-detect objects or remote areas has been improved compared with lidar-only detection.

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Table 5 Ablation experiment of query initialization module

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Table 6 Ablation experiment of fusion part

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Table 7 The distance between the object center and the self-vehicle (in meters)

Conclusion

An effective And a robust Transformer-based lidar camera 3D detection framework with a soft correlation mechanism that can adaptively determine the location and information that should be obtained from the image. TransFusion achieves state-of-the-art results on the nuScenes detection and tracking leaderboards and shows competitive results on the Waymo detection benchmark. Extensive ablation experiments demonstrate the robustness of this method to poor image conditions.

DeepInteraction:

Main contribution:

The main problem solved is that existing multi-modal fusion strategies ignore modality-specific useful information, ultimately hindering model performance. Point clouds provide necessary positioning and geometric information at low resolutions, and images provide rich appearance information at high resolutions, so cross-modal information fusion is particularly important to enhance 3D target detection performance. The existing fusion module, as shown in Figure 1(a), integrates the information of the two modalities into a unified network space. However, doing so will prevent some information from being integrated into a unified representation, which reduces some of the specific information. Representational advantages of modality. In order to overcome the above limitations, the article proposes a new modal interaction module (Figure 1(b)). The key idea is to learn and maintain two modality-specific representations to achieve interaction between modalities. The main contributions are as follows:

Proposes a new modal interaction strategy for multi-modal three-dimensional target detection, aiming to solve the basic problem of previous modal fusion strategies losing useful information in each modality. Limitations;
Designed a DeepInteraction architecture with a multi-modal feature interactive encoder and a multi-modal feature prediction interactive decoder.

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Figure 1 Different fusion strategies

Module details

Multimodality Characterization interactive encoder Customize the encoder into a multiple-input multiple-output (MIMO) structure: take the two modal specific scene information independently extracted by the lidar and camera backbone as input, and generate two enhanced feature information. Each layer of encoder includes: i) multi-modal feature interaction (MMRI); ii) intra-modal feature learning; iii) representation integration.

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Figure 2 Multimodal representation interaction module

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

##Figure 3 Multi-modal prediction interaction module

Experiment

The data set and indicators are the same as the nuScenes data set part of TransFusion.

Experimental details The backbone network of the image is ResNet50. In order to save computing costs, the input image is rescaled to 1/2 of the original size before entering the network, and the image branch is frozen during training. the weight of. The voxel size is set to (0.075m, 0.075m, 0.2m), the detection range is set to [-54m, 54m] for the X-axis and Y-axis, and [-5m, 3m] for the Z-axis. Design 2 layers of encoder layers and 5 cascaded decoder layers. In addition, two online submission test models are set up: test time increase (TTA) and model integration, and the two settings are called DeepInteraction-large and DeepInteraction-e respectively. Among them, DeepInteraction-large uses Swin-Tiny as the image backbone network, and doubles the number of channels of the convolution block in the lidar backbone network, sets the voxel size to [0.5m, 0.5m, 0.2m], uses bidirectional flipping and Rotate the yaw angle [0°, ±6.25°, ±12.5°] to increase the test time. DeepInteraction-e integrates multiple DeepInteraction-large models, and the input lidar BEV grid sizes are [0.5m, 0.5m] and [1.5m, 1.5m].

Perform data augmentation according to the configuration of TransFusion: use random rotation in the range [-π/4,π/4], random scaling coefficients [0.9,1.1], and three-axis random translation with standard deviation 0.5 and random horizontal flipping, class-balanced resampling is also used in CBGS to balance the class distribution of nuScenes. The same two-stage training method as TransFusion is used, using TransFusion-L as the baseline for lidar-only training. The Adam optimizer uses a single-cycle learning rate strategy, with a maximum learning rate of 1×10−3, weight attenuation 0.01, momentum 0.85 ~ 0.95, and follows CBGS. The lidar baseline training is 20 rounds, the lidar image fusion is 6 rounds, the batch size is 16, and 8 NVIDIA V100 GPUs are used for training.

Comparison with state-of-the-art methods

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Table 1 Comparison with state-of-the-art methods on the nuScenes test set

As shown in Table 1, DeepInteraction achieves state-of-the-art performance under all settings. Table 2 compares the inference speeds tested on NVIDIA V100, A6000 and A100 respectively. It can be seen that while achieving high performance, a high inference speed is still maintained, which verifies that this method achieves a superior trade-off between detection performance and inference speed.

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Table 2 Inference speed comparison

Ablation experiment

Ablation of decoder Experiment

The designs of multimodal interactive prediction decoder and DETR decoder layers are compared in Table 3(a), and a hybrid design is used: using ordinary DETR decoder layer To aggregate features in the lidar representation, a multimodal predictive decoder for interaction (MMPI) is used to aggregate features in the image representation (second row). MMPI is significantly better than DETR, improving 1.3% mAP and 1.0% NDS, with design combination flexibility. Table 3(c) further explores the impact of different decoder layers on detection performance. It can be found that the performance continues to improve when adding 5 layers of decoders. Finally, different combinations of query numbers used in training and testing were compared. Under different choices, the performance was stable, but 200/300 was used as the optimal setting for training/testing.

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Table 3 Ablation experiment of decoder

Ablation experiment of encoder

From Table 4(a ) can be observed: (1) Compared with IML, the multi-modal representation interactive encoder (MMRI) can significantly improve the performance; (2) MMRI and IML can work well together to further improve the performance. As can be seen from Table 4(b), stacking encoder layers for iterative MMRI is beneficial.

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Table 4 Ablation experiment of encoder

Ablation experiment of lidar backbone network

Using two different lasers Radar backbone networks: PointPillar and VoxelNet to check the generality of the framework. For PointPillars, set the voxel size to (0.2m, 0.2m) while keeping the rest of the settings the same as DeepInteraction-base. Due to the proposed multi-modal interaction strategy, DeepInteraction shows consistent improvements over the lidar-only baseline when using either backbone (5.5% mAP for the voxel-based backbone and 4.4% mAP for the pillar-based backbone) ). This reflects the versatility of DeepInteraction among different point cloud encoders.

How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?

Table 5 Evaluation of different lidar backbones

Conclusion

In this work, it is presented A new 3D target detection method, DeepInteraction, is developed to explore the inherent multi-modal complementary properties. The key idea is to maintain two modality-specific representations and establish an interplay between them for representation learning and predictive decoding. This strategy is specifically designed to address the fundamental limitation of existing one-sided fusion methods, namely that the image representation is underutilized due to its auxiliary source character processing.

Summary of the two papers:

The above two papers are based on three-dimensional target detection based on lidar and camera fusion, which can also be seen from DeepInteraction It is based on further work of TransFusion. From these two papers, we can conclude that one direction of multi-sensor fusion is to explore more efficient dynamic fusion methods to focus on more effective information of different modalities. Of course, all this is based on high-quality information in both modalities. Multimodal fusion will have very important applications in future fields such as autonomous driving and intelligent robots. As the information extracted from different modalities gradually becomes richer, more and more information will be available to us. So how to combine these Using data more efficiently is also a question worth thinking about.

The above is the detailed content of How to use transformer to effectively correlate lidar-millimeter wave radar-visual features?. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Assassin's Creed Shadows: Seashell Riddle Solution

4 weeks ago By DDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

3 weeks ago By DDD

Where to find the Crane Control Keycard in Atomfall

4 weeks ago By DDD

Roblox: Dead Rails - How To Complete Every Challenge

1 months ago By DDD

Atomfall guide: item locations, quest guides, and tips

1 months ago By DDD

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Where is the login entrance for gmail email?

7714

Java Tutorial

1640

CakePHP Tutorial

1395

Laravel Tutorial

1289

PHP Tutorial

1232

Related knowledge

Why is Gaussian Splatting so popular in autonomous driving that NeRF is starting to be abandoned? Jan 17, 2024 pm 02:57 PM

Written above & the author’s personal understanding Three-dimensional Gaussiansplatting (3DGS) is a transformative technology that has emerged in the fields of explicit radiation fields and computer graphics in recent years. This innovative method is characterized by the use of millions of 3D Gaussians, which is very different from the neural radiation field (NeRF) method, which mainly uses an implicit coordinate-based model to map spatial coordinates to pixel values. With its explicit scene representation and differentiable rendering algorithms, 3DGS not only guarantees real-time rendering capabilities, but also introduces an unprecedented level of control and scene editing. This positions 3DGS as a potential game-changer for next-generation 3D reconstruction and representation. To this end, we provide a systematic overview of the latest developments and concerns in the field of 3DGS for the first time.

Learn about 3D Fluent emojis in Microsoft Teams Apr 24, 2023 pm 10:28 PM

You must remember, especially if you are a Teams user, that Microsoft added a new batch of 3DFluent emojis to its work-focused video conferencing app. After Microsoft announced 3D emojis for Teams and Windows last year, the process has actually seen more than 1,800 existing emojis updated for the platform. This big idea and the launch of the 3DFluent emoji update for Teams was first promoted via an official blog post. Latest Teams update brings FluentEmojis to the app Microsoft says the updated 1,800 emojis will be available to us every day

Solution to i7-7700 unable to upgrade to Windows 11 Dec 26, 2023 pm 06:52 PM

The performance of i77700 is completely sufficient to run win11, but users find that their i77700 cannot be upgraded to win11. This is mainly due to restrictions imposed by Microsoft, so they can install it as long as they skip this restriction. i77700 cannot be upgraded to win11: 1. Because Microsoft limits the CPU version. 2. Only the eighth generation and above versions of Intel can directly upgrade to win11. 3. As the 7th generation, i77700 cannot meet the upgrade needs of win11. 4. However, i77700 is completely capable of using win11 smoothly in terms of performance. 5. So you can use the win11 direct installation system of this site. 6. After the download is complete, right-click the file and "load" it. 7. Double-click to run the "One-click

Fall detection, based on skeletal point human action recognition, part of the code is completed with Chatgpt Apr 12, 2023 am 08:19 AM

Hello everyone. Today I would like to share with you a fall detection project, to be precise, it is human movement recognition based on skeletal points. It is roughly divided into three steps: human body recognition, human skeleton point action classification project source code has been packaged, see the end of the article for how to obtain it. 0. chatgpt First, we need to obtain the monitored video stream. This code is relatively fixed. We can directly let chatgpt complete the code written by chatgpt. There is no problem and can be used directly. But when it comes to business tasks later, such as using mediapipe to identify human skeleton points, the code given by chatgpt is incorrect. I think chatgpt can be used as a toolbox that is independent of business logic. You can try to hand it over to c

Choose camera or lidar? A recent review on achieving robust 3D object detection Jan 26, 2024 am 11:18 AM

0.Written in front&& Personal understanding that autonomous driving systems rely on advanced perception, decision-making and control technologies, by using various sensors (such as cameras, lidar, radar, etc.) to perceive the surrounding environment, and using algorithms and models for real-time analysis and decision-making. This enables vehicles to recognize road signs, detect and track other vehicles, predict pedestrian behavior, etc., thereby safely operating and adapting to complex traffic environments. This technology is currently attracting widespread attention and is considered an important development area in the future of transportation. one. But what makes autonomous driving difficult is figuring out how to make the car understand what's going on around it. This requires that the three-dimensional object detection algorithm in the autonomous driving system can accurately perceive and describe objects in the surrounding environment, including their locations,

CLIP-BEVFormer: Explicitly supervise the BEVFormer structure to improve long-tail detection performance Mar 26, 2024 pm 12:41 PM

Written above & the author’s personal understanding: At present, in the entire autonomous driving system, the perception module plays a vital role. The autonomous vehicle driving on the road can only obtain accurate perception results through the perception module. The downstream regulation and control module in the autonomous driving system makes timely and correct judgments and behavioral decisions. Currently, cars with autonomous driving functions are usually equipped with a variety of data information sensors including surround-view camera sensors, lidar sensors, and millimeter-wave radar sensors to collect information in different modalities to achieve accurate perception tasks. The BEV perception algorithm based on pure vision is favored by the industry because of its low hardware cost and easy deployment, and its output results can be easily applied to various downstream tasks.

Paint 3D in Windows 11: Download, Installation, and Usage Guide Apr 26, 2023 am 11:28 AM

When the gossip started spreading that the new Windows 11 was in development, every Microsoft user was curious about how the new operating system would look like and what it would bring. After speculation, Windows 11 is here. The operating system comes with new design and functional changes. In addition to some additions, it comes with feature deprecations and removals. One of the features that doesn't exist in Windows 11 is Paint3D. While it still offers classic Paint, which is good for drawers, doodlers, and doodlers, it abandons Paint3D, which offers extra features ideal for 3D creators. If you are looking for some extra features, we recommend Autodesk Maya as the best 3D design software. like

Get a virtual 3D wife in 30 seconds with a single card! Text to 3D generates a high-precision digital human with clear pore details, seamlessly connecting with Maya, Unity and other production tools May 23, 2023 pm 02:34 PM

ChatGPT has injected a dose of chicken blood into the AI industry, and everything that was once unthinkable has become basic practice today. Text-to-3D, which continues to advance, is regarded as the next hotspot in the AIGC field after Diffusion (images) and GPT (text), and has received unprecedented attention. No, a product called ChatAvatar has been put into low-key public beta, quickly garnering over 700,000 views and attention, and was featured on Spacesoftheweek. △ChatAvatar will also support Imageto3D technology that generates 3D stylized characters from AI-generated single-perspective/multi-perspective original paintings. The 3D model generated by the current beta version has received widespread attention.

See all articles