


SupFusion: Exploring how to effectively supervise Lidar-Camera fused 3D detection networks?
3D detection based on LiDAR-camera fusion is a key task for autonomous driving. In recent years, many LiDAR-camera fusion methods have emerged and achieved good performance, but they generally lack a well-designed and effectively supervised fusion process.
This article introduces a new training strategy called SupFusion, which provides auxiliary feature-level supervision for LiDAR-camera fusion and significantly improves detection performance. The method includes a data augmentation technique, Polar Sampling, for densifying sparse targets, and trains an auxiliary model to generate high-quality features for supervision. These features are used when training the LiDAR-camera fusion model, optimizing the fused features to mimic the high-quality ones. Furthermore, a simple yet effective deep fusion module is proposed, which consistently outperforms previous fusion methods when trained with the SupFusion strategy. The method has the following advantages: first, SupFusion introduces auxiliary feature-level supervision, which improves LiDAR-camera detection performance without adding any inference cost; second, the proposed deep fusion consistently improves the detector's capability. The proposed SupFusion and deep fusion modules are plug-and-play, and extensive experiments demonstrate their effectiveness. On the KITTI 3D detection benchmark, an improvement of approximately 2% 3D mAP is achieved across multiple LiDAR-camera detectors.
Figure 1: Top: previous LiDAR-camera 3D detection models, where the fusion module is optimized only by the detection loss. Bottom: SupFusion proposed in this article, which introduces auxiliary supervision through high-quality features provided by an auxiliary model.
3D detection based on LiDAR-camera fusion is a critical and challenging task in autonomous driving and robotics. Previous methods usually project the camera input into the LiDAR BEV or voxel space to align LiDAR and camera features, and then use a simple concatenation or summation to obtain the fused features for final detection. Some deeper, learning-based fusion methods have also achieved promising performance. However, previous fusion methods always optimize the 3D/2D feature extraction and fusion modules directly through the detection loss, lacking careful design and effective supervision at the feature level, which limits their performance.
In recent years, distillation methods have shown great potential for feature-level supervision in 3D detection. Some methods use LiDAR features to guide the 2D backbone in estimating depth information from the camera input. Other methods use LiDAR-camera fusion features to supervise the LiDAR backbone in learning global and contextual representations from the LiDAR input. By introducing feature-level auxiliary supervision that mimics more robust, high-quality features, the detector can achieve further improvements. Inspired by this, a natural way to handle LiDAR-camera feature fusion is to provide stronger, high-quality features and introduce auxiliary supervision for LiDAR-camera 3D detection.
To improve the performance of LiDAR-camera fusion based 3D detection, this article proposes a supervised LiDAR-camera fusion method called SupFusion, which generates high-quality features and provides effective supervision for the fusion and feature extraction processes. First, an auxiliary model is trained to provide high-quality features. Unlike previous methods that exploit larger models or additional data, a new data augmentation method called Polar Sampling is proposed. Polar Sampling densifies targets in sparse LiDAR data, making them easier to detect and thereby improving feature quality and detection accuracy. Then, a LiDAR-camera fusion detector is trained with the additional feature-level supervision. In this step, the raw LiDAR and camera inputs are fed into the 3D/2D backbones and the fusion module to obtain fused features. The fused features are passed to the detection head for the final prediction, while the auxiliary supervision drives the fused features to mimic the high-quality features, which are obtained from the pre-trained auxiliary model and the augmented LiDAR data. In this way, the proposed feature-level supervision enables the fusion module to generate more robust features and further improves detection performance. To better fuse LiDAR and camera features, a simple and effective deep fusion module is proposed, consisting of stacked MLP blocks and dynamic fusion blocks. SupFusion can fully exploit the capability of the deep fusion module and continuously improve detection accuracy.
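As a minimal sketch of the detector-training step described above (names such as `lidar_backbone`, `fusion_module`, `extract_features`, and the weight `lambda_sup` are illustrative assumptions, not the authors' code), the auxiliary supervision can be implemented as a feature-mimicking loss added to the ordinary detection loss:

```python
import torch
import torch.nn.functional as F

def supfusion_training_step(batch, detector, aux_model, lambda_sup=1.0):
    """One illustrative training step: detection loss + auxiliary feature supervision."""
    # Regular forward pass on the raw LiDAR and camera inputs.
    f_3d = detector.lidar_backbone(batch["points"])        # 3D (LiDAR) features
    f_2d = detector.image_backbone(batch["images"])        # 2D (camera) features
    f_fused = detector.fusion_module(f_3d, f_2d)           # fused BEV/voxel features
    preds = detector.head(f_fused)
    loss_det = detector.head.loss(preds, batch["gt_boxes"])

    # High-quality target features f* from the frozen auxiliary model,
    # computed on the Polar-Sampling-densified point cloud.
    with torch.no_grad():
        f_star = aux_model.extract_features(batch["dense_points"])

    # Auxiliary feature-level supervision: make the fused features mimic f*.
    loss_sup = F.mse_loss(f_fused, f_star)
    return loss_det + lambda_sup * loss_sup
```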
Main contributions of this article:
- A new supervised fusion training strategy, SupFusion, is proposed. It mainly consists of a high-quality feature generation process and, for the first time, introduces auxiliary feature-level supervision for robust fusion feature extraction and accurate 3D detection.
- To obtain high-quality features in SupFusion, a data augmentation method called Polar Sampling is proposed to densify sparse targets. In addition, an effective deep fusion module is proposed to continuously improve detection accuracy.
- Extensive experiments were conducted on multiple detectors with different fusion strategies, obtaining about 2% mAP improvement on the KITTI benchmark.
Proposed method
The high-quality feature generation process is shown in the figure below. For any given LiDAR sample, sparse targets are densified by Polar Pasting: the direction and rotation of each target are computed to query a dense target from the database, and extra points are added to the sparse target by pasting. The auxiliary model is first trained with the augmented data; after it converges, the augmented LiDAR data is fed into it to generate high-quality features f*.
High-quality feature generation
To provide feature-level supervision in SupFusion, an auxiliary model is adopted to generate high-quality features from the augmented data, as shown in Figure 3. First, the auxiliary model is trained to provide high-quality features. For any sample in D, the sparse LiDAR data is augmented by Polar Pasting, which densifies sparse targets by adding the point sets generated in Polar Grouping. Then, after the auxiliary model converges, the augmented samples are fed into the optimized auxiliary model to capture high-quality features for training the LiDAR-camera 3D detection model. To fit any given LiDAR-camera detector and keep the approach easy to implement, the LiDAR branch of the detector is simply adopted as the auxiliary model.
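A rough sketch of this first stage, under the assumption that the auxiliary model is simply the LiDAR-only branch of the detector and that a `polar_paste` function densifies each annotated target (all names and the training API here are illustrative):

```python
import torch

def build_high_quality_features(aux_model, dataset, polar_paste, epochs=80):
    """Stage 1 (illustrative): train the auxiliary LiDAR-only model on densified
    data, then freeze it and cache high-quality features f* for every sample."""
    optimizer = torch.optim.AdamW(aux_model.parameters(), lr=1e-3)

    # 1) Train the auxiliary model on Polar-Pasting-augmented point clouds.
    aux_model.train()
    for _ in range(epochs):
        for sample in dataset:
            dense_points = polar_paste(sample["points"], sample["gt_boxes"])
            loss = aux_model.training_loss(dense_points, sample["gt_boxes"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # 2) After convergence, extract high-quality features f* from the augmented
    #    data; these later supervise the LiDAR-camera detector's fused features.
    aux_model.eval()
    features = {}
    with torch.no_grad():
        for sample in dataset:
            dense_points = polar_paste(sample["points"], sample["gt_boxes"])
            features[sample["token"]] = aux_model.extract_features(dense_points)
    return features
```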
Detector training
For any given LiDAR-camera detector, the model is trained with the proposed feature-level auxiliary supervision. Given a sample consisting of a LiDAR point cloud P and a camera image I, they are first fed into the 3D and 2D encoders to capture the corresponding features; these features are input into the fusion module to generate the fused feature, which then flows into the detection head for the final prediction. In addition, the proposed auxiliary supervision forces the fused feature to mimic the high-quality feature generated by the pre-trained auxiliary model from the augmented LiDAR data. The above process can be formulated as follows.
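A plausible reconstruction of this training objective, assuming a standard detection loss plus an L2 feature-mimicking term weighted by a hyperparameter λ (the exact form used in the paper may differ):

$$ f_f = \mathcal{F}\big(E_{3D}(P),\, E_{2D}(I)\big), \qquad \mathcal{L} = \mathcal{L}_{det}\big(\mathrm{Head}(f_f),\, y\big) + \lambda\,\big\lVert f_f - f^{*} \big\rVert_2^2 $$

where $E_{3D}$ and $E_{2D}$ are the 3D and 2D encoders, $\mathcal{F}$ is the fusion module, $y$ are the ground-truth boxes, and $f^{*}$ is the high-quality feature produced by the pre-trained auxiliary model from the augmented LiDAR data.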
Polar Sampling
To provide high-quality features, this article introduces a new data augmentation method called Polar Sampling within the proposed SupFusion, addressing the sparsity problem that often causes detection failures. To this end, sparse targets in the LiDAR data are densified so that they resemble dense targets. Polar Sampling consists of two parts: Polar Grouping and Polar Pasting. Polar Grouping builds a database that stores dense targets, which Polar Pasting then uses to make sparse targets denser.
Considering the characteristics of the LiDAR sensor, the collected point cloud data naturally has a specific density distribution. For example, an object has more points on the surface facing the LiDAR sensor and fewer points on the opposite side. The density distribution is mainly determined by orientation and rotation, while the number of points mainly depends on distance: objects closer to the LiDAR sensor have denser points. Inspired by this, the goal is to densify distant, sparse targets using nearby, dense targets according to the sparse target's direction and rotation, so that the density distribution is preserved. A polar coordinate system is established for the whole scene and for each target, based on the scene center and the specific target, and the positive direction of the LiDAR sensor is defined as 0 degrees to measure the corresponding direction and rotation. Targets with similar density distributions (i.e., similar orientations and rotations) are then collected, a dense target is generated for each group in Polar Grouping, and these dense targets are used in Polar Pasting to densify sparse targets.
Polar Grouping
As shown in Figure 4, a database B is constructed in Polar Grouping to store the generated dense object point sets l according to direction and rotation, denoted as α and β in Figure 4.
First, the entire dataset is traversed, and the polar angles of all targets are calculated from the positions and rotations provided in the annotations. Second, the targets are divided into groups based on their polar angles: the orientation and rotation ranges are manually divided into N groups, and any target point set l is placed into the corresponding group according to its index, as sketched below.
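The exact indexing formula is not reproduced here; the following sketch shows one plausible way to bin a target by its direction α and rotation β, assuming both ranges are split into N equal bins and flattened into a single group index (the layout is an assumption, not the paper's exact equation):

```python
import math

def polar_group_index(box_center_xy, box_yaw, num_bins):
    """Illustrative Polar Grouping: bin a target by its direction (alpha) and
    rotation (beta) in the scene's polar frame, each split into num_bins."""
    x, y = box_center_xy
    alpha = math.atan2(y, x) % (2 * math.pi)      # direction w.r.t. the scene centre
    beta = (box_yaw - alpha) % (2 * math.pi)      # heading relative to the viewing ray
    bin_size = 2 * math.pi / num_bins
    a_idx = min(int(alpha // bin_size), num_bins - 1)
    b_idx = min(int(beta // bin_size), num_bins - 1)
    # One flat index per (direction, rotation) cell; the dense point set for
    # each cell is stored in database B under this index.
    return a_idx * num_bins + b_idx
```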
Polar Pasting
As shown in Figure 2, Polar Pasting is used to augment the sparse LiDAR data for training the auxiliary model and generating the high-quality features. Given a LiDAR sample containing several targets, the orientation and rotation of each target are calculated in the same way as in the grouping process, and the dense target is queried from B based on the label and group index (Eq. 6). Applying this to all targets in a sample yields the augmented data.
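A simplified sketch of the pasting step, reusing the `polar_group_index` helper from the grouping sketch above and assuming that database `dense_db` stores (N, 3) point sets in a canonical box frame and that boxes are given as (x, y, z, dx, dy, dz, yaw); these conventions are assumptions for illustration:

```python
import numpy as np

def rotate_z(points_xyz, yaw):
    """Rotate (N, 3) points around the z-axis by yaw radians."""
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points_xyz @ rot.T

def polar_paste(points, gt_boxes, gt_labels, dense_db, num_bins=10):
    """Illustrative Polar Pasting: for every target, look up a dense point set of
    the same class and (direction, rotation) group, move it into the target's box,
    and append the extra points to the scene. points is an (N, 3) array."""
    augmented = [points]
    for box, label in zip(gt_boxes, gt_labels):
        key = (label, polar_group_index(box[:2], box[6], num_bins))
        template = dense_db.get(key)          # dense points in a canonical box frame
        if template is None:
            continue
        pasted = rotate_z(template, box[6]) + box[:3]   # align heading, then translate
        augmented.append(pasted)
    return np.concatenate(augmented, axis=0)
```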
Deep Fusion
To mimic the high-quality features generated from augmented LiDAR data, the fusion module is designed to extract the missing information of sparse objects from the rich color and contextual features of the camera input. To this end, a deep fusion module is proposed to exploit image features and complete the LiDAR representation. The proposed deep fusion mainly consists of a 3D learner and a 2D-3D learner. The 3D learner is a simple convolutional layer used to transfer 3D representations into 2D space. Then, to combine the 2D features with the transferred 3D representations (i.e., in 2D space), the 2D-3D learner fuses the LiDAR and camera features. Finally, the fused features are weighted by an MLP and an activation function and added back to the original LiDAR features as the output of the deep fusion module. The 2D-3D learner consists of K stacked MLP blocks and learns to leverage camera features to complete the LiDAR representation of sparse targets, mimicking the high-quality features of dense LiDAR targets.
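A compact sketch of such a module, assuming LiDAR and camera feature maps of matching spatial size in BEV and hypothetical channel counts; 1x1 convolutions stand in for per-location MLP blocks, and the gating/residual structure follows the description above rather than the authors' released code:

```python
import torch
import torch.nn as nn

class DeepFusion(nn.Module):
    """Illustrative deep fusion block: a 3D learner projects LiDAR features into 2D
    space, a stacked-MLP 2D-3D learner fuses them with camera features, and an
    MLP + sigmoid gate weights the result before a residual add."""

    def __init__(self, lidar_ch=256, cam_ch=256, depth_k=3):
        super().__init__()
        self.learner_3d = nn.Conv2d(lidar_ch, cam_ch, kernel_size=1)   # 3D learner
        ch = 2 * cam_ch                                                # after concatenation
        blocks = []
        for _ in range(depth_k):                                       # 2D-3D learner: K MLP blocks
            blocks += [nn.Conv2d(ch, ch, kernel_size=1), nn.ReLU(inplace=True)]
        self.learner_2d3d = nn.Sequential(*blocks)
        self.gate = nn.Sequential(nn.Conv2d(ch, lidar_ch, kernel_size=1), nn.Sigmoid())
        self.out_proj = nn.Conv2d(ch, lidar_ch, kernel_size=1)

    def forward(self, f_lidar, f_cam):
        f_l2d = self.learner_3d(f_lidar)                   # lift LiDAR features into 2D space
        fused = self.learner_2d3d(torch.cat([f_l2d, f_cam], dim=1))
        weighted = self.out_proj(fused) * self.gate(fused)  # MLP + activation weighting
        return f_lidar + weighted                            # residual add onto LiDAR features
```

For example, with `lidar_ch = cam_ch = 256`, a BEV LiDAR feature map of shape (B, 256, H, W) and a camera feature map projected onto the same BEV grid can be fused via `DeepFusion()(f_lidar, f_cam)`.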
Experimental comparative analysis
Experimental results (mAP@R40%). Results are listed for the easy, moderate (mod.), and hard cases, as well as overall performance. Here L, LC, and LC* denote the corresponding LiDAR-only detector, the LiDAR-camera fusion detector, and the results with this article's proposal, respectively; Δ denotes the improvement. The best results are shown in bold. L† denotes the auxiliary model and is tested on the augmented validation set. MVXNet is re-implemented based on mmdetection3d; PV-RCNN-LC and Voxel-RCNN-LC are re-implemented based on the open-source code of VFF.
Overall performance. The comparison of 3D mAP@R40 over three detectors in Table 1 shows the overall performance for each category and each difficulty split. It can clearly be observed that, by introducing an additional camera input, the LiDAR-camera methods (LC) outperform the LiDAR-only detectors (L). By introducing Polar Sampling, the auxiliary model (L†) shows strong performance on the augmented validation set (e.g., over 90% mAP). With the auxiliary supervision from high-quality features and the proposed deep fusion module, the proposal consistently improves detection accuracy. For example, compared with the baseline (LC) models, the proposal achieves 1.54% and 1.24% 3D mAP improvements on moderate and hard targets, respectively. In addition, experiments on the nuScenes benchmark based on SECOND-LC, shown in Table 2, yield NDS and mAP improvements of 2.01% and 1.38%, respectively.
Class-wise improvement analysis. Compared with the baseline models, SupFusion and deep fusion improve not only the overall performance but also the detection performance of each category, including pedestrians. Comparing the average improvement across the three categories (e.g., the moderate case), the following can be observed: cyclists see the largest improvement (2.41%), while pedestrians and cars improve by 1.35% and 0.86%, respectively. The reasons are clear: (1) cars are easier to detect than pedestrians and cyclists and already obtain the best results, and are therefore harder to improve; (2) cyclists gain more than pedestrians because pedestrians are non-rigid and yield less dense targets than cyclists, and therefore obtain a smaller performance improvement.
Please click the following link to view the original content: https://mp.weixin.qq.com/s/vWew2p9TrnzK256y-A4UFw