Sparse4D v3: Advancing End-to-End 3D Detection and Tracking
Paper link: https://arxiv.org/pdf/2311.11722.pdf
Code link: https://github.com/linxuewu/Sparse4D
The authors are affiliated with Horizon Robotics.
3D detection and tracking are two fundamental tasks in autonomous driving perception systems. Building on the Sparse4D framework, this paper introduces two auxiliary training tasks (Temporal Instance Denoising and Quality Estimation) and proposes decoupled attention as a structural improvement, significantly boosting detection performance. The detector is further extended into a tracker with a simple approach that assigns instance IDs during inference, further highlighting the advantages of query-based algorithms. Extensive experiments on the nuScenes benchmark validate the effectiveness of the proposed improvements: with ResNet50 as the backbone, mAP, NDS and AMOTA increase by 3.0%, 2.2% and 7.6%, reaching 46.9%, 56.1% and 49.0% respectively. The best model achieves 71.9% NDS and 67.7% AMOTA on the nuScenes test set.
Sparse4Dv3 is a powerful 3D perception framework that proposes three effective strategies: temporal instance denoising, quality estimation, and decoupled attention.
The paper extends Sparse4D into an end-to-end tracking model.
The paper demonstrates the effectiveness of these improvements on nuScenes, achieving state-of-the-art performance in both detection and tracking tasks.
First, the authors observe that sparse algorithms face greater convergence challenges than dense algorithms, which affects final performance. This problem has been well studied in 2D detection [17, 48, 53] and is mainly attributed to the one-to-one positive-sample matching that sparse algorithms use: this matching is unstable in the early stages of training, and compared with one-to-many matching it yields a limited number of positive samples, reducing the efficiency of decoder training. Furthermore, Sparse4D uses sparse feature sampling instead of global cross-attention, and the scarcity of positive samples further hinders encoder convergence. In Sparse4Dv2, dense depth supervision was introduced to partially alleviate these convergence issues for the image encoder. The main goal of this paper is instead to enhance model performance by stabilizing decoder training.

The paper uses a denoising task as auxiliary supervision and extends the denoising technique from 2D single-frame detection to 3D temporal detection. This not only ensures stable positive-sample matching but also significantly increases the number of positive samples. In addition, a quality estimation task is introduced as auxiliary supervision, which makes the output confidence scores more meaningful, improves the ranking of detection results, and thus yields higher evaluation metrics.

Finally, the paper improves the structure of the instance self-attention and temporal cross-attention modules in Sparse4D, introducing a decoupled attention mechanism that reduces feature interference during attention-weight computation. By using anchor embeddings and instance features as inputs to the attention calculation, outliers in the attention weights are reduced. This more accurately reflects the correlation between target features and thereby enables correct feature aggregation.
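The denoising idea above can be sketched in a few lines: noisy copies of the ground-truth anchors are generated in independent groups, so positive/negative assignment is fixed by construction rather than by unstable bipartite matching. This is a minimal illustration, not the paper's implementation; the noise scales, group count, and anchor layout are assumptions.

```python
import numpy as np

def make_noisy_anchors(gt_anchors, num_groups=5, pos_scale=0.1, neg_scale=1.0, seed=0):
    """Generate groups of noisy anchors around ground-truth boxes (sketch).

    gt_anchors: (N, D) array of ground-truth anchor parameters
    (e.g. x, y, z, dims, yaw). Each group holds one positive copy
    (small perturbation, matched to its GT box) and one negative copy
    (large perturbation) per ground-truth box, so the positive-sample
    assignment is known in advance and stays stable during training.
    """
    rng = np.random.default_rng(seed)
    groups = []
    for _ in range(num_groups):
        pos = gt_anchors + rng.uniform(-pos_scale, pos_scale, gt_anchors.shape)
        neg = gt_anchors + rng.uniform(-neg_scale, neg_scale, gt_anchors.shape)
        groups.append((pos, neg))
    return groups
```

Because every noisy anchor is tied to the ground-truth box it was sampled from, the decoder sees many more positives per iteration than one-to-one matching alone would provide.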
In the attention calculation, this paper uses concatenation instead of addition, which significantly reduces this interference. This improvement has similarities with Conditional DETR, but the key difference is that this paper targets attention among queries, whereas Conditional DETR focuses on cross-attention between queries and image features. In addition, the anchor encoding method used here is also different.
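The contrast between the two formulations can be shown with plain numpy: in the standard form the positional (anchor) embedding is added to the instance feature before the query/key projections, so the two signals can interfere inside the dot product; in the decoupled form they are concatenated. This is an illustrative single-head sketch under assumed shapes, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_attention_weights(feat, pe, w_q, w_k):
    # Standard formulation: anchor embedding is ADDED to the instance feature.
    # feat, pe: (N, C); w_q, w_k: (C, C)
    q = (feat + pe) @ w_q
    k = (feat + pe) @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def decoupled_attention_weights(feat, pe, w_q, w_k):
    # Decoupled formulation: feature and embedding are CONCATENATED, so the
    # cross terms that cause interference in the dot product are removed.
    # feat, pe: (N, C); w_q, w_k: (2C, C)
    q = np.concatenate([feat, pe], axis=-1) @ w_q
    k = np.concatenate([feat, pe], axis=-1) @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))
```

The concatenated projection has roughly twice the input width, but keeps feature-feature and anchor-anchor similarity terms separable in the learned weights.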
To improve the end-to-end capability of the perception system, this paper studies how to integrate the 3D multi-object tracking task into the Sparse4D framework so that target trajectories are output directly. Unlike tracking-by-detection methods, this paper integrates all tracking functionality into the detector itself, eliminating the need for data association and filtering. Furthermore, unlike existing joint detection-and-tracking methods, this tracker requires no modification or tuning of the loss function during training and no ground-truth IDs; instead it relies on the predefined instance-to-track assignment. The tracking implementation fully unifies the detector and the tracker, without modifying the detector's training process and without additional fine-tuning.
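Because temporal instances persist across frames in Sparse4D, tracking can reduce to ID bookkeeping at inference time: an instance propagated from the previous frame keeps its ID, and a newly spawned instance above a confidence threshold receives a fresh one. The sketch below illustrates this idea; the dict layout, threshold value, and class name are assumptions, not the authors' code.

```python
class InstanceTracker:
    """Minimal sketch of inference-time ID assignment for temporal instances.

    Instances carried over from the previous frame keep their track ID;
    new instances whose confidence exceeds a threshold are given a fresh
    ID. No data association or filtering step is required, mirroring the
    paper's description at a high level (the threshold is illustrative).
    """

    def __init__(self, new_track_threshold=0.25):
        self.threshold = new_track_threshold
        self.next_id = 0

    def update(self, instances):
        # instances: list of dicts with 'id' (None if newly spawned) and 'score'
        results = []
        for inst in instances:
            if inst["id"] is None and inst["score"] >= self.threshold:
                inst["id"] = self.next_id
                self.next_id += 1
            if inst["id"] is not None:
                results.append(inst)
        return results
```

For example, an instance that first appears with score 0.9 gets ID 0 and keeps that ID in every later frame in which it is propagated; low-confidence new instances are simply not emitted as tracks yet.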
Figure 1: Overview of the Sparse4D framework. The input is a multi-view video, and the output is the perception result for all frames.
Figure 2: Inference efficiency (FPS) versus perception performance (mAP) of different algorithms on the nuScenes validation set.
Figure 3: Visualization of attention weights in instance self-attention. 1) The first row shows the weights under ordinary self-attention, where the pedestrian in the red circle exhibits an unexpected correlation with the target vehicle (green box). 2) The second row shows the weights under decoupled attention, which effectively resolves this problem.
Figure 4: An example of temporal instance denoising. During training, instances consist of two parts: learnable and noisy. Noisy instances consist of temporal and non-temporal elements. A pre-matching approach assigns positive and negative samples, i.e. noisy anchors are matched against the ground truth, while learnable instances are matched between predictions and ground truth. At test time, only the green blocks remain. To prevent features from spreading between groups, an attention mask is used: gray indicates no attention between queries and keys, and green indicates the opposite.
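One plausible way to build the group-isolating mask described in this caption is a block structure: the learnable queries attend among themselves, and each noisy group attends only within itself, so denoising features never leak across groups. This is a sketch of that construction (the block layout is an assumption inferred from the caption, with True meaning attention is allowed).

```python
import numpy as np

def build_group_attention_mask(num_learnable, group_sizes):
    """Block attention mask isolating denoising groups (sketch).

    num_learnable: number of ordinary learnable queries.
    group_sizes:   list with the query count of each noisy group.
    Returns a boolean (total, total) matrix; True = attention allowed.
    """
    total = num_learnable + sum(group_sizes)
    mask = np.zeros((total, total), dtype=bool)
    # Learnable queries may attend to each other.
    mask[:num_learnable, :num_learnable] = True
    # Each noisy group only attends within its own block.
    start = num_learnable
    for size in group_sizes:
        mask[start:start + size, start:start + size] = True
        start += size
    return mask
```

At test time the noisy groups are dropped entirely, so only the top-left learnable block is ever used, matching "only the green blocks remain".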
Figure 5: Architecture of the anchor encoder and attention. The high-dimensional features of the anchor's multiple components are encoded independently and then concatenated, which reduces computational and parameter overhead compared to the original Sparse4D. E and F denote the anchor embedding and instance features, respectively.
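The independent-then-concatenate encoding of Figure 5 can be sketched as follows: each anchor component (position, size, yaw, ...) passes through its own small projection, and the outputs are concatenated into the anchor embedding E. Component names, dimensions, and the single ReLU layer are illustrative assumptions, not the paper's exact encoder.

```python
import numpy as np

def encode_anchor(anchor, weights):
    """Encode anchor components independently, then concatenate (sketch).

    anchor:  dict of component vectors, e.g. {'position': (3,),
             'size': (3,), 'yaw': (2,)}.
    weights: dict mapping each component name to its (len(vec), d)
             projection matrix.
    Each component is projected separately (here a one-layer ReLU MLP);
    the concatenation forms the anchor embedding E.
    """
    parts = []
    for name, vec in anchor.items():
        w = weights[name]                       # per-component projection
        parts.append(np.maximum(vec @ w, 0.0))  # ReLU
    return np.concatenate(parts)
```

Compared with encoding the full anchor vector through one wide MLP, the per-component projections touch fewer weights, which is the overhead saving the caption refers to.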
The article first proposes methods to enhance the detection performance of Sparse4D, covering three aspects: temporal instance denoising, quality estimation, and decoupled attention. It then explains how Sparse4D is extended into an end-to-end tracking model. Experiments on nuScenes show that these enhancements significantly improve performance, placing Sparse4Dv3 at the forefront of the field.
Lin, X., Pei, Z., Lin, T., Huang, L., & Su, Z. (2023). Sparse4D v3: Advancing End-to-End 3D Detection and Tracking. arXiv:2311.11722.