The end-to-end paradigm uses a unified framework to handle multiple tasks in an autonomous driving system. Despite the simplicity and clarity of this paradigm, the performance of end-to-end autonomous driving methods on subtasks still lags far behind single-task methods. At the same time, the dense bird's-eye view (BEV) features widely used in previous end-to-end methods make it difficult to scale to more modalities or tasks. A sparse query-centric end-to-end autonomous driving paradigm (SparseAD) is proposed here, in which sparse queries represent the entire driving scenario across space, time, and tasks, without any dense BEV representation. Specifically, a unified sparse architecture is designed for perception tasks including detection, tracking, and online mapping. Furthermore, motion prediction and planning are revisited, and a more reasonable motion planner framework is designed. On the challenging nuScenes dataset, SparseAD achieves state-of-the-art full-task performance among end-to-end methods and narrows the performance gap between the end-to-end paradigm and single-task methods.
Autonomous driving systems need to make correct decisions in complex driving scenarios to ensure driving safety and comfort. Typically, an autonomous driving system integrates multiple tasks such as detection, tracking, online mapping, motion prediction, and planning. As shown in Figure 1a, the traditional modular paradigm splits the complex system into multiple individual tasks, each of which is optimized independently. In this paradigm, manual post-processing is required between the independent single-task modules, which makes the entire pipeline more cumbersome. In addition, because scene information is compressed with loss between stacked tasks, errors accumulate across the system, which may lead to safety issues.
To address these issues, end-to-end autonomous driving systems take raw sensor data as input and return planning results in a more concise way. Early work proposed skipping intermediate tasks and predicting planning results directly from raw sensor data. Although this approach is more straightforward, it is unsatisfactory in terms of model optimization, interpretability, and planning performance. Another paradigm with better interpretability integrates multiple parts of autonomous driving into a modular end-to-end model, which introduces multi-dimensional supervision to improve the understanding of complex driving scenarios and brings multi-task capability.
As shown in Figure 1b, in most advanced modular end-to-end methods, the entire driving scenario is characterized by dense bird's-eye view (BEV) features, which aggregate multi-sensor and temporal information and serve as input to full-stack driving tasks including perception, prediction, and planning. Because densely aggregated BEV features play the key role in achieving multi-modality and multi-tasking across space and time, previous end-to-end methods using the BEV representation are summarized here as the Dense BEV-Centric paradigm. Despite the simplicity and interpretability of these methods, their performance on each subtask of autonomous driving still lags far behind the corresponding single-task methods. In addition, under the Dense BEV-Centric paradigm, long-term temporal fusion and multi-modal fusion are mainly achieved through multiple BEV feature maps, which significantly increases computing cost and memory usage and places a greater burden on actual deployment.
A novel sparse query-centric end-to-end autonomous driving paradigm (SparseAD) is proposed here. In this paradigm, the spatial and temporal elements of the entire driving scene are represented by sparse queries, abandoning the traditional dense BEV features, as shown in Figure 1c. This sparse representation enables end-to-end models to use longer historical information more efficiently and to scale to more modalities and tasks, while significantly reducing computational cost and memory footprint.
The modular end-to-end architecture has been redesigned and simplified into a concise structure consisting of sparse perception and a motion planner. In the sparse perception module, a universal temporal decoder is used to unify perception tasks including detection, tracking, and online mapping. In this process, multi-sensor features and historical memories are treated as tokens, while object queries and map queries represent obstacles and road elements in the driving scene, respectively. In the motion planner, sparse perception queries serve as the environment representation, and multi-modal motion prediction is performed on the ego vehicle and surrounding agents simultaneously to obtain multiple initial planning solutions for the ego vehicle. Subsequently, multi-dimensional driving constraints are fully considered to generate the final planning result.
Main contributions:
As shown in Figure 1c, in the proposed sparse query-centric paradigm, different sparse queries completely represent the entire driving scene. They are not only responsible for information transfer and interaction between modules, but also propagate gradients backward across tasks so that the whole model is optimized end to end. Unlike previous dense BEV-centric methods, SparseAD uses no view projection and no dense BEV features, thus avoiding heavy computational and memory burdens. The detailed architecture of SparseAD is shown in Figure 2.
As the architecture diagram shows, SparseAD mainly consists of three parts: the sensor encoder, sparse perception, and the motion planner. Specifically, the sensor encoder takes multi-view camera images and radar or lidar points as input and encodes them into high-dimensional features. These features are then fed into the sparse perception module as sensor tokens along with position embeddings (PE). In the sparse perception module, raw data from the sensors is aggregated into a variety of sparse perception queries, such as detection queries, tracking queries, and map queries, which represent different elements in the driving scene and are further propagated to downstream tasks. In the motion planner, the perception queries are treated as a sparse representation of the driving scene and are fully exploited for all surrounding agents and the ego vehicle. At the same time, multiple driving constraints are considered to generate a final plan that is both safe and dynamically compliant.
In addition, an end-to-end multi-task memory bank is introduced in the architecture to uniformly store the temporal information of the entire driving scene, which allows the system to benefit from the aggregation of long-term historical information when completing full-stack driving tasks.
As shown in Figure 3, SparseAD’s sparse perception module unifies multiple perception tasks in a sparse manner, including detection, tracking, and online mapping. Specifically, there are two structurally identical temporal decoders that exploit long-term historical information from the memory bank. One decoder is used for obstacle perception and the other for online mapping.
After information is aggregated through the perception queries corresponding to the different tasks, the detection and tracking head and the map head decode and output obstacles and map elements, respectively. An update process then filters and saves the high-confidence perception queries of the current frame and updates the memory bank accordingly, which benefits the perception process of the next frame.
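The update step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MemoryBank` class, the `score_threshold` of 0.5, and the `max_frames` capacity are hypothetical names and values chosen for the example, and queries are stood in for by plain strings.

```python
from collections import deque


class MemoryBank:
    """Hypothetical FIFO store of high-confidence queries from past frames."""

    def __init__(self, max_frames=4, score_threshold=0.5):
        # Both values are illustrative assumptions, not taken from the paper.
        self.frames = deque(maxlen=max_frames)
        self.score_threshold = score_threshold

    def update(self, queries, scores):
        """Keep only the current frame's queries whose confidence passes the threshold."""
        kept = [q for q, s in zip(queries, scores) if s >= self.score_threshold]
        self.frames.append(kept)
        return kept

    def tokens(self):
        """Flatten stored queries (oldest frame first) into memory tokens for the next frame."""
        return [q for frame in self.frames for q in frame]


bank = MemoryBank(max_frames=2, score_threshold=0.5)
bank.update(["car_a", "ped_b", "noise"], [0.9, 0.6, 0.1])
bank.update(["car_a"], [0.8])
bank.update(["truck_c"], [0.7])  # the oldest frame is evicted here
memory_tokens = bank.tokens()
```

A bounded queue keeps the cost of temporal fusion proportional to the number of stored queries rather than to the size of a dense feature map, which is the property the sparse paradigm relies on.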
In this way, SparseAD’s sparse perception module achieves efficient and accurate perception of the driving scene, providing an important information basis for subsequent motion planning. At the same time, by utilizing historical information in the memory bank, the module can further improve the accuracy and stability of perception and ensure the reliable operation of the autonomous driving system.
For obstacle perception, joint detection and tracking are performed within a unified decoder without any additional manual post-processing. However, there is a significant imbalance between detection and tracking queries, which can lead to a significant degradation in detection performance. To alleviate this problem, obstacle perception is improved from several angles. First, a two-level memory mechanism is introduced to propagate temporal information across frames: scene-level memory maintains query information without cross-frame association, while instance-level memory maintains the correspondence of tracked obstacles between adjacent frames. Second, given their different origins and tasks, different update strategies are adopted for the scene-level and instance-level memories. Specifically, scene-level memory is updated via MLN, while instance-level memory is updated with the future prediction for each obstacle. Furthermore, during training, an augmentation strategy is applied to the tracking queries to balance the supervision between the two levels of memory, thereby enhancing both detection and tracking performance. Afterwards, through the detection and tracking head, a 3D bounding box with attributes and a unique ID can be decoded from each detection or tracking query and used in downstream tasks.
Online map construction is a complex and important task. To the best of our knowledge, existing online map construction methods mostly rely on dense bird's-eye view (BEV) features to represent the driving environment. This approach has difficulty extending the perception range or leveraging historical information because it requires large amounts of memory and computing resources. We believe that all map elements can be represented in a sparse manner, and therefore attempt to complete online map construction under the sparse paradigm. Specifically, the same temporal decoder structure as in the obstacle perception task is adopted. Initially, map queries with prior categories are initialized to be uniformly distributed on the driving plane. In the temporal decoder, map queries interact with sensor tokens and historical memory tokens, where the historical memory tokens are composed of high-confidence map queries from previous frames. The updated map queries then carry valid information about the map elements of the current frame and can be pushed to the memory bank for use in future frames or downstream tasks.
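One way to read the initialization step is as a regular grid of reference points over the driving plane, with one query per prior category at each point. The sketch below illustrates that reading only; the function name, the category names, and the grid layout are hypothetical, and the 51.2 m half-range is chosen to match the 102.4m × 102.4m range mentioned in the experiments.

```python
def init_map_query_anchors(cells_per_axis, categories, half_range=51.2):
    """Place one reference point per category at the centre of each grid cell.

    The grid layout and category names are assumptions made for illustration.
    """
    step = 2.0 * half_range / cells_per_axis
    anchors = []
    for i in range(cells_per_axis):
        for j in range(cells_per_axis):
            x = -half_range + (i + 0.5) * step
            y = -half_range + (j + 0.5) * step
            for category in categories:
                anchors.append({"xy": (x, y), "category": category})
    return anchors


# 4 x 4 cells and three hypothetical categories give 48 map queries.
anchors = init_map_query_anchors(4, ("divider", "boundary", "crossing"))
```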
The process of online map construction is thus roughly the same as that of obstacle perception. That is, perception tasks including detection, tracking, and online map construction are unified into a common sparse approach that is more efficient when scaling to larger ranges (e.g., 100m × 100m) or to long-term fusion, and does not require any complex operations (such as deformable attention or multi-point attention). To the best of our knowledge, this is the first work to implement online map construction in a unified perception architecture in a sparse manner. Subsequently, a piecewise Bezier map head regresses the piecewise Bezier control points of each sparse map element, and these control points can be easily transformed to meet the requirements of downstream tasks.
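As an illustration of how control points can be transformed for downstream use, the sketch below samples a polyline from a piecewise Bezier curve using de Casteljau's algorithm. It assumes 2D points and that each piece is given as its own list of control points; the function names and the example coordinates are hypothetical and do not come from the paper.

```python
def de_casteljau(control_points, t):
    """Evaluate one Bezier piece at parameter t in [0, 1] by repeated linear interpolation."""
    points = [tuple(p) for p in control_points]
    while len(points) > 1:
        points = [
            tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]


def sample_piecewise_bezier(pieces, samples_per_piece=5):
    """Convert a list of Bezier pieces into a single polyline."""
    polyline = []
    for index, piece in enumerate(pieces):
        for k in range(samples_per_piece + 1):
            if index > 0 and k == 0:
                continue  # skip the joint already emitted by the previous piece
            polyline.append(de_casteljau(piece, k / samples_per_piece))
    return polyline


# Two cubic pieces (a hypothetical lane boundary) joined at (4, 0).
pieces = [
    [(0.0, 0.0), (1.0, 1.0), (3.0, 1.0), (4.0, 0.0)],
    [(4.0, 0.0), (5.0, -1.0), (7.0, -1.0), (8.0, 0.0)],
]
polyline = sample_piecewise_bezier(pieces, samples_per_piece=4)
```

A handful of control points per element is a compact regression target, and a polyline at any desired resolution can be sampled from it afterwards.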
We re-examined the problem of motion prediction and planning in autonomous driving systems and found that many previous methods ignore the dynamics of the ego vehicle when predicting the motion of surrounding vehicles. While this may not matter in most situations, it can be a risk in scenarios such as intersections, where there is close interaction between nearby vehicles and the ego vehicle. Inspired by this, a more reasonable motion planning framework was designed. In this framework, the motion predictor predicts the motion of the surrounding vehicles and the ego vehicle simultaneously. The prediction result for the ego vehicle is then used as a motion prior in the subsequent planning optimizer. During planning, we consider several kinds of constraints to produce a final planning result that meets both safety and dynamics requirements.
As shown in Figure 4, the motion planner in SparseAD treats perception queries (including track queries and map queries) as a sparse representation of the current driving scene. Multi-modal motion queries are used as a medium to understand the driving scene, to perceive the interactions between all vehicles (including the ego vehicle), and to game among different future possibilities. The ego vehicle's multi-modal motion query is then fed into a planning optimizer, which takes into account driving constraints including high-level commands, safety, and dynamics.
Motion Predictor. Following previous methods, the perception and integration between motion queries and the current driving scene representation (including track queries and map queries) are achieved through standard transformer layers. In addition, ego-agent and cross-modal interactions are applied to jointly model the interaction between surrounding agents and the ego vehicle in future spatio-temporal scenes. Through the synergy of modules within and between the stacked layers, motion queries are able to aggregate rich semantic information from both static and dynamic environments.
In addition to the above, two strategies are introduced to further improve the performance of the motion predictor. First, a simple and straightforward prediction is made from the instance-level temporal memory of the track query and used as part of the initialization of the surrounding agents' motion queries. In this way, the motion predictor benefits from prior knowledge gained in upstream tasks. Second, thanks to the end-to-end memory bank, useful information can be assimilated from the saved historical motion queries in a streaming manner through the agent memory aggregator at almost negligible cost.
It should be noted that the multi-modal motion query of the ego vehicle is updated at the same time. In this way, the motion prior of the ego vehicle is obtained, which further facilitates the learning of planning.
Planning Optimizer. With the motion prior provided by the motion predictor, a better initialization is obtained, resulting in fewer detours during training. As a key component of the motion planner, the design of the cost function is crucial, as it greatly affects or even determines the quality of the final result. In the proposed SparseAD motion planner, two major constraints, safety and dynamics, are considered in order to generate satisfactory planning results. Specifically, in addition to the constraints defined in VAD, the planner also focuses on the dynamic safety relationship between the ego vehicle and nearby agents and considers their relative positions at future moments. For example, if agent i continues to remain in the front-left area relative to the ego vehicle, thereby preventing the ego vehicle from changing lanes to the left, then agent i is given a left label, indicating that agent i imposes a leftward constraint on the ego vehicle. Constraints are therefore classified as front, back, or none in the longitudinal direction, and as left, right, or none in the lateral direction. In the planner, we decode the relationship between other agents and the ego vehicle in the lateral and longitudinal directions from the corresponding query, which amounts to predicting the probabilities of all constraints between other agents and the ego vehicle in these directions. We then use focal loss as the cost function of the Ego-Agent Relationship (EAR) to capture the potential risks posed by nearby agents.
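The lateral half of the ego-agent relationship can be sketched as a label assignment followed by a focal loss on the predicted class probabilities. This is an illustration under stated assumptions: the ego-frame convention (x forward, y left), the distance thresholds, and the rule that an agent must stay in the band at every future step are hypothetical, and the loss is the standard multi-class focal loss with the class-balancing weight omitted, not necessarily the exact form used in SparseAD.

```python
import math

LATERAL_CLASSES = ("left", "right", "none")


def lateral_label(relative_positions, max_dx=10.0, min_dy=1.0, max_dy=5.0):
    """Assign a lateral constraint label from an agent's future positions in the ego frame.

    Assumed convention: x points forward, y points left, units are metres.
    The thresholds are hypothetical and only serve the example.
    """
    if not relative_positions:
        return "none"

    def in_band(dx, dy, sign):
        return abs(dx) <= max_dx and min_dy <= sign * dy <= max_dy

    if all(in_band(dx, dy, +1) for dx, dy in relative_positions):
        return "left"
    if all(in_band(dx, dy, -1) for dx, dy in relative_positions):
        return "right"
    return "none"


def focal_loss(probabilities, target_index, gamma=2.0):
    """Standard multi-class focal loss: -(1 - p_t)^gamma * log(p_t)."""
    p_t = probabilities[target_index]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Agent i stays ahead and to the left of the ego vehicle over three future steps.
label = lateral_label([(6.0, 3.2), (5.5, 3.0), (5.0, 3.1)])
target = LATERAL_CLASSES.index(label)

confident = focal_loss([0.9, 0.05, 0.05], target)
uncertain = focal_loss([0.4, 0.3, 0.3], target)
```

The focusing term `(1 - p_t) ** gamma` shrinks the loss on well-classified agents, so the many agents labelled `none` contribute less than the few that actually constrain the ego vehicle.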
Since the planned trajectory must obey the dynamic laws under which the control system executes it, auxiliary tasks are embedded in the motion planner to promote learning of the ego vehicle's dynamic state. States such as speed, acceleration, and yaw angle are decoded from the ego query Qego, and a dynamics loss is used to supervise these states.
Extensive experiments were conducted on the nuScenes dataset to demonstrate the effectiveness and superiority of the method. For a fair comparison, performance on every task is evaluated and compared with previous methods. The experiments in this section use three configurations of SparseAD: SparseAD-B and SparseAD-L, which use only image input, and SparseAD-BR, which uses multi-modal input from radar point clouds and images. Both SparseAD-B and SparseAD-BR use V2-99 as the image backbone network with an input image resolution of 1600 × 640. SparseAD-L instead uses ViT-Large as the image backbone network with an input image resolution of 1600 × 800.
The 3D detection and 3D multi-object tracking results on the nuScenes validation set are as follows. "Tracking-only methods" refers to methods that track through post-processing association. "End-to-end autonomous driving methods" refers to methods capable of full-stack autonomous driving tasks. All methods in the table are evaluated with full-resolution image input. †: Results reproduced with the official open-source code. -R: Radar point cloud input is used.
The performance comparison with online mapping methods is as follows. The results are evaluated under thresholds of [1.0m, 1.5m, 2.0m]. ‡: Results reproduced with the official open-source code. †: Based on the needs of the planning module in SparseAD, we further subdivided the boundary into road segments and lanes and evaluated them separately. ∗: Cost of the backbone network and the sparse perception module. -R: Radar point cloud input is used.
Obstacle Perception. The detection and tracking performance of SparseAD is compared with other methods on the nuScenes validation set in Tab. 2. SparseAD-B outperforms most popular detection-only, tracking-only, and end-to-end multi-object tracking methods, while performing on par with SOTA methods such as StreamPETR and QTrack on the corresponding tasks. By scaling up with a more advanced backbone network, SparseAD-L achieves better overall performance, with an mAP of 53.6%, an NDS of 62.5%, and an AMOTA of 60.6%, which is better overall than the previous best method, Sparse4Dv3.
Online mapping. Tab. 3 compares the online mapping performance of SparseAD with previous methods on the nuScenes validation set. Note that, according to planning needs, we subdivided the boundary into road segments and lanes and evaluated them separately, while extending the range from the usual 60m × 30m to 102.4m × 102.4m to be consistent with obstacle perception. Without loss of fairness, SparseAD achieves 34.2% mAP in a sparse end-to-end manner without any dense BEV representation, which is better than most previously popular methods such as HDMapNet, VectorMapNet, and MapTR, with clear advantages in both performance and training cost. Although the performance is slightly inferior to StreamMapNet, our method demonstrates that online mapping can be done in a unified sparse manner without any dense BEV representation, which is significant for the practical deployment of end-to-end autonomous driving at significantly lower cost. Admittedly, how to effectively use information from other modalities (such as radar) remains a task worthy of further exploration, and we believe there is still much room to explore in the sparse setting.
Motion prediction. The comparison results for motion prediction are shown in Tab. 4a, where the metrics follow ViP3D. SparseAD achieves the best performance among all end-to-end methods, with the lowest minADE (0.83m), minFDE (1.58m), and miss rate (18.7%), and the highest EPA (0.308), a clear margin over previous methods. In addition, thanks to the efficiency and scalability of the sparse query-centric paradigm, SparseAD can be extended to more modalities and benefits from a more advanced backbone network, which further improves prediction performance significantly.
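For readers unfamiliar with these metrics, the sketch below computes minADE and minFDE for a single agent with K predicted modes, together with a per-agent miss flag. It is a simplified illustration: the 2.0 m miss threshold is an assumption about the evaluation protocol, the function names are hypothetical, and EPA is not covered.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Per-timestep Euclidean distance between one predicted mode and the ground truth."""
    return [math.dist(p, g) for p, g in zip(predicted, ground_truth)]


def min_ade_fde(modes, ground_truth, miss_threshold=2.0):
    """Return (minADE, minFDE, missed) over K predicted modes for one agent.

    minADE and minFDE are each minimised over the modes independently.
    The miss threshold is an assumed value, in metres.
    """
    ades, fdes = [], []
    for mode in modes:
        errors = displacement_errors(mode, ground_truth)
        ades.append(sum(errors) / len(errors))  # average over timesteps
        fdes.append(errors[-1])                 # error at the final timestep
    min_fde = min(fdes)
    return min(ades), min_fde, min_fde > miss_threshold


ground_truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
modes = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.5)],  # close to the ground truth
    [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)],  # drifts away
]
min_ade, min_fde, missed = min_ade_fde(modes, ground_truth)
```

Dataset-level numbers such as those in Tab. 4a are obtained by averaging these per-agent values, with the miss rate being the fraction of agents flagged as missed.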
Planning. The planning results are presented in Tab. 4b. Thanks to the design of the upstream perception module and the motion planner, all versions of SparseAD achieve state-of-the-art performance on the nuScenes validation set. Specifically, SparseAD-B achieves the lowest average L2 error and collision rate compared to all other methods, including UniAD and VAD, which demonstrates the superiority of our approach and architecture. As with the upstream tasks of obstacle perception and motion prediction, SparseAD further improves performance with radar or a more powerful backbone network.