Based on LiDAR point cloud point 3D Object Detection is a very classic problem. Both academia and industry have proposed various models to improve accuracy, speed and robustness. However, due to the complex outdoor environment, the performance of Object Detection for outdoor point clouds is not very good. Lidar point clouds are sparse in nature. How to solve this problem in a targeted manner? The paper gives its own answer: extract information based on the aggregation of time series information.
This paper mainly discusses an important challenge facing autonomous driving: how to accurately establish the surrounding environment three-dimensional representation. This is critical to ensuring the reliability and safety of autonomous vehicles. In particular, autonomous vehicles need to be able to recognize surrounding objects, such as vehicles and pedestrians, and accurately determine their location, size, and orientation. Typically, people use deep neural networks to process LiDAR data to accomplish this task.
Current research mainly focuses on single-frame methods, which use data from one sensor scan at a time. This method performs well on classic benchmarks, detecting objects at distances up to 75 meters. However, the sparseness of lidar point clouds is particularly evident at long ranges. Therefore, the researchers believe that relying solely on a single scan for long-distance detection is not enough, for example, up to a distance of 200 meters. Therefore, future research needs to focus on addressing this challenge.
To solve this problem, one method is to use point cloud aggregation, which is to concatenate a series of lidar scan data to obtain a denser input. However, this approach is computationally expensive and does not take full advantage of aggregation within the network. To reduce computational costs and better utilize information, consider using recursive methods. Recursive methods accumulate information over time and produce more accurate outputs by iteratively fusing the current input with previous aggregated results. This method can not only improve calculation efficiency, but also effectively utilize historical information to improve prediction accuracy. Recursive methods have wide applications in point cloud aggregation problems and have achieved satisfactory results.
The article also mentioned that in order to increase the detection range, some advanced operations can be adopted, such as sparse convolution, attention module and 3D convolution. However, these operations usually ignore the compatibility issues of the target hardware. When deploying and training neural networks, the hardware used often differs significantly in supported operations and latency. For example, target hardware such as Nvidia Orin DLA often does not support operations such as sparse convolution or attention. Additionally, using layers such as 3D convolutions is often not feasible due to real-time latency requirements. This emphasizes the need to use simple operations such as 2D convolution.
The paper proposes a new temporal recursive model, TimePillars, which respects the set of operations supported on common target hardware, relies on 2D convolution, is based on point-pillar (Pillar) input representation and a convolution recursion unit. Self-motion compensation is applied to the hidden state of the recurrent unit with the help of a single convolution and auxiliary learning. The use of auxiliary tasks to ensure the correctness of this manipulation has been shown to be appropriate through ablation studies. The paper also investigates the optimal placement of the recursive module in the pipeline and clearly shows that placing it between the backbone of the network and the detection head results in the best performance. On the newly released Zenseact Open Dataset (ZOD), the paper demonstrates the effectiveness of the TimePillars method. Compared to single-frame and multi-frame point-and-pillar baselines, TimePillars achieves significant evaluation performance improvements, especially at long-range (up to 200 meters) detection in the important cyclist and pedestrian categories. Finally, TimePillars have significantly lower latency than multi-frame point pillars, making them suitable for real-time systems.
This paper proposes a new temporal recursive model called TimePillars to solve the 3D lidar object detection task and considers the set of operations supported by common target hardware. Experiments have proven that TimePillars achieves significantly better performance than single-frame and multi-frame point-pillar baselines in long-distance detection. In addition, the paper also benchmarks a 3D lidar object detection model on the Zenseact open dataset for the first time. However, limitations of the paper are that it only focuses on lidar data, does not consider other sensor inputs, and bases its approach on a single state-of-the-art baseline. Nonetheless, the authors believe that their framework is general, i.e., future improvements to the baseline will translate into overall performance improvements.
In the "Input Preprocessing" section of this paper, the author uses a technique called "pillarization" to process the input points Cloud data. Different from conventional voxelization, this method segments the point cloud into vertical columnar structures, segmenting only in the horizontal direction (x and y axes) while maintaining a fixed height in the vertical direction (z axis). The advantage of this processing method is that it can maintain the consistency of the network input size and can use 2D convolution for efficient processing. In this way, point cloud data can be processed efficiently, providing more accurate and reliable input for subsequent tasks.
However, one problem with Pillarisation is that it produces many empty columns, resulting in very sparse data. To solve this problem, the paper proposes the use of dynamic voxelization technology. This technique avoids the need to have a predefined number of points for each column, thereby eliminating the need for truncation or filling operations on each column. Instead, the entire point cloud data is processed as a whole to match the required total number of points, here set to 200,000 points. The benefit of this preprocessing method is that it minimizes the loss of information and makes the generated data representation more stable and consistent.
Then for the Model architecture, the author introduced in detail a pillar feature encoder (Pillar Feature Encoder), 2D convolutional neural network (CNN) backbone and a neural network architecture composed of detection heads.
In this part of the paper, the author discusses how to process the hidden state features output by the convolutional GRU, which are previously Represented by the coordinate system of a frame. If stored directly and used to calculate the next prediction, a spatial mismatch will occur due to ego-motion.
In order to perform the conversion, different techniques can be applied. Ideally, the corrected data would be fed into the network rather than transformed within the network. However, this is not the method proposed in the paper, as it requires resetting the hidden states at each step in the inference process, transforming the previous point clouds, and propagating them throughout the network. Not only is this inefficient, it defeats the purpose of using RNNs. Therefore, in a loop context, compensation needs to be done at the feature level. This makes the hypothetical solution more efficient, but also makes the problem more complex. Traditional interpolation methods can be used to obtain features in transformed coordinate systems.
In contrast, the paper, inspired by the work of Chen et al., proposes to use convolution operations and auxiliary tasks to perform transformations. Considering the limited details of the aforementioned work, the paper proposes a customized solution to this problem.
The approach taken by the paper is to provide the network with the information needed to perform feature transformation through an additional convolutional layer. The relative transformation matrix between two consecutive frames is first calculated, i.e. the operations required to successfully transform features. Then, extract the 2D information (rotation and translation part) from it:
This simplification avoids the main matrix constants and works in the 2D (pseudo-image) domain, reducing 16 values to 6. The matrix is then flattened and expanded to match the shape of the hidden features to be compensated . The first dimension represents the number of frames that need to be converted. This representation makes it suitable for concatenating each potential pillar in the channel dimension of the hidden feature.
Finally, the hidden state features are fed into a 2D convolutional layer, which is adapted to the transformation process. A key aspect to note is that performing a convolution does not guarantee that the transformation will take place. Channel concatenation simply provides the network with additional information about how the transformation might be performed. In this case, the use of assisted learning is appropriate. During training, an additional learning objective (coordinate transformation) is added in parallel with the main objective (object detection). An auxiliary task is designed whose purpose is to guide the network through the transformation process under supervision to ensure the correctness of the compensation. The auxiliary task is limited to the training process. Once the network learns to transform features correctly, it loses its applicability. Therefore, this task is not considered during inference. In the next section further experiments will be conducted to compare the impact.
The experimental results show that the TimePillars model performs well when processing the Zenseact Open Dataset (ZOD) frame data set, especially This is when dealing with ranges up to 120 meters. These results highlight the performance differences of TimePillars under different motion transformation methods and compare with other methods.
After comparing the benchmark model PointPillars and multi-frame (MF) PointPillars, it can be seen that TimePillars has achieved significant improvements in multiple key performance indicators. Especially on NuScenes Detection Score (NDS), TimePillars demonstrates a higher overall score, reflecting its advantages in detection performance and positioning accuracy. In addition, TimePillars also achieved lower values in average conversion error (mATE), average scale error (mASE) and average orientation error (mAOE), indicating that it is more precise in positioning accuracy and orientation estimation. Of particular note is that the different implementations of TimePillars in terms of motion conversion have a significant impact on performance. When using convolution-based motion transformation (Conv-based), TimePillars performs particularly well on NDS, mATE, mASE, and mAOE, proving the effectiveness of this method in motion compensation and improving detection accuracy. In contrast, TimePillars using the interpolation method also outperforms the baseline model, but is inferior to the convolution method in some indicators. The average precision (mAP) results show that TimePillars performs well in the detection of vehicles, cyclists and pedestrian categories, especially when dealing with more challenging categories such as cyclists and pedestrians, its performance improvement is more significant. From the perspective of processing frequency (f (Hz)), although TimePillars are not as fast as single-frame PointPillars, they are faster than multi-frame PointPillars while maintaining high detection performance. This shows that TimePillars can effectively perform long-distance detection and motion compensation while maintaining real-time processing. In other words, the TimePillars model shows significant advantages in long-distance detection, motion compensation, and processing speed, especially when processing multi-frame data and using convolution-based motion conversion technology. These results highlight the application potential of TimePillars in the field of 3D lidar object detection for autonomous vehicles.
The above experimental results show that the TimePillars model performs excellently in object detection performance in different distance ranges, especially compared with the benchmark model PointPillars. These results are divided into three main detection ranges: 0 to 50 meters, 50 to 100 meters and above 100 meters.
First of all, NuScenes Detection Score (NDS) and average precision (mAP) are the overall performance indicators. TimePillars outperforms PointPillars on both metrics, showing overall higher detection capabilities and positioning accuracy. Specifically, TimePillars' NDS is 0.723, which is much higher than PointPillars' 0.657; in terms of mAP, TimePillars also significantly surpasses PointPillars' 0.475 with 0.570.
In the performance comparison within different distance ranges, it can be seen that TimePillars performs better in each range. For the vehicle category, the detection accuracy of TimePillars in the ranges of 0 to 50 meters, 50 to 100 meters and more than 100 meters is 0.884, 0.776 and 0.591 respectively, which are all higher than the performance of PointPillars in the same range. This shows that TimePillars has higher accuracy in vehicle detection, both at close and far distances. TimePillars also demonstrated better detection performance when dealing with vulnerable vehicles (such as motorcycles, wheelchairs, electric scooters, etc.). Especially in the range of more than 100 meters, the detection accuracy of TimePillars is 0.178, while PointPillars is only 0.036, showing significant advantages in long-distance detection. For pedestrian detection, TimePillars also showed better performance, especially in the range of 50 to 100 meters, with a detection accuracy of 0.350, while PointPillars was only 0.211. Even at longer distances (more than 100 meters), TimePillars still achieves a certain level of detection (accuracy of 0.032), while PointPillars perform zero at this range.
These experimental results highlight the superior performance of TimePillars in handling object detection tasks in different distance ranges. Whether at close range or at the more challenging long range, TimePillars provide more accurate and reliable detection results, which are critical to the safety and efficiency of autonomous vehicles.
First of all, the main advantage of the TimePillars model is its effectiveness for long-distance object detection. By adopting dynamic voxelization and convolutional GRU structure, the model is better able to handle sparse lidar data, especially in long-distance object detection. This is critical for the safe operation of autonomous vehicles in complex and changing road environments. In addition, the model also shows good performance in terms of processing speed, which is essential for real-time applications. On the other hand, TimePillars adopts a convolution-based method for Motion Compensation, which is a major improvement over traditional methods. This approach ensures the correctness of the transformation through auxiliary tasks during training, improving the accuracy of the model when handling moving objects.
However, the research of the paper also has some limitations. First, while TimePillars performs well at handling distant object detection, this performance increase may come at the expense of some processing speed. While the speed of the model is still suitable for real-time applications, it is still a decrease compared to single-frame methods. In addition, the paper mainly focuses on LiDAR data and does not consider other sensor inputs, such as cameras or radars, which may limit the application of the model in more complex multi-sensor environments.
That is to say, TimePillars has shown significant advantages in 3D lidar object detection for autonomous vehicles, especially in long-distance detection and Motion Compensation. Despite the slight trade-off in processing speed and limitations in processing multi-sensor data, TimePillars still represents an important advance in this field.
This work demonstrates that considering past sensor data is superior to utilizing only current information. Accessing previous driving environment information can cope with the sparse nature of lidar point clouds and lead to more accurate predictions. We demonstrate that recurrent networks are suitable as a means to achieve the latter. Giving the system memory leads to a more robust solution compared to point cloud aggregation methods that create denser data representations through extensive processing. The method we proposed, TimePillars, implements a way to solve the recursive problem. By simply adding three additional convolutional layers to the inference process, we demonstrate that basic network building blocks are sufficient to achieve significant results and ensure that existing efficiency and hardware integration specifications are met. To the best of our knowledge, this work provides the first benchmark results for the 3D object detection task on the newly introduced Zenseact open dataset. We hope our work can contribute to safer, more sustainable roads in the future.
The above is the detailed content of TimePillars: Where can the pure LiDAR 3D detection route be extended? Direct coverage of 200m!. For more information, please follow other related articles on the PHP Chinese website!