


UniPAD: A universal pre-training model for autonomous driving that supports various perception tasks
Recently, new papers have been appearing faster than I can keep up with. It is clear that fusing multi-modal large models for language and vision has become an industry consensus. This paper on UniPAD is fairly representative: it takes multi-modal input, pre-trains a world-model-like base model, and can easily be extended to many traditional vision applications. It also addresses how to transfer the pre-training recipe of large language models to 3D scenes, opening up the possibility of a unified foundation model for perception.
UniPAD is a self-supervised learning method based on masked autoencoding (MAE) and 3D differentiable rendering. It first trains a strong base model, which is then fine-tuned for downstream tasks such as depth estimation, object detection, and segmentation. The study designs a unified 3D spatial representation that can be integrated easily into both 2D and 3D frameworks, giving it the flexibility expected of a foundation model.
Questions and thoughts while reading:
What is the relationship between masked autoencoding and 3D differentiable rendering? Put simply, masked autoencoding exploits the autoencoder's self-supervised training capability, while the rendering step produces images (or depth) whose loss against the original, unmasked data provides the supervision signal. The logic is therefore quite clear.
The paper pre-trains a base model and then fine-tunes it for downstream detection and segmentation, which also helps in understanding how current large models are coupled with downstream tasks.
It appears that temporal information is not incorporated. After all, the pure-vision NDS of 50.2 on nuScenes is still weaker than temporal detection methods (StreamPETR, Sparse4D, etc.). A 4D MAE approach is therefore also worth trying; in fact, GAIA-1 has already hinted at a similar idea.
What about the computational cost and memory usage?
Specific method:
UniPAD implicitly encodes 3D spatial information, drawing mainly on masked autoencoding (MAE, Voxel-MAE, etc.). The paper uses a generative mask to augment voxel features, which are then used to reconstruct the continuous 3D shape structure of the scene together with its complex appearance on the 2D image plane.
The experimental results demonstrate the superiority of UniPAD: compared with lidar-only, camera-only, and lidar-camera fusion baselines, NDS improves by 9.1, 7.7, and 6.9 points respectively. Notably, on the nuScenes validation set the pre-training pipeline reaches 73.2 NDS, and 79.4 mIoU on the 3D semantic segmentation task, the best results compared with previous methods.
Overall architecture:
The framework takes LiDAR point clouds and multi-view images as input; the mask generator zeroes out regions of this multi-modal data. The masked embeddings are then converted into voxel space, and rendering techniques generate RGB or depth predictions from this 3D space. The original data not occluded by the mask then serves as the ground truth for supervised learning.
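To make the data flow concrete, here is a minimal sketch of the pre-training pipeline as described above. All module names and interfaces (mask generator, encoder, voxel lifting, renderer) are placeholders I introduce for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the pre-training data flow described above. The module
# names and interfaces are placeholders for illustration, not the authors' code.
import torch.nn as nn

class UniPADPretrainSketch(nn.Module):
    def __init__(self, mask_generator, encoder, to_voxel, renderer):
        super().__init__()
        self.mask_generator = mask_generator  # block / sparse masking per modality
        self.encoder = encoder                # modality-specific backbones
        self.to_voxel = to_voxel              # lift features into a 3D voxel volume
        self.renderer = renderer              # differentiable ray renderer

    def forward(self, points, images, rays):
        # 1) occlude parts of the input; the originals become supervision targets
        masked_points, masked_images, mask = self.mask_generator(points, images)
        # 2) encode the masked inputs (masked regions are zeroed and skipped)
        feats = self.encoder(masked_points, masked_images)
        # 3) unify both modalities in a shared 3D voxel volume
        voxels = self.to_voxel(feats)
        # 4) render RGB and depth predictions along sampled rays
        rgb_pred, depth_pred = self.renderer(voxels, rays)
        return rgb_pred, depth_pred, mask
```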
Mask Generator
The mask in the masked autoencoder is produced by a mask generator; intuitively, it raises the difficulty of training and thereby improves the model's representation and generalization ability. The mask generator treats point-cloud and image data differently when selectively occluding regions: block-wise masking is applied to the point cloud, while for images sparse convolutions are used so that computation happens only in visible regions. Once the input is masked, the corresponding regions of the encoded features are set to zero and ignored during processing. The masked regions also supply the prediction targets and the corresponding ground truth for the subsequent supervised learning.
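As a concrete illustration of block masking, here is a minimal sketch that zeroes out random square blocks of a dense feature grid. UniPAD's actual masking distinguishes the point-cloud and image branches, so treat the function and its parameters as hypothetical.

```python
# A minimal sketch of block-wise masking on a dense B x C x H x W feature grid,
# assuming H and W are divisible by the block size; illustrative only.
import torch

def block_mask(features: torch.Tensor, block: int = 8, mask_ratio: float = 0.3):
    """Zero out randomly chosen block x block regions; return masked features and
    a boolean mask (True = masked) marking the regions to be reconstructed."""
    B, C, H, W = features.shape
    assert H % block == 0 and W % block == 0
    gh, gw = H // block, W // block
    # decide per block whether it is masked, then upsample to pixel resolution
    coarse = torch.rand(B, 1, gh, gw, device=features.device) < mask_ratio
    mask = coarse.repeat_interleave(block, dim=2).repeat_interleave(block, dim=3)
    return features.masked_fill(mask, 0.0), mask
```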
Unified representation
To make the pre-training method applicable to different data modalities, a unified representation is essential. Earlier approaches such as BEV and occupancy (OCC) also sought such a unified form. Projecting 3D points onto the image plane loses depth information, while collapsing them into a bird's-eye view discards height-related details. This paper therefore converts both modalities into a 3D volumetric space, i.e. a 3D voxel space similar to OCC.
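Below is a hedged sketch of one common way to lift 2D image features into such a voxel volume: project each voxel centre into a camera with a given projection matrix and bilinearly sample the feature map. The function name and projection convention are illustrative assumptions, not the paper's exact implementation.

```python
# A sketch, under simplifying assumptions, of lifting 2D image features into a
# shared voxel volume via per-voxel projection and bilinear sampling.
import torch
import torch.nn.functional as F

def lift_to_voxels(img_feats, cam_proj, voxel_centers):
    """img_feats: (C, Hf, Wf) feature map, cam_proj: (3, 4) projection matrix,
    voxel_centers: (N, 3) voxel-centre coordinates -> (N, C) sampled features."""
    N = voxel_centers.shape[0]
    homo = torch.cat([voxel_centers,
                      torch.ones(N, 1, device=voxel_centers.device)], dim=1)  # (N, 4)
    uvd = homo @ cam_proj.T                                      # (N, 3)
    depth = uvd[:, 2:3].clamp(min=1e-5)
    uv = uvd[:, :2] / depth                                      # pixel coordinates
    Hf, Wf = img_feats.shape[1:]
    grid = torch.stack([uv[:, 0] / (Wf - 1) * 2 - 1,             # normalise to [-1, 1]
                        uv[:, 1] / (Hf - 1) * 2 - 1], dim=-1)
    sampled = F.grid_sample(img_feats[None], grid[None, :, None, :],
                            align_corners=True)                  # (1, C, N, 1)
    feats = sampled[0, :, :, 0].T                                # (N, C)
    # zero out voxels that project behind the camera
    return torch.where(uvd[:, 2:3] > 0, feats, torch.zeros_like(feats))
```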
Rendering method:
Differentiable rendering is, according to the authors, the biggest highlight of the paper. As in NeRF, rays are sampled through the multi-view images or point cloud, the color or depth of each 3D sample point is predicted by a neural network, and the 2D outputs are obtained by integrating along each ray. This makes better use of the geometric and texture cues in images and improves the model's learning ability and range of application.
The scene is represented as an implicit signed distance function (SDF) field. Given the 3D coordinates P of a sample point (with corresponding depth D along the ray) and a feature embedding F (extracted from the volumetric representation by trilinear interpolation), an MLP predicts the SDF value of the sample point; F can be understood as the encoding of the location of P. From this we also obtain N (the surface normal, on which the color field is conditioned) and H (a geometry feature vector). Another MLP taking P, D, F, N, and H as input then outputs the RGB value and depth of each 3D sample point, and the samples are accumulated along the ray into 2D space to produce the rendered result. The use of rays here is essentially the same as in NeRF.
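The sketch below illustrates this rendering step under simplifying assumptions: one MLP maps the sample position and its trilinearly interpolated feature to an SDF value plus a geometry vector, the normal is taken as the SDF gradient, a second MLP predicts color, and a simplistic SDF-to-opacity conversion accumulates color and depth along each ray. The network sizes and the opacity formula are stand-ins, not the paper's exact formulation.

```python
# A simplified, hedged sketch of SDF-based ray rendering; architectures and the
# SDF-to-opacity conversion are illustrative and may differ from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RaySDFRenderer(nn.Module):
    def __init__(self, feat_dim=32, hidden=128, geo_dim=16):
        super().__init__()
        # SDF head: (P, F) -> (sdf, geometry feature H)
        self.sdf_mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + geo_dim))
        # color head: (P, D, F, N, H) -> RGB
        self.rgb_mlp = nn.Sequential(
            nn.Linear(3 + 1 + feat_dim + 3 + geo_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, pts, depths, volume_feats, inv_s=10.0):
        """pts: (R, S, 3) samples on R rays, depths: (R, S) sample depths,
        volume_feats: (R, S, feat_dim) features trilinearly sampled from voxels."""
        pts = pts.detach().requires_grad_(True)
        out = self.sdf_mlp(torch.cat([pts, volume_feats], dim=-1))
        sdf, geo = out[..., :1], out[..., 1:]
        # surface normal N = gradient of the SDF w.r.t. the sample position
        normal = torch.autograd.grad(sdf.sum(), pts, create_graph=True)[0]
        normal = F.normalize(normal, dim=-1)
        rgb = self.rgb_mlp(torch.cat(
            [pts, depths[..., None], volume_feats, normal, geo], dim=-1))
        # simplistic SDF -> opacity: opacity rises as the ray enters the surface;
        # the transmittance term then concentrates weights near the zero level set
        alpha = torch.sigmoid(-sdf.squeeze(-1) * inv_s)
        trans = torch.cumprod(torch.cat(
            [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-7], dim=1), dim=1)[:, :-1]
        weights = alpha * trans                          # (R, S)
        rgb_out = (weights[..., None] * rgb).sum(dim=1)  # (R, 3)
        depth_out = (weights * depths).sum(dim=1)        # (R,)
        return rgb_out, depth_out
```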
The rendering method also requires memory optimizations, which are not detailed here, although this is a fairly critical implementation issue.
The essence of the masking and rendering components is to train a pre-trained model; the model can be trained purely from predicting the masked content, even without the later branches. Afterwards, the pre-trained model generates RGB and depth predictions through separate branches and is fine-tuned for tasks such as object detection and semantic segmentation, achieving plug-and-play capability.
Loss function:
The loss function is not complicated: the rendered RGB and depth are supervised against the original, unmasked data.
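As a hedged sketch, the pre-training objective implied by the pipeline reduces to reconstructing rendered RGB and depth against the original data; the exact norms and loss weights used in the paper may differ.

```python
# A minimal sketch of the implied pre-training objective; weights/norms are assumptions.
import torch.nn.functional as F

def pretrain_loss(rgb_pred, rgb_gt, depth_pred, depth_gt, w_rgb=1.0, w_depth=1.0):
    loss_rgb = F.l1_loss(rgb_pred, rgb_gt)        # photometric reconstruction
    loss_depth = F.l1_loss(depth_pred, depth_gt)  # depth reconstruction (e.g. from LiDAR)
    return w_rgb * loss_rgb + w_depth * loss_depth
```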
Experimental results:
Comparison with other recent work:
GAIA-1 already uses the masked-autoencoder idea along the temporal axis, but its supervision comes from whole frames at different timestamps, whereas UniPAD randomly masks out portions of the 3D space and supervises their prediction. I am really looking forward to an approach that combines the two. In addition, UniPAD can be regarded as an attempt at a multi-modal large model, and arguably also as a world model, although the paper does not emphasize these points much.
Summary:
This paper should be regarded as a relatively new masked-autoencoder method for the 3D domain. Because MAE is used in the pre-training stage of the base model and multiple modalities are supported, the approach extends naturally to many downstream fine-tuning tasks. This is very close to the design philosophy of LLMs: capture multi-modal information in the pre-training stage and provide a unified basis for a variety of tasks. The method opens up new ideas and possibilities for 3D research; it can also be extended to the 4D temporal domain, and plenty of follow-up work is possible on optimizing its memory and compute cost.