Tesla is a quintessential AI company. It has trained 75,000 neural networks in the past year, which amounts to a new model roughly every 8 minutes, and a total of 281 models run on Tesla vehicles. Below, we interpret the progress of Tesla FSD's algorithms and models from several angles.
One of Tesla's key perception technologies this year is the Occupancy Network. Anyone who has studied robotics will be familiar with occupancy grids: occupancy indicates whether each 3D voxel in space is occupied, represented either as a binary 0/1 value or as a probability in [0, 1].
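To make the representation concrete, here is a toy sketch of a voxel occupancy grid in Python/NumPy; the extent, resolution, and origin are illustrative assumptions, not Tesla's actual values.

```python
import numpy as np

# A toy occupancy grid: one value per 10 cm voxel, either binary 0/1 or a
# probability in [0, 1]. Extent, resolution and origin are illustrative.
RESOLUTION = 0.1                                     # meters per voxel
ORIGIN = (-40.0, -40.0, -1.0)                        # metric position of voxel (0, 0, 0)
grid = np.zeros((800, 800, 60), dtype=np.float32)    # 80 m x 80 m x 6 m around the ego vehicle

def mark_occupied(grid, x, y, z, prob=1.0):
    """Write an occupancy probability at a metric (x, y, z) position."""
    i = int((x - ORIGIN[0]) / RESOLUTION)
    j = int((y - ORIGIN[1]) / RESOLUTION)
    k = int((z - ORIGIN[2]) / RESOLUTION)
    if 0 <= i < grid.shape[0] and 0 <= j < grid.shape[1] and 0 <= k < grid.shape[2]:
        grid[i, j, k] = prob

mark_occupied(grid, x=12.3, y=-4.5, z=0.8, prob=0.9)  # e.g. a point observed as occupied
```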
Why is occupancy estimation important for autonomous driving perception? For common obstacles such as vehicles and pedestrians, 3D object detection can already estimate positions and sizes. But there are also many long-tail obstacles that matter for driving:
1. Deformable obstacles, such as articulated (two-section) trailers, which are poorly represented by 3D bounding boxes;
2. Irregularly shaped obstacles, such as an overturned vehicle, for which 3D pose estimation breaks down;
3. Obstacles outside the known categories, such as stones or debris on the road, which cannot be classified.
We therefore want a better representation for these long-tail obstacles: fully estimating the occupancy of every position in 3D space, and even its semantics and motion (flow).
Tesla uses the specific example in the figure below to demonstrate the power of the Occupancy Network. Unlike 3D boxes, the occupancy representation makes few geometric assumptions about an object, so it can model objects of any shape and any form of motion. The figure shows a two-section articulated bus starting to move: blue represents moving voxels and red represents stationary voxels. The Occupancy Network correctly estimates that the first section of the bus has started to move while the second section is still at rest.
Occupancy estimation of a two-section bus starting to move; blue represents moving voxels, red represents stationary voxels
The model structure of the Occupancy Network is shown in the figure below. First, the model extracts features from multiple cameras using RegNet and BiFPN; this is consistent with the network structure shared at last year's AI Day, indicating that the backbone has not changed much. The model then performs attention-based multi-camera fusion, using spatial queries with 3D positions to attend over the 2D image features. How is the link between a 3D spatial query and a 2D feature map established? The figure does not detail the fusion method, but there are many public papers to draw on, and I think the most likely approach is one of two. The first is the 3D-to-2D query: project the 3D spatial query onto each 2D feature map using the camera intrinsics and extrinsics and extract the features at the corresponding positions. This method was proposed in DETR3D, and BEVFormer and PolarFormer adopt the same idea. The second is implicit mapping via positional embeddings: attach informative positional embeddings (camera intrinsics and extrinsics, pixel coordinates, etc.) to each position of the 2D feature map, and let the model learn the 2D-to-3D correspondence by itself. Next, the model performs temporal fusion, implemented by aligning and concatenating 3D feature volumes according to the known changes in the ego vehicle's position and attitude.
Occupancy Network structure
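As an illustration of the first option, here is a minimal sketch of DETR3D-style 3D-to-2D feature sampling, assuming per-camera intrinsics and ego-to-camera extrinsics are available. It is a reference sketch of the published idea, not Tesla's actual implementation.

```python
import torch
import torch.nn.functional as F

def sample_features_3d_to_2d(query_xyz, feat_maps, intrinsics, extrinsics):
    """DETR3D-style feature gathering: project 3D query points into every camera
    and sample the 2D feature maps at the projected pixel locations.

    query_xyz:  (Q, 3)        3D query positions in the ego frame
    feat_maps:  (N, C, H, W)  per-camera 2D feature maps
    intrinsics: (N, 3, 3)     camera intrinsic matrices
    extrinsics: (N, 4, 4)     ego-to-camera transforms
    returns:    (Q, C)        features averaged over the cameras that see each point
    """
    n_cam, c, h, w = feat_maps.shape
    q = query_xyz.shape[0]
    homo = torch.cat([query_xyz, torch.ones_like(query_xyz[:, :1])], dim=-1)  # (Q, 4)
    feats, valid = [], []
    for i in range(n_cam):
        cam_pts = (extrinsics[i] @ homo.T).T[:, :3]               # query points in camera frame
        in_front = cam_pts[:, 2] > 0.1                            # keep points in front of the camera
        pix = (intrinsics[i] @ cam_pts.T).T
        pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-5)            # perspective divide -> pixel coords
        grid = torch.stack([pix[:, 0] / w * 2 - 1,                # normalize to [-1, 1] for grid_sample
                            pix[:, 1] / h * 2 - 1], dim=-1)
        in_img = in_front & (grid.abs() <= 1).all(dim=-1)
        sampled = F.grid_sample(feat_maps[i:i + 1], grid.view(1, q, 1, 2),
                                align_corners=False).reshape(c, q).T          # (Q, C)
        feats.append(sampled * in_img[:, None].float())
        valid.append(in_img)
    count = torch.stack(valid).sum(0).clamp(min=1)[:, None]
    return torch.stack(feats).sum(0) / count
```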
After feature fusion, a deconvolution-based decoder predicts the occupancy, semantics, and flow of every 3D position. The presentation emphasized that because this output is dense, its resolution is limited by memory. This is a familiar headache for anyone doing image segmentation, and here it is 3D segmentation, while autonomous driving has very high resolution requirements (~10 cm). Therefore, inspired by neural implicit representations, an implicit, queryable MLP decoder is placed at the end of the model: given any coordinate (x, y, z), it decodes the information at that position, i.e. occupancy, semantics, and flow. This breaks the resolution limit of the model, and I consider it a highlight of the design.
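Here is a minimal sketch of what such an implicit queryable decoder could look like, assuming a Fourier positional encoding of the query coordinates and a per-query scene feature; all dimensions and the encoding choice are assumptions for illustration, not Tesla's actual design.

```python
import math
import torch
import torch.nn as nn

class ImplicitOccupancyDecoder(nn.Module):
    """Queryable MLP decoder: given a fused scene feature and an arbitrary (x, y, z)
    coordinate, predict occupancy, semantics and flow at that position."""

    def __init__(self, feat_dim=256, hidden=256, num_classes=16, num_freqs=8):
        super().__init__()
        self.num_freqs = num_freqs
        pe_dim = 3 * 2 * num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + pe_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.occ_head = nn.Linear(hidden, 1)             # occupancy probability (via sigmoid)
        self.sem_head = nn.Linear(hidden, num_classes)   # semantic class logits
        self.flow_head = nn.Linear(hidden, 3)            # 3D flow vector

    def positional_encoding(self, xyz):
        freqs = (2 ** torch.arange(self.num_freqs, device=xyz.device)) * math.pi
        angles = xyz[..., None] * freqs                  # (B, 3, F)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

    def forward(self, scene_feat, xyz):
        # scene_feat: (B, feat_dim) feature taken from the fused volume near each query
        # xyz:        (B, 3) continuous coordinates, not restricted to the voxel grid
        h = self.mlp(torch.cat([scene_feat, self.positional_encoding(xyz)], dim=-1))
        return torch.sigmoid(self.occ_head(h)), self.sem_head(h), self.flow_head(h)
```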
Planning is another important module of autonomous driving, and this time Tesla mainly emphasized modeling interactions at complex intersections. Why is interaction modeling so important? Because the future behavior of other vehicles and pedestrians is uncertain, a smart planning module needs to predict multiple possible interactions between the ego vehicle and other agents online, evaluate the risk of each interaction, and finally decide which strategy to pursue.
Tesla calls its planning model Interaction Search. It consists of three main steps: tree search, neural-network trajectory planning, and trajectory scoring.
1. Tree search is a commonly used algorithm for trajectory planning; it can effectively enumerate different interaction outcomes and find the optimal solution. The biggest difficulty in solving trajectory planning by search, however, is that the search space is too large. For example, at a complex intersection there may be 20 vehicles relevant to the ego vehicle, which can combine into more than 100 interaction patterns, and each interaction pattern may have dozens of candidate spatio-temporal trajectories. Therefore, instead of searching over trajectories directly, Tesla uses a neural network to score the target positions (goals) that could be reached after a period of time, keeping only a small number of promising goals.
2. After determining a goal, we need a trajectory to reach it. Traditional planning methods usually solve this with optimization, which is not hard per se — each optimization takes roughly 1 to 5 milliseconds — but when the previous step produces many candidate goals, the time cost becomes unacceptable. Tesla therefore proposed using another neural network for trajectory planning, enabling highly parallel planning over multiple candidate goals. The trajectory labels for training this network come from two sources: the first is real human driving trajectories; but since a human trajectory is only one of many good solutions, the second source is additional trajectories produced by offline optimization algorithms.
3. After obtaining a set of feasible trajectories, we need to pick the best one. The solution here is to score each candidate trajectory, combining hand-crafted risk metrics, comfort metrics, and a neural-network scorer (a minimal sketch of such a scoring step follows this list).
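The sketch below combines hand-crafted risk and comfort terms with a learned scorer. The particular metrics, weights, and network architecture are illustrative assumptions, not Tesla's actual design.

```python
import torch
import torch.nn as nn

class TrajectoryScorer(nn.Module):
    """A small learned scorer over flattened candidate trajectories (architecture is an assumption)."""
    def __init__(self, horizon=20):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(horizon * 2, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, trajs):                                 # trajs: (N, T, 2), T == horizon
        return self.net(trajs.flatten(1)).squeeze(-1)

def score_candidates(trajs, obstacle_preds, scorer, w_risk=1.0, w_comfort=0.1, w_learned=1.0):
    """Combine hand-crafted risk and comfort terms with a learned score (lower is better).

    trajs:          (N, T, 2) candidate ego trajectories (x, y over T timesteps)
    obstacle_preds: (M, T, 2) predicted positions of other agents over the same horizon
    """
    # Risk term: inverse of the closest approach to any agent at the same timestep.
    diff = trajs[:, None] - obstacle_preds[None]              # (N, M, T, 2)
    min_dist = diff.norm(dim=-1).amin(dim=(1, 2))             # (N,)
    risk = 1.0 / (min_dist + 1e-3)
    # Comfort term: mean second difference of position (a discrete proxy for acceleration).
    accel = trajs[:, 2:] - 2 * trajs[:, 1:-1] + trajs[:, :-2]
    comfort = accel.norm(dim=-1).mean(dim=-1)
    # Learned term: a network trained to rank trajectories.
    learned = scorer(trajs)
    return w_risk * risk + w_comfort * comfort + w_learned * learned

# Usage: pick the best of N candidates.
# best = trajs[score_candidates(trajs, obstacle_preds, TrajectoryScorer()).argmin()]
```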
By decoupling these three steps, Tesla has built an efficient trajectory planning module that takes interaction into account. There are not many papers to reference on neural-network-based trajectory planning. I published a closely related paper, TNT [5], which also decomposes the trajectory prediction problem into the same three steps: goal scoring, trajectory planning, and trajectory scoring. Interested readers can check out the details. In addition, our research group has been exploring behavioral interaction and planning; you are welcome to follow our latest work, InterSim [6].
Interaction Search Planning Model Structure
Personally, I think another major technical highlight of this AI Day is the online vector map construction model, Lanes Network. Those who followed last year's AI Day may remember that Tesla already performed full online map segmentation and recognition in BEV space. So why build Lanes Network at all? Because pixel-level segmented lanes are not enough for trajectory planning; we also need the topology of the lane lines to know how the vehicle can travel from one lane into another.
Let's first look at what a vector map is. As shown in the figure, Tesla's vector map consists of a series of blue lane centerlines and a set of key points (connection points, fork points, and merge points), with their connectivity expressed as a graph.
Vector map: the dots are the key points of the lane lines, and the blue curves are the lane centerlines
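As a toy illustration of this graph representation, a fragment of a vector map could be stored as key-point nodes plus directed centerline edges; the field names and values below are made up for illustration.

```python
# A toy fragment of a vector map: key points as graph nodes, lane centerline
# segments as directed edges.
vector_map = {
    "nodes": {
        0: {"xy": (0.0, 0.0),   "type": "start"},
        1: {"xy": (20.0, 0.0),  "type": "fork"},    # the lane splits into two here
        2: {"xy": (40.0, 3.5),  "type": "end"},
        3: {"xy": (40.0, -3.5), "type": "end"},
    },
    "edges": [(0, 1), (1, 2), (1, 3)],              # directed centerline segments
}

# The topology answers planning questions that pixel segmentation cannot,
# e.g. which nodes are directly reachable from node 0:
reachable = {b for a, b in vector_map["edges"] if a == 0}   # -> {1}
```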
In terms of model structure, Lanes Network is a decoder built on the backbone of the perception network. Compared with decoding the occupancy and semantics of each voxel, decoding a set of sparse, connected lane lines is harder, because the number of outputs is not fixed and there are logical relationships among them.
Tesla borrows the Transformer decoder from language models and outputs the results autoregressively, token by token. Concretely, one first chooses a generation order (such as left to right, top to bottom) and discretizes space (tokenization); Lanes Network then predicts a sequence of discrete tokens. As shown in the figure, the network first predicts a node's coarse position (index 18) and precise position (index 31), then the node's semantics ("Start", i.e. the starting point of a lane line), and finally its connection attributes, such as fork/merge flags and curvature parameters. The network generates all lane-line nodes in this autoregressive fashion.
Lanes Network structure
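Below is a minimal sketch of such an autoregressive lane-token decoder with greedy decoding; the vocabulary layout and all sizes are assumptions for illustration, not Tesla's actual design.

```python
import torch
import torch.nn as nn

class LaneTokenDecoder(nn.Module):
    """Autoregressive lane-graph decoding: emit a sequence of discrete tokens
    (coarse position, fine position, node type, connection attributes, ...),
    each conditioned on the perception features and all previously emitted tokens."""

    def __init__(self, vocab_size=1024, d_model=256, n_layers=4, n_heads=8, max_len=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    @torch.no_grad()
    def generate(self, scene_feats, start_token=0, steps=64):
        # scene_feats: (1, S, d_model) fused perception features used as cross-attention memory
        tokens = torch.tensor([[start_token]])
        for _ in range(steps):
            pos = torch.arange(tokens.shape[1]).unsqueeze(0)
            x = self.token_emb(tokens) + self.pos_emb(pos)
            causal = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
            h = self.decoder(x, scene_feats, tgt_mask=causal)
            next_tok = self.head(h[:, -1]).argmax(dim=-1, keepdim=True)   # greedy choice
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens   # e.g. [coarse_idx, fine_idx, node_type, connection_attr, ...]
```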
It should be noted that autoregressive sequence generation is not exclusive to language Transformer models. Our research group has published two related papers on generating vector maps in the past few years, HDMapGen [7] and VectorMapNet [8]. HDMapGen uses a graph attention network (GAT) to autoregressively generate the key points of the vector map, which is similar to Tesla's solution. VectorMapNet uses a Detection Transformer (DETR) with a set-prediction formulation to generate vector maps more quickly.
HDMapGen vector map generation result
VectorMapNet vector map generation results
Auto labeling is another technology Tesla explained at last year's AI Day. This year's automatic annotation focuses on auto-labeling for Lanes Network. Tesla vehicles generate 500,000 driving trips every day, and making good use of this driving data can greatly help lane-line prediction.
Tesla's lane auto-labeling has three steps:
1. High-precision trajectory estimation for all trips via visual-inertial odometry.
2. Multi-vehicle, multi-trip map reconstruction, the most critical step in this pipeline. The basic motivation is that different vehicles may observe the same location from different viewpoints and at different times, so aggregating this information yields a better map reconstruction. The technical points of this step include geometric matching between maps and joint optimization of the results.
3. Automatic lane labeling for new trips. With a high-precision offline map reconstruction in hand, a new trip only needs a simple geometric matching to obtain pseudo-labels for its lane lines (a minimal sketch of this step follows the list). Pseudo-labels obtained this way are sometimes even better than manual annotation, for example at night or in rain and fog.
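Here is a minimal sketch of that matching step under simple assumptions: lanes reconstructed offline in a world frame, and the new trip's ego pose estimated by visual-inertial odometry. It is illustrative only, not Tesla's actual pipeline.

```python
import numpy as np

def lane_pseudolabels_for_trip(lane_polylines_world, ego_pose_world, max_range=80.0):
    """Given lane centerlines reconstructed offline in a world frame and a new trip's
    estimated ego pose, transform nearby lanes into the ego frame as pseudo-labels.

    lane_polylines_world: list of (K, 3) arrays of lane centerline points (world frame)
    ego_pose_world:       (4, 4) ego-to-world transform from visual-inertial odometry
    """
    world_to_ego = np.linalg.inv(ego_pose_world)
    labels = []
    for poly in lane_polylines_world:
        homo = np.hstack([poly, np.ones((len(poly), 1))])             # (K, 4) homogeneous points
        ego_pts = (world_to_ego @ homo.T).T[:, :3]                     # lane points in the ego frame
        if np.linalg.norm(ego_pts[:, :2], axis=1).min() < max_range:   # keep lanes near the vehicle
            labels.append(ego_pts)
    return labels
```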
Lanes Network auto-labeling

05 Simulation

Simulation of visual imagery has been a popular direction in computer vision in recent years. In autonomous driving, the main purpose of visual simulation is to generate rare scenes in a targeted manner, so that we no longer have to rely on luck during real road tests — for example, the scene of a large truck lying across the middle of the road has always been a headache for Tesla. But visual simulation is not a simple problem: for a complex intersection (Market Street in San Francisco), traditional modeling and rendering takes a designer two weeks, whereas Tesla's AI-based solution now takes only 5 minutes.

Visual simulation: reconstructed intersection

Specifically, visual simulation requires automatically labeled real-world road information and a rich library of graphics assets as prerequisites, and then proceeds through the following steps:
1. Road surface generation: fill in the road surface within the curbs, including details such as road slope and material.
2. Lane line generation: draw the lane line information onto the road surface.
3. Plant and building generation: randomly generate and render plants and houses along and between the roads. This is not only for visual appeal; it also simulates the occlusions these objects cause in the real world.
4. Other road elements: traffic lights, street signs, and imported lanes with their connection relationships.
5. Dynamic elements such as vehicles and pedestrians.

06 Infrastructure

Finally, a brief word on the foundation of Tesla's entire software stack: powerful infrastructure. Tesla's supercomputing center has 14,000 GPUs and a total of 30 PB of data cache, with 500,000 new videos flowing in every day. To process this data more efficiently, Tesla developed an accelerated video decoding library as well as the .smol file format, which speeds up reading and writing of intermediate features. In addition, Tesla has developed its own chip, Dojo, for the supercomputing center, which we will not cover here.

Supercomputing center for video model training

07 Summary

Through the Tesla AI Day presentations of the past two years, we have gradually seen Tesla's technical landscape in autonomous (assisted) driving take shape. We have also seen Tesla continuously iterating on itself, from 2D perception to BEV perception to the Occupancy Network. Autonomous driving is a journey of a thousand miles. What supports the evolution of Tesla's technology? I think there are three things: full scene understanding from vision algorithms, model iteration speed backed by massive compute, and generalization brought by massive data. Are these not exactly the three pillars of the deep learning era?