In the past month, for reasons that are well known, I have had very intensive exchanges with many colleagues and peers in the industry. An unavoidable topic in these exchanges is end-to-end driving and the much-discussed Tesla FSD V12. I would like to take this opportunity to sort out my current thoughts and opinions for your reference and discussion.
According to the most traditional definition, an end-to-end system is one that takes raw sensor information as input and directly outputs the variables the task cares about. In image recognition, for example, a CNN can be called end-to-end when compared with the traditional pipeline of a feature extractor followed by a classifier. In autonomous driving, data from the various sensors (cameras/LiDAR/Radar/IMU, etc.) are the input, and the control signals for the vehicle (throttle/steering wheel angle, etc.) are output directly. To allow for adaptation across different vehicle models, the output can also be relaxed to the trajectory of the vehicle. This is the definition in the traditional sense, or what I call the narrow definition of end-to-end. On this basis, supervision of some intermediate tasks has also been introduced to improve performance.
However, beyond such a narrow definition, we should also ask what the essence of end-to-end really is. I think the essence of end-to-end is the lossless transmission of perception information. Let's first recall what the interface between the perception and PnC modules looks like in a non-end-to-end system. Generally, we have detection, attribute analysis, and prediction for whitelist objects (cars, people, etc.), and an understanding of the static environment (road structure/speed limit/traffic lights, etc.). If we are more thorough, we also do some detection of general obstacles. From a macro perspective, the information output by perception is an abstraction of complex driving scenarios, and it is an explicit abstraction defined by hand. For some unusual scenarios, however, the current explicit abstraction cannot fully express the factors in the scene that affect driving behavior, or the tasks we would need to define are too numerous and too trivial to enumerate. The end-to-end system therefore provides a comprehensive (perhaps implicit) representation, in the hope that this information can be passed to PnC automatically and without loss. I think any system that meets this requirement can be called generalized end-to-end.
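To make the contrast concrete, here is a minimal sketch of the two kinds of interface between perception and PnC. All class and field names are hypothetical and chosen only for illustration; they are not taken from any production system.

```python
from dataclasses import dataclass, field, fields
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TrackedObject:
    """One whitelist object (car, pedestrian, ...) in the explicit interface."""
    category: str
    position: Point                      # metres, ego frame (assumed convention)
    velocity: Point                      # metres per second
    predicted_path: List[Point] = field(default_factory=list)


@dataclass
class ExplicitPerceptionOutput:
    """Hand-defined abstraction of the scene passed from perception to PnC."""
    objects: List[TrackedObject]
    lane_centerline: List[Point]
    speed_limit_mps: float
    traffic_light: str                   # e.g. "red", "yellow", "green", "unknown"


@dataclass
class ImplicitPerceptionOutput:
    """Learned scene representation; no individual field is defined by a human."""
    scene_features: List[float]


def unrepresented_factors(scene_factors: List[str]) -> List[str]:
    """Return the scene factors for which the explicit interface has no slot."""
    known = {f.name for f in fields(ExplicitPerceptionOutput)}
    return [factor for factor in scene_factors if factor not in known]


# A factor nobody anticipated when the interface was designed is simply dropped.
print(unrepresented_factors(["traffic_light", "officer_hand_gesture"]))
```

In this sketch, anything that is not one of the predefined fields never reaches PnC, which is the information loss described above; the implicit representation avoids fixing the list of fields in advance, at the cost of interpretability.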
As for other problems, such as optimizing dynamic interaction scenarios, my personal view is that end-to-end is at least not the only way to solve them; traditional methods can solve them as well. Of course, when the amount of data is large enough, end-to-end may provide a fairly good solution. Whether it is necessary will be discussed under the following questions.
Must a system output control signals or waypoints to count as end-to-end?
If you accept the generalized end-to-end concept described above, this question is easy to answer. End-to-end emphasizes the lossless transmission of information rather than directly outputting the task variables. A system that does output them directly, that is, the narrow end-to-end approach, requires a large number of fallback solutions to ensure safety, and it also runs into problems during deployment, which are discussed in the sections below.
Must an end-to-end system be based on large models or pure vision?
There is no necessary connection among end-to-end autonomous driving, large-model autonomous driving, and pure-vision autonomous driving. These three concepts are completely independent of one another. An end-to-end system does not have to be driven by a large model in the traditional sense, nor does it have to be purely visual. The three are related in some ways, but they are not equivalent.
I have a previous article that elaborated on the relationship between these concepts. For details, see: https://zhuanlan.zhihu.com/p/664189972
In the long run, is it possible for the narrow end-to-end system described above to achieve autonomous driving above the L3 level?
Actually, I want to complain a little first. Those who claim that large models will overturn L4 have never actually built L4, and those who claim that end-to-end cures all ills have never worked on PnC. So my conversations with many end-to-end enthusiasts turned into purely religious disputes that can be neither verified nor falsified. Those of us engaged in cutting-edge research and development should be more pragmatic and pay attention to evidence. At the very least, you should have some basic knowledge of the thing you want to overturn and understand the thorny issues involved. That is the basic scientific literacy one should have.
Getting back to the subject: at present, I am pessimistic. Leaving aside FSD's current claim to be purely end-to-end, its performance is far from the reliability and stability required above the L3 level. In the future, even if such a system is statistically as safe as a human, it will still face the problem of aligning its errors with those of human drivers. To put it more bluntly, whether an autonomous driving system is accepted by the public and by public opinion may hinge not on an absolute accident or fatality rate, but on whether the public can accept that, in some scenarios that are relatively easy for humans to handle, the machine makes mistakes. This requirement is even harder for a pure end-to-end system to meet. I explained this in more detail in an answer I wrote in 2021. For details, see:
How to view Robin Li’s Moments post: Driverless driving will definitely cause an accident, but the probability is much lower than that of manned driving?
https://www.zhihu.com/question/530828899/answer/2590673435?utm_psn=1762524415009697792
Take Waymo and Cruise in North America as examples. Both have in fact had a number of accidents, so why was Cruise's most recent accident so unacceptable to regulators and the public? That accident involved two separate harms. The first, the collision itself, would have been quite difficult even for a human driver to avoid, and was in fact acceptable. After the collision, however, a serious secondary injury occurred: the system misjudged the location of the collision and the position of the injured person, and, in order not to block traffic, it degraded to pull-over mode and dragged the injured person along. No normal human driver would behave this way, and the impact was very bad. The incident directly led to the subsequent turmoil at Cruise. It also sounded the alarm for the rest of us: how to prevent such things from happening should be a serious consideration in the development and operation of autonomous driving systems.
So at this moment, what are the practical solutions for the next generation of mass-produced assisted driving systems?
To put it simply, I think a suitable system should first fully explore the upper limit of what the traditional system can do, and then combine it with the flexibility and generality of end-to-end; that is, a gradual path toward end-to-end. Of course, how to combine the two organically is paid content, haha. But we can analyze what the so-called end-to-end or learning-based planners are actually doing today.
Based on my limited understanding, when a so-called end-to-end model is used on the road, the trajectory it outputs is followed by a fallback solution based on traditional methods, or else the learning-based planner and a traditional trajectory planning algorithm output multiple trajectories at the same time and a selector picks one to execute. If the system architecture is designed this way, the performance ceiling of such a cascaded system is actually limited by the fallback solution and the selector. If the fallback is itself based on pure feedforward learning, there will still be unpredictable failures, which fundamentally defeats the purpose of a safety fallback. And if you use a traditional planning method to optimize over or select from the output trajectories, then the trajectory produced by the learning-based method is merely an initial solution to that optimization and search problem. So why not optimize and search for the trajectory directly?
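The cascade described above can be sketched as follows. This is a toy illustration under stated assumptions, not how any particular production stack works: the names `select_trajectory`, `is_safe`, and `min_clearance`, the point-obstacle model, and the shortest-path selection criterion are all hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]


def min_clearance(traj: Trajectory, obstacles: Sequence[Point]) -> float:
    """Smallest distance from any waypoint to any point obstacle.

    Only the waypoints are checked, not the segments between them.
    """
    if not obstacles:
        return float("inf")
    return min(
        ((x - ox) ** 2 + (y - oy) ** 2) ** 0.5
        for x, y in traj
        for ox, oy in obstacles
    )


def is_safe(traj: Trajectory, obstacles: Sequence[Point], margin: float = 1.0) -> bool:
    """Rule-based check: every waypoint keeps at least `margin` metres of clearance."""
    return min_clearance(traj, obstacles) >= margin


def path_length(traj: Trajectory) -> float:
    """Sum of the straight-line distances between consecutive waypoints."""
    return sum(
        ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
        for (x1, y1), (x2, y2) in zip(traj, traj[1:])
    )


def select_trajectory(
    candidates: Sequence[Trajectory],
    fallback: Trajectory,
    obstacles: Sequence[Point],
    margin: float = 1.0,
) -> Trajectory:
    """Pick the shortest candidate that passes the check, else the fallback."""
    safe = [t for t in candidates if is_safe(t, obstacles, margin)]
    if not safe:
        return fallback
    return min(safe, key=path_length)


learned = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]      # stand-in for a learned planner's output
traditional = [(0.0, 0.0), (5.0, 2.0), (10.0, 0.0)]  # stand-in for a traditional planner's output
stop = [(0.0, 0.0), (1.0, 0.0)]                      # stand-in for a fallback manoeuvre
chosen = select_trajectory([learned, traditional], stop, obstacles=[(5.0, 0.5)])
print(chosen is traditional)
```

Whatever the learned planner proposes, the executed behaviour here can never be better than what the check, the selection rule, and the fallback allow, which is the ceiling argument made above.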
Of course, some people will jump in and say that such an optimization or search problem is non-convex and that the state space is too large to run in real time on an in-vehicle system. I ask everyone to think carefully about this question: over the past 10 years, the perception system has enjoyed at least a 100x dividend in computing power, but what about our PnC module? If we also allowed the PnC module to use large amounts of compute, and combined that with recent advances in optimization algorithms, would this conclusion still hold? On questions like this, we should not rest on our laurels or fall into path dependence, but should work out what is right from first principles.
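The non-convexity objection can be illustrated with a deliberately tiny example. The one-dimensional cost below is a made-up stand-in, not a real planning objective: think of `d` as a lateral offset with two basins (say, passing an obstacle on one side or the other), one slightly better than the other. A local optimizer ends up in whichever basin its initial guess lies in, which is the role a learned proposal would play, while spending more compute on several starts recovers the better basin.

```python
from typing import Iterable


def cost(d: float) -> float:
    """Toy non-convex cost with a local minimum near d = 0.96 and a better one near d = -1.04."""
    return (d * d - 1.0) ** 2 + 0.3 * d


def grad(d: float) -> float:
    """Analytic derivative of `cost`."""
    return 4.0 * d * (d * d - 1.0) + 0.3


def local_optimize(d0: float, step: float = 0.05, iters: int = 200) -> float:
    """Plain gradient descent from the initial guess d0."""
    d = d0
    for _ in range(iters):
        d -= step * grad(d)
    return d


def multi_start(starts: Iterable[float]) -> float:
    """Run the local optimizer from every start and keep the lowest-cost result."""
    return min((local_optimize(s) for s in starts), key=cost)


from_right_guess = local_optimize(0.8)    # initial guess in the worse basin
from_left_guess = local_optimize(-0.5)    # initial guess in the better basin
best = multi_start([-1.5, -0.5, 0.5, 1.5])
print(from_right_guess, from_left_guess, best)
```

The multi-start loop costs four times the compute of a single descent in this sketch; whether that kind of budget is affordable on a vehicle is exactly the question raised above.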
In fact, an example that is very similar to autonomous driving is playing board games. Just this February, DeepMind published a paper (Grandmaster-Level Chess Without Search: https://arxiv.org/abs/2402.04494) exploring whether it is feasible to be purely data-driven and abandon the MCTS search used in AlphaGo and AlphaZero. The analogy in autonomous driving is using a single network to output actions directly and discarding all subsequent steps. The paper's conclusion is that, with a considerable scale of data and model parameters, a reasonable result can be obtained without search. Compared with methods that add search, however, there is still a very significant gap. (The comparison in the paper is actually not fair; the real gap should be even larger.) In particular, the purely data-driven approach performs very poorly on difficult endgames. Carried over to autonomous driving, this means that in difficult scenarios or corner cases that require multi-step game-theoretic reasoning, it is still hard to abandon traditional optimization or search algorithms entirely. Making sensible use of the strengths of each technique, as AlphaZero does, is the most efficient way to improve performance.
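The gap between acting without search and acting with search can be shown on a toy decision problem. This is a hand-built four-state example with made-up rewards, not the setup or the results of the paper cited above: a one-step greedy chooser stands in for "output the action directly", and a depth-limited lookahead stands in for search.

```python
from typing import Callable, Dict, Tuple

# state -> {action: (next_state, reward)}; terminal states have no actions.
GAME: Dict[str, Dict[str, Tuple[str, float]]] = {
    "start": {"left": ("A", 5.0), "right": ("B", 1.0)},
    "A": {"go": ("A_end", -10.0)},   # tempting first step, bad follow-up
    "B": {"go": ("B_end", 10.0)},    # modest first step, good follow-up
    "A_end": {},
    "B_end": {},
}


def greedy_action(state: str) -> str:
    """No search: pick the action with the best immediate reward."""
    return max(GAME[state], key=lambda a: GAME[state][a][1])


def search_value(state: str, depth: int) -> float:
    """Best total reward reachable from `state` within `depth` steps."""
    if depth == 0 or not GAME[state]:
        return 0.0
    return max(r + search_value(nxt, depth - 1) for nxt, r in GAME[state].values())


def search_action(state: str, depth: int = 2) -> str:
    """With search: pick the action with the best `depth`-step lookahead value."""
    def lookahead(action: str) -> float:
        nxt, r = GAME[state][action]
        return r + search_value(nxt, depth - 1)
    return max(GAME[state], key=lookahead)


def rollout(choose: Callable[[str], str], state: str = "start") -> float:
    """Follow `choose` until a terminal state and return the total reward."""
    total = 0.0
    while GAME[state]:
        state, reward = GAME[state][choose(state)]
        total += reward
    return total


print(rollout(greedy_action), rollout(search_action))
```

The greedy chooser takes the attractive first move and pays for it one step later; the lookahead sees the consequence and avoids it. Multi-step interactive driving scenarios have the same structure, only with a far larger state space.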
The notion of what counts as rule based is another one that I repeatedly have to correct in conversations with many people. By many people's definition, anything that is not purely data-driven is rule based. Take board games again as an example. Memorizing opening formulas and game records by rote is rule based, but giving the model reasoning capabilities through search and optimization, as AlphaGo and AlphaZero do, cannot in my view be called rule based. This is precisely what current large models themselves lack, and what researchers are trying to give learning-based models through CoT and other methods. Moreover, every action a human driver takes has a clear motivation, unlike purely data-driven tasks such as image recognition, where the reasons cannot be clearly articulated. Under a suitable algorithm architecture, decisions and trajectories should become variables that are optimized jointly under the guidance of a principled objective, rather than being fixed case by case by forcibly applying patches and tuning parameters. Such a system naturally has no need for a pile of strange hard-coded rules.
Finally, end-to-end may be a promising technical route, but how such a concept can be put into practice still leaves much to be explored. Is piling up data and model parameters the only correct solution? In my opinion, not at the moment. I feel that, as researchers and engineers working at the frontier, we should at all times truly pursue the first-principles and engineering thinking that Musk talks about, and think about the essence of the problem from practice, rather than turning Musk himself into the first principle. If you want to be really far ahead, you should not give up thinking for yourself and simply follow what others say; otherwise you will forever be left trying to overtake in the corners.