Title: OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning
Author affiliations: Beijing Institute of Technology, NVIDIA, Huazhong University of Science and Technology
Open-source code: GitHub - NVlabs/OmniDrive
The development of multimodal large language models (MLLMs) has led to growing interest in LLM-based autonomous driving, which seeks to exploit their powerful reasoning capabilities. However, using MLLMs to improve planning behavior is challenging, because it requires full 3D situational awareness beyond 2D reasoning. To address this challenge, this work proposes OmniDrive, a comprehensive framework for robust alignment between agent models and 3D driving tasks. The framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into the LLM. This query-based representation jointly encodes dynamic objects and static map elements (e.g., traffic lanes), providing a concise world model for perception-action alignment in 3D. The work further proposes a new benchmark of comprehensive visual question answering (VQA) tasks, including scene description, traffic rules, 3D grounding, counterfactual reasoning, decision making, and planning. Extensive studies demonstrate OmniDrive's superior reasoning and planning capabilities in complex 3D scenes.
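To make the query-based lifting concrete, below is a minimal PyTorch-style sketch of the core idea: a small set of learnable sparse queries cross-attends to flattened multi-view image features, compressing them into a compact set of 3D-aware tokens that are projected to the LLM's embedding width and concatenated with the text tokens. The module name `Sparse3DQueryEncoder`, the dimensions, and the single cross-attention layer are illustrative assumptions, not the paper's actual implementation, whose 3D perception head is more involved.

```python
# Minimal sketch (not the official OmniDrive code): sparse learnable queries
# cross-attend to multi-view image features, compressing them into a small
# set of tokens that are projected into the LLM's embedding space.
import torch
import torch.nn as nn


class Sparse3DQueryEncoder(nn.Module):  # hypothetical name for illustration
    def __init__(self, img_dim=256, llm_dim=4096, num_queries=256, num_heads=8):
        super().__init__()
        # Learnable sparse queries: intended to jointly capture dynamic objects
        # and static map elements in a compact form.
        self.queries = nn.Parameter(torch.randn(num_queries, img_dim))
        self.cross_attn = nn.MultiheadAttention(img_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(img_dim, llm_dim)  # align with the LLM token width

    def forward(self, img_feats):
        # img_feats: (B, N_views * H * W, img_dim), flattened multi-view features
        B = img_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        lifted, _ = self.cross_attn(q, img_feats, img_feats)  # compress into sparse tokens
        return self.proj(lifted)  # (B, num_queries, llm_dim)


# Usage: prepend the visual tokens to the embedded text prompt before the LLM.
encoder = Sparse3DQueryEncoder()
img_feats = torch.randn(2, 6 * 40 * 40, 256)   # e.g., features from 6 camera views
text_embeds = torch.randn(2, 32, 4096)         # embedded prompt tokens
llm_inputs = torch.cat([encoder(img_feats), text_embeds], dim=1)
```

The design choice this sketch highlights is that the LLM never sees dense per-pixel features; only a few hundred query tokens carry the compressed 3D scene, which keeps the language model's input length manageable.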