The latest embodied navigation results from Dong Hao's team at Peking University are here:
No additional mapping or training is needed. Just speak a navigation instruction, such as:
Walk forward across the room and walk through the pantry followed by the kitchen. Stand at the end of the kitchen.
With an instruction like this, the robot can be directed to move flexibly.
Here, the robot actively communicates with an "expert team" composed of large models to complete a range of key visual language navigation tasks, such as instruction analysis, visual perception, completion estimation and decision testing.
The project homepage and paper are now online, and the code will be released soon.
Visual language navigation involves a series of subtasks, including instruction analysis, visual perception, completion estimation and decision testing.
These key tasks require knowledge in different fields, and they are interrelated and determine the navigation ability of the robot.
Inspired by how human experts actually discuss problems, Dong Hao's team at Peking University proposed the DiscussNav navigation system.
The authors first assign expert roles and specific tasks to LLMs (large language models) and MLMs (multimodal large models) through prompts, activating their domain knowledge and capabilities and thus building a team of visual navigation experts with different specialties.
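As a rough illustration of role assignment by prompting, the sketch below wraps one generic model callable with different role prompts. The role names, the prompt wording, and the `make_expert` helper are hypothetical and are not taken from the DiscussNav code, which has not been released; `echo_model` is only a stand-in for a real LLM or MLM call.

```python
from typing import Callable

# A model is any callable that maps a prompt string to a reply string.
Model = Callable[[str], str]

# Hypothetical role prompts; the wording used by DiscussNav may differ.
ROLE_PROMPTS = {
    "instruction_analysis": "You are an expert at splitting navigation "
    "instructions into actions and landmarks.",
    "visual_perception": "You are an expert at describing the scene in "
    "each direction.",
    "completion_estimation": "You are an expert at judging which actions "
    "are already done.",
    "decision_testing": "You are an expert at checking movement decisions.",
}


def make_expert(role: str, model: Model) -> Model:
    """Return a callable that prefixes every question with the role prompt."""
    role_prompt = ROLE_PROMPTS[role]

    def expert(question: str) -> str:
        return model(f"{role_prompt}\n\nQuestion: {question}")

    return expert


def echo_model(prompt: str) -> str:
    """Stand-in for a real model: returns the prompt it received."""
    return prompt


# One expert per role, all backed by the same underlying model.
experts = {role: make_expert(role, echo_model) for role in ROLE_PROMPTS}
reply = experts["decision_testing"]("Which direction should the robot take?")
```

Swapping `echo_model` for a real model client would leave the rest of the sketch unchanged, which is the point of keeping the role prompt separate from the model call.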
Then, the authors designed a corpus of discussion questions and a discussion mechanism. Following this mechanism, the LLM-driven navigation robot can actively initiate a series of discussions with the visual navigation experts.
Before each move, the navigation robot discusses with the experts to understand the required actions and the object landmarks mentioned in the human instruction.
Then, based on the types of these object landmarks, it perceives the surrounding environment selectively, estimates how much of the instruction has been completed, and makes a preliminary movement decision.
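The per-move discussion described above can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the authors' implementation: the `discuss_step` function, the `StepResult` fields, the expert keys (including a separate `decision` expert) and the canned stub replies are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Expert = Callable[[str], str]


@dataclass
class StepResult:
    landmarks: List[str]
    observation: str
    completed: str
    decision: str


def discuss_step(instruction: str, experts: Dict[str, Expert]) -> StepResult:
    """Run one round of discussion before the robot moves."""
    # 1. Ask which object landmarks the instruction mentions.
    landmarks = [
        item.strip()
        for item in experts["instruction_analysis"](instruction).split(",")
        if item.strip()
    ]
    # 2. Perceive the surroundings with a focus on those landmarks.
    observation = experts["visual_perception"](", ".join(landmarks))
    # 3. Estimate how much of the instruction is already complete.
    completed = experts["completion_estimation"](observation)
    # 4. Make a preliminary movement decision from what was learned.
    decision = experts["decision"](f"{observation} | {completed}")
    return StepResult(landmarks, observation, completed, decision)


# Stub experts with canned replies, used only to make the sketch runnable.
stub_experts: Dict[str, Expert] = {
    "instruction_analysis": lambda text: "pantry, kitchen",
    "visual_perception": lambda focus: f"ahead: door to {focus.split(', ')[0]}",
    "completion_estimation": lambda obs: "crossed the room",
    "decision": lambda context: "move forward",
}

result = discuss_step(
    "Walk forward across the room and walk through the pantry "
    "followed by the kitchen.",
    stub_experts,
)
```

Each stage only sees the output of the previous one, which mirrors the order given in the text: landmarks first, then perception, then completion estimation, then a preliminary decision.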
During the decision-making process, the navigation robot simultaneously generates N independent predictions. When the predictions are inconsistent, the robot asks the decision testing expert for help in selecting the final movement decision.
We can see from this process that, whereas traditional methods require additional pre-training, this method guides the robot to move according to human instructions by interacting with large model experts, which directly addresses the scarcity of robot navigation training data.
Furthermore, it is precisely because of this feature that the method also achieves zero-shot capability. As long as it follows the discussion process above, the robot can follow a wide variety of navigation instructions.
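The consistency check on the N predictions might look roughly like the following. The function name `choose_move`, its parameters, and the stub predictor and tester are hypothetical; the article does not say what value of N is used or how the decision testing expert is queried.

```python
from typing import Callable, List


def choose_move(
    predict: Callable[[], str],
    decision_tester: Callable[[List[str]], str],
    n: int,
) -> str:
    """Sample n independent predictions and resolve any disagreement."""
    predictions = [predict() for _ in range(n)]
    if len(set(predictions)) == 1:
        # All predictions agree, so no extra discussion is needed.
        return predictions[0]
    # Predictions disagree: pass the distinct candidates, in first-seen
    # order, to the decision testing expert.
    candidates = list(dict.fromkeys(predictions))
    return decision_tester(candidates)


# Stub predictor that disagrees with itself, and a stub tester that
# simply picks the last candidate; both are placeholders for model calls.
replies = iter(["left", "forward", "left"])
final = choose_move(lambda: next(replies), lambda options: options[-1], n=3)
```

When every sample agrees, the tester is never called, so the extra expert query is only paid for on the uncertain steps.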
The following is the performance of DiscussNav on the classic visual language navigation dataset Room-to-Room (R2R).
As can be seen, it is significantly higher than all zero-shot methods, and it even exceeds two trained methods.
The authors further carried out navigation experiments in real indoor scenes on a Turtlebot4 mobile robot.
Thanks to the strong language and visual generalization of large models, elicited by expert role-playing and discussion, DiscussNav performs significantly better in the real world than the previous best zero-shot method and the pre-trained and fine-tuned method, demonstrating good sim-to-real transfer.
Through experiments, the authors further found that DiscussNav exhibits four powerful abilities:
1. Identifying open-world objects, such as "robot arm on white table" and "teddy bear on chair".
2. Identifying fine-grained navigation landmark objects, such as "plants on the kitchen counter" and "cartons on the table".
3. Correcting erroneous information in other experts' replies during the discussion. For example, the landmark extraction expert checks and corrects an incorrectly decomposed action sequence before extracting navigation landmarks from it.
4. Eliminating inconsistent movement decisions. For example, the decision testing expert can select the most reasonable of several inconsistent movement decisions predicted by DiscussNav, based on current environment information, as the final movement decision.
The corresponding author, Dong Hao, proposed in a previous report, "Simulation and large model priors are Free Lunch", that exploring in depth how to make effective use of simulation data and of the prior knowledge that large models learn from massive data is the development direction of future embodied intelligence research.
Currently, limited by data scale and the high cost of exploring real environments, embodied intelligence research will still focus on simulation platform experiments and training on simulation data.
Recent progress in large models provides a new direction for embodied intelligence: properly exploring and exploiting the linguistic common sense and physical-world priors in large models will promote its development.
Paper address: https://arxiv.org/abs/2309.11382