


Peking University's embodied intelligence team proposes demand-driven navigation to align with human needs and make robots more efficient
Imagine a robot that could understand your needs and actively work to meet them. Wouldn't that be great?
To get a robot to help you, you usually need to give it a fairly precise command, but the command may not be easy to carry out. In a real environment, the specific item the robot is asked to find may simply not exist, in which case the robot can never find it. Yet the environment may well contain another item with a similar function that would satisfy the user's need just as well. This is the advantage of using "demands" as task instructions.
Recently, Dong Hao's team at Peking University proposed a new navigation task, Demand-driven Navigation (DDN), which has been accepted at NeurIPS 2023. In this task, the robot must find an item that satisfies a demand instruction given by the user. The team also proposed learning the attribute features of items from demand instructions, which significantly improves the robot's success rate in finding items.

Paper address: https://arxiv.org/pdf/2309.08138.pdf
Project homepage: https://sites.google.com/view/demand-driven-navigation/home
Using demands as task instructions has several advantages:
- Users only need to state their needs, without knowing what is actually in the scene.
- Demands raise the probability that the user's need is met. For example, when you are "thirsty", asking the robot to find "something that can quench my thirst" covers far more items than asking it to find "tea".
- Demands described in natural language have a larger description space and allow more fine-grained, precise requests.
To train such a robot, a mapping between demand instructions and items must be established so that the environment can provide training signals. To reduce annotation cost, Dong Hao's team proposed a "semi-automatic" generation method based on a large language model: GPT-3.5 first generates demands that can be satisfied by items present in the scene, and invalid demand-item pairs are then filtered out manually.
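The two-stage pipeline above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: `generate_demands` is a stub standing in for the GPT-3.5 call, and the hand-written reject set stands in for the manual filtering pass.

```python
# Sketch of the semi-automatic demand-object annotation pipeline.
# All names, prompts, and data below are illustrative placeholders.

scene_objects = ["tea", "juice", "soap"]

def generate_demands(objects):
    # Stub for the GPT-3.5 call: propose demands each object might satisfy.
    # The LLM occasionally proposes invalid pairs, hence the filter below.
    return {
        "tea":   ["I am thirsty"],
        "juice": ["I am thirsty"],
        "soap":  ["I want to wash my clothes", "I am thirsty"],  # 2nd pair is invalid
    }

def human_filter(pairs, rejected):
    # Manual pass: drop demand-object pairs flagged as invalid.
    return {obj: [d for d in demands if (obj, d) not in rejected]
            for obj, demands in pairs.items()}

raw = generate_demands(scene_objects)
dataset = human_filter(raw, rejected={("soap", "I am thirsty")})
```

The surviving pairs then serve as the supervision signal linking demand instructions to items in the environment.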
Algorithm design
Items that can satisfy the same demand tend to share similar attributes; if the features of those attributes can be learned, the robot should be able to use them to find items. For example, for the demand "I am thirsty", the required item should have the attribute of "quenching thirst", and both "juice" and "tea" have this attribute. Note that an item may exhibit different attributes under different demands: "water" exhibits the attribute of "cleaning clothes" under the demand "wash my clothes", and the attribute of "quenching thirst" under the demand "I am thirsty".
Attribute learning stage
So, how can the model be made to understand demands like "quenching thirst" and "cleaning clothes"? The attributes an item exhibits under a given demand are a relatively stable piece of common sense, and with the rise of large language models (LLMs) in recent years, the grasp of human common sense demonstrated by LLMs has been remarkable. Dong Hao's team therefore decided to learn this common sense from an LLM. They first asked the LLM to generate a large number of demand instructions (called Language-grounding Demands, LGD in the figure), and then asked the LLM which items can satisfy those demands (called Language-grounding Objects, LGO).
Note that the prefix Language-grounding emphasizes that these demands/objects are obtained from the LLM and do not depend on any specific scene, whereas World-grounding in the figure below emphasizes that the demands/objects are tied to a specific environment (such as the ProcThor and Replica scene datasets).
To obtain the attributes of an LGO under an LGD, the authors encoded the LGD with BERT and the LGO with the CLIP-Text-Encoder, then concatenated the two encodings into a Demand-object Feature. Recalling the "similarity" noted when introducing item attributes above, the authors used this similarity to define positive and negative samples and then trained the "item attributes" with contrastive learning. Specifically, for two Demand-object Features, if the corresponding items can satisfy the same demand, the two features are positive samples of each other (for example, if both item a and item b in the figure can satisfy demand D1, then DO1-a and DO1-b are positive samples of each other); any other pairing is a negative sample. The authors fed the Demand-object Features into an Attribute Module with a Transformer-Encoder architecture and trained it with the InfoNCE loss.
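The contrastive objective above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's code: the BERT/CLIP encodings are replaced by small hand-made vectors, and the InfoNCE loss is written over cosine similarities.

```python
import numpy as np

def info_nce_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE over cosine similarities: pull the anchor toward features
    of objects that satisfy the same demand, push it away from every
    other demand-object pairing."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.array([cos(anchor, p) for p in positives]) / temperature
    neg = np.array([cos(anchor, n) for n in negatives]) / temperature
    # Each positive competes against the full candidate set.
    log_denom = np.log(np.exp(pos).sum() + np.exp(neg).sum())
    return float(np.mean(log_denom - pos))

# Toy Demand-object Features: DO1-a and DO1-b satisfy the same demand D1,
# so they are positives for each other; DO2-c belongs to another demand.
do1_a = np.array([1.0, 0.2, 0.0])
do1_b = np.array([0.9, 0.3, 0.1])
do2_c = np.array([0.0, 0.1, 1.0])

aligned_loss    = info_nce_loss(do1_a, [do1_b], [do2_c])  # correct pairing: low loss
misaligned_loss = info_nce_loss(do1_a, [do2_c], [do1_b])  # swapped pairing: high loss
```

Minimizing this loss drives features of items satisfying the same demand together, which is exactly the "similar attributes" intuition described above.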
Navigation policy learning stage
Through contrastive learning, the Attribute Module acquires the common sense provided by the LLM. In the navigation policy learning stage, the Attribute Module's parameters are loaded directly, and the policy is then trained with imitation learning on trajectories collected with the A* algorithm. At each time step, the authors use a DETR model to segment the items in the current field of view, obtaining World-grounding Objects, which are then encoded by the CLIP-Visual-Encoder. The remaining processing mirrors the attribute learning stage. Finally, the BERT features of the demand instruction, the global image features, and the attribute features are concatenated, fed into a Transformer model, and an action is output.
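A toy version of this fuse-and-act step might look as follows. Real BERT, CLIP, and DETR features are replaced by random placeholders, mean pooling stands in for the Transformer fusion, and the feature dimension and action set are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

# Hypothetical action set and feature dimension (placeholders).
ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight", "LookUp", "LookDown", "Done"]
D = 32
rng = np.random.default_rng(0)

demand_feat = rng.standard_normal(D)        # stand-in for BERT instruction feature
image_feat  = rng.standard_normal(D)        # stand-in for global image feature
attr_feats  = rng.standard_normal((5, D))   # stand-ins for per-object attribute features

def policy_step(demand_feat, image_feat, attr_feats, w):
    """Stack the feature tokens, pool them, and map to action logits."""
    tokens = np.vstack([demand_feat, image_feat, attr_feats])
    pooled = tokens.mean(axis=0)            # placeholder for Transformer fusion
    logits = pooled @ w                     # linear action head
    return ACTIONS[int(np.argmax(logits))]

w = rng.standard_normal((D, len(ACTIONS)))  # untrained head, for shape only
action = policy_step(demand_feat, image_feat, attr_feats, w)
```

In the actual system, imitation learning on the A*-collected trajectories supplies the supervision that trains the fusion Transformer and the action head end to end.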
It is worth noting that the authors used the CLIP-Text-Encoder in the attribute learning stage and the CLIP-Visual-Encoder in the navigation policy learning stage. CLIP's strong vision-language alignment is thus cleverly exploited to transfer the textual common sense learned from the LLM to the visual observations at each time step.
Experimental results
The experiments were conducted in the AI2-THOR simulator on the ProcThor dataset. The results show that this method significantly outperforms previous visual object navigation algorithms, their variants, and LLM-based baselines.
VTN is a closed-vocabulary navigation algorithm that can only navigate to a preset set of items. The authors tested several variants of it, but whether the BERT features of the demand instruction or the GPT-parsed results of the instruction were used as input, the results were unsatisfactory. Switching to ZSON, an open-vocabulary navigation algorithm, did not help either: because CLIP aligns demand instructions with images poorly, several ZSON variants could not complete the demand-driven navigation task well. Algorithms based on heuristic search plus an LLM suffered from low exploration efficiency in the large scenes of the ProcThor dataset, so their success rates were also low. Pure LLM methods, such as GPT-3-Prompt and MiniGPT-4, showed poor reasoning about unseen locations in the scene and could not efficiently find items meeting the demand.
Ablation experiments show that the Attribute Module significantly improves the navigation success rate. A t-SNE plot demonstrates that the Attribute Module, trained with demand-conditioned contrastive learning, successfully learns the attribute features of items. Replacing the Attribute Module's architecture with an MLP degraded performance, indicating that the Transformer-Encoder architecture is better suited to capturing attribute features. BERT extracts the features of demand instructions well, which improves generalization to unseen instructions.
Here are some visualizations:
The corresponding author, Dr. Dong Hao, is an assistant professor and doctoral supervisor at the Center on Frontiers of Computing Studies, Peking University, as well as a Boya Young Scholar and Zhiyuan Scholar. He founded the Peking University Hyperplane Lab in 2019 and has led it since. He has published more than 40 papers at top international conferences and journals such as NeurIPS, ICLR, CVPR, ICCV, and ECCV, with more than 4,700 Google Scholar citations, and has won the ACM MM Best Open Source Software Award and the OpenI Outstanding Project Award. He has repeatedly served as an area chair or associate editor for top conferences such as NeurIPS, CVPR, AAAI, and ICRA, has undertaken a number of national and provincial projects, and leads a major project under the Ministry of Science and Technology's New Generation Artificial Intelligence 2030 program.
The first author, Wang Hongzhen, is a second-year doctoral student at the School of Computer Science, Peking University. His research interests lie in robotics, computer vision, and psychology; he hopes to align humans and robots starting from human behavior, cognition, and motivation.
Reference links:
[1] https://zsdonghao.github.io/
[2] https://whcpumpkin.github.io/
