
Google scientists speak personally: How to implement embodied reasoning? Let the large model 'speak' the language of the robot

Wang Lin
Release: 2023-04-12 12:25:16

With the rapid development of large language models, can their capabilities be used to guide robots to understand complex instructions and complete higher-level tasks? And what challenges arise along the way? Recently, the Zhiyuan Community invited Dr. Xia Fei, a research scientist at Google, to give a report on "Embodied Reasoning Based on Language and Vision", detailing his team's cutting-edge work in this emerging field.

About the author: Xia Fei is a research scientist on the robotics team at Google Brain. His main research direction is applying robots to unstructured, complex environments. His representative works include Gibson Env, iGibson, and SayCan, and his research has been covered by WIRED, the Washington Post, the New York Times, and other media. Dr. Xia received his Ph.D. from Stanford University, where he was advised by Silvio Savarese and Leonidas Guibas. He has published many papers in conferences and journals such as CVPR, CoRL, IROS, ICRA, NeurIPS, RA-L, and Nature Communications. His recent research direction is using foundation models in the decision-making process of intelligent agents; his team recently proposed the PaLM-SayCan model.

01 Background

Machine learning for robots has made great progress in recent years, but significant problems remain. Machine learning requires large amounts of training data, yet data generated by real robots is very expensive to collect, and the robots themselves suffer wear and tear.

As children, humans interact with the physical world through play and learn many physical laws. Inspired by this, can a robot also interact with its environment to acquire this physical information and complete various tasks? Applying machine learning to robots relies heavily on simulation environments.

To this end, Dr. Xia Fei and his colleagues proposed Gibson Env (Environment) and iGibson. The former focuses on reconstructing the visual environment, while the latter focuses on physical simulation. By scanning and reconstructing the real world in 3D and rendering visual signals with neural networks, they build a simulation environment in which a variety of robots can run physics simulations and learn control from sensors to actuators. In the iGibson environment, robots can learn richer interactions with the environment, such as learning to use a dishwasher.

Dr. Xia Fei believes the above work represents a shift from Internet AI to embodied AI. In the past, AI training was mainly based on datasets such as ImageNet and MS COCO, which are Internet tasks. Embodied AI requires perception and action to form a closed loop: the AI must decide its next action based on what it perceives. Xia Fei's doctoral thesis, "Large Scale Simulation for Embodied Perception and Robot Learning", is about large-scale robot simulation for learning, perception, and reasoning.

In recent years, foundation models have developed rapidly in artificial intelligence. Some researchers believe that instead of relying on simulation environments, information can be extracted from foundation models to help robots make decisions. Dr. Xia Fei calls this new direction "Foundation Models for Decision Making", and he and his team have proposed work such as PaLM-SayCan.

02 PaLM-SayCan: Let the language model guide the robot

1. Why is it difficult for robots to handle complex, long-horizon tasks?

The PaLM-SayCan paper has 45 authors. It is a collaboration between the Google Robotics team and Everyday Robots, aimed at exploring how machine learning can change the field of robotics, and how robots can in turn provide data that improves machine learning. The research focuses on two issues: unstructured, complex environments, and making robots more useful in daily life.

Although people already have personal assistants like Siri or Alexa, nothing comparable exists in robotics. Dr. Xia gave this example: when a drink is spilled, we want to explain the situation to the robot and ask it for help; or, if you are tired after exercise, you ask it to bring drinks and snacks. The research hopes that robots can understand and perform such tasks.

The current difficulty is that robots still struggle with long-horizon tasks and remain incapable of tasks that require complex planning, common sense, and reasoning. There are two reasons. The first is the lack of a good user interface in robotics. When traditional robots perform pick-and-place tasks, they usually use goal conditioning or one-hot conditioning. Goal conditioning tells the robot what the goal is and has it transform the initial state into the goal state; this requires first demonstrating to the robot what the world should look like after the task is completed.

One-hot conditioning, in contrast, numbers every task the robot can complete (say, 100 tasks numbered 0 to 99). Each time a task needs to be executed, a number is given to the robot, which then knows which task to perform. The problem with one-hot conditioning is that the user has to remember the encoding of each task, and the one-hot encoding carries no dependency information between tasks (such as the sequence of task codes needed to reach a goal).
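
For illustration, here is a minimal Python sketch of one-hot task conditioning, assuming a hypothetical catalogue of 100 numbered skills; the task names and encoding are purely illustrative, not the interface of any specific robot system. It shows why the encoding is opaque to users and says nothing about how tasks relate.

```python
import numpy as np

# Hypothetical catalogue: the robot supports 100 skills, numbered 0..99.
NUM_TASKS = 100
TASK_NAMES = {0: "pick up the sponge", 1: "open the drawer", 2: "place the cup on the table"}  # ...

def one_hot_condition(task_id: int) -> np.ndarray:
    """Encode a task as a one-hot vector; the policy sees only this vector."""
    vec = np.zeros(NUM_TASKS, dtype=np.float32)
    vec[task_id] = 1.0
    return vec

# The user has to remember that index 2 means "place the cup on the table",
# and the vector carries no information about which tasks depend on which.
conditioning = one_hot_condition(2)
```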

As a result, current robots can only complete short-horizon tasks, usually picking and placing. The robot itself is static rather than mobile, and the environment is limited to settings such as laboratories, often with no humans present.

2. Using language models for robots: how to make them "speak" the robot's language?

To solve these problems, the team turned to foundation models. Language models can replace goal conditioning and describe tasks clearly and unambiguously through language. Language also contains dependency information between task steps, such as step one, step two, and so on in a recipe, which helps robot learning. In addition, language can define long-horizon tasks and overcome the limitations of imitation-learning methods.

Using large models on robots poses several challenges. The most important is constraining the model's output to language the robot can act on: the large model is trained on human natural language, and the tasks it outputs may be impossible for a robot. Moreover, the language model was not trained on the robot's data, so it does not know the scope of the robot's capabilities. The second is the grounding problem: the large model has never experienced the physical world and lacks embodied information. The third is the safety and interpretability of a robot guided by a large model: biases in language models may be amplified once they are connected to physical systems, causing real-world consequences.

Here is an example about credibility: when a human user chats with Google's LaMDA model and asks about its "favorite island", the model answers Crete, a Greek island, and can even give some reasons. But this answer is not credible, because the answer the AI should give is "I don't know which island I like best, because I have never been to any island." The problem with the language model is that it has not interacted with the real world and only outputs the most likely next sentence based on statistical patterns.

If language models are used on robots, different models will give different results, some of which are not useful for driving the robot to complete tasks. For example, if a user asks the robot to "clean up a spilled drink", GPT-3 might say, "You can use a vacuum cleaner." This is not entirely correct, because a vacuum cleaner cannot clean up liquids.

A LaMDA model might say, "Do you want me to help you find a cleaner?" This answer is reasonable but not practically useful, because LaMDA is fine-tuned on dialogue data and its objective is to extend the conversation as long as possible, not to help complete the task. A FLAN model might reply, "Sorry, I didn't mean it." It does not understand the user's intent: is this a conversation, or a problem to be solved? So there is a series of problems in using large language models on robots.

PaLM-SayCan sets out to solve these challenges. The first step is to make the large model speak the robot's language through few-shot prompting. For example, construct example tasks such as "bring the coffee to the cupboard" or "give me an orange", and give the corresponding steps (say, steps 1-5 and 1-3). The user then gives the model an instruction: "Put an apple on the table." With the earlier examples as prompts, the model finds and combines the appropriate task steps on its own and generates a step-by-step plan to complete the task.
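
As a rough illustration, the prompt construction might look like the sketch below; the example step wording and the `llm_complete` text-completion call are placeholders rather than the actual PaLM interface or the team's real prompts.

```python
# A minimal sketch of few-shot prompting for task planning.
# `llm_complete` stands in for any text-completion API; it is not the actual PaLM interface.

FEW_SHOT_EXAMPLES = """\
Task: bring the coffee to the cupboard
1. find the coffee
2. pick up the coffee
3. go to the cupboard
4. put down the coffee
5. done

Task: give me an orange
1. find an orange
2. pick up the orange
3. bring it to the user
"""

def plan(instruction: str, llm_complete) -> str:
    # The model continues the pattern, producing numbered steps for the new task.
    prompt = FEW_SHOT_EXAMPLES + f"\nTask: {instruction}\n1."
    return "1." + llm_complete(prompt)

# plan("put an apple on the table", llm_complete=my_model)
```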

Note that there are two main ways to interact with large models. One is the generative interface, which generates the next token given the input; the other is the scoring interface, which computes the likelihood of given tokens. PaLM-SayCan uses the scoring interface, which makes the language model more stable and more likely to output the desired results. In the apple-placing task, the model scores the candidate steps and selects the most suitable one.
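
A minimal sketch of the scoring interface, assuming the model exposes some way to compute the log-likelihood of a candidate continuation; `log_likelihood` below is a placeholder for that capability, and the candidate skill names are illustrative.

```python
def score_candidates(prompt: str, candidates: list[str], log_likelihood) -> dict[str, float]:
    """Score each candidate continuation instead of sampling free-form text.

    `log_likelihood(prompt, continuation)` is a placeholder for a scoring
    interface returning log p(continuation | prompt) under the language model.
    """
    return {c: log_likelihood(prompt, c) for c in candidates}

# Candidate next steps drawn from the robot's skill library (illustrative names).
candidates = [
    "find an apple",
    "pick up the apple",
    "put down the apple",
    "done",
]
# scores = score_candidates("Task: put an apple on the table\n1. ", candidates, my_lm_score)
# next_step = max(scores, key=scores.get)
```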

3. Bridging the gap between the language model and the real world: Let the robot explore the affordances of the environment

There is another problem that needs to be solved: when the language model generates the task steps, it does not know what the robot can currently do. If there is no apple in front of the robot, the robot cannot complete the task of placing the apple. Therefore, this requires letting the language model know what tasks the robot can do in the current environment and state. A new concept needs to be introduced here, called Robotic Affordances, which is also the core of this work.

"Affordance" is a concept proposed by the American psychologist James J. Gibson around 1977. It is defined as the actions an agent can perform in an environment in its current state. Affordances can be obtained with supervised learning, but that requires large amounts of data and labeling.

Instead, the team adopted a reinforcement learning approach and used the value function of a policy to approximate affordances. For example, train a robot to grasp various objects in the environment; after training, let the robot explore a room. When it sees an object in front of it, the value function for picking up that object becomes very high, which serves as a substitute for affordance prediction.

Combining affordances and the language model yields the PaLM-SayCan algorithm. As shown in the figure above, on the left is the language model, which scores the tasks the robot can complete given the user's instruction, i.e., the probability that completing a sub-task helps complete the overall task. On the right is the value function, which gives the probability of completing each task in the current state. The product of the two is the probability that the robot successfully completes a sub-task that contributes to the overall task. In the apple example, there is no apple in front of the robot initially, so the first thing to do is find the apple: the affordance score for finding the apple is high, while the score for grabbing the apple is low. After the apple is found, the affordance score for grabbing it increases, and the grab is executed. This process repeats until the overall task is completed.
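
Putting the two pieces together, the decision rule described above can be sketched as follows; `lm_probability` and `affordance_value` stand in for the language-model scoring interface and the learned value functions, and the skill names are illustrative.

```python
def saycan_step(instruction: str, history: list[str], skills: list[str],
                lm_probability, affordance_value) -> str:
    """One planning step of a SayCan-style decision rule.

    lm_probability(instruction, history, skill): probability that this skill is
        a useful next step toward the instruction (language-model score).
    affordance_value(skill): probability the skill can be completed in the
        current state (value function of the skill's policy).
    Both are placeholders for the learned components.
    """
    combined = {
        skill: lm_probability(instruction, history, skill) * affordance_value(skill)
        for skill in skills
    }
    return max(combined, key=combined.get)

# Example: with no apple in view, affordance_value("pick up the apple") is low,
# so "find an apple" wins; once the apple is found, its pick-up affordance rises.
```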

03 More embodied intelligence work: improving model reasoning and closing the loop with environmental feedback

1. Chain of Thought Prompting: understanding complex common sense

In addition to PaLM-SayCan, Dr. Xia and his colleagues have completed other related work. On the prompting side, the team proposed Chain of Thought Prompting (which can be understood as showing the problem-solving steps) to give the language model stronger reasoning capabilities.

The standard prompting pattern is to design a question template and give the answer; the model then outputs answers at inference time, but sometimes those answers are wrong. The goal of Chain of Thought Prompting is to provide an explanation along with the question, which significantly improves the model's results and even surpasses human levels on some tasks.

Models are prone to errors when processing negated sentences. For example, a human user asks, "Give me a fruit, but not an apple." Models tend to hand over an apple, because "apple" appears both in the question and in the executable options. With Chain of Thought Prompting, an explanation can be provided; for example, the model would output, "The user wants a fruit, but not an apple. A banana is a fruit and not an apple, so I can give the user a banana."
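
A chain-of-thought prompt for this kind of negation might look like the sketch below; the wording is illustrative rather than the team's actual prompt.

```python
# Illustrative chain-of-thought prompt for handling negation.
# The exact wording is hypothetical, not the prompt used in the paper.
COT_PROMPT = """\
Human: Give me a fruit, but not an apple.
Explanation: The user wants a fruit, but not an apple. A banana is a fruit
and it is not an apple, so I can give the user a banana.
Plan: 1. find a banana  2. pick up the banana  3. bring it to the user

Human: Bring me a drink, I am allergic to caffeine.
Explanation: The user is allergic to caffeine, so the drink must not contain
caffeine. Water has no caffeine, so I can bring the user a bottle of water.
Plan: 1. find a bottle of water  2. pick it up  3. bring it to the user

Human: {new_instruction}
Explanation:"""

# prompt = COT_PROMPT.format(new_instruction="Give me a snack, but nothing salty.")
```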

Chain of Thought Prompting can also handle more subtle negative requirements. For example, a user says they are allergic to caffeine and asks the robot for a drink. An allergy is a subtler form of negation: with traditional methods, the robot might fetch a caffeinated drink, failing to understand the negation an allergy implies. Chain of Thought Prompting can spell out the allergy and improve the reasoning.

2. Inner Monologue: correcting errors and returning to the right execution track

Combining large-model decision making with interaction with the environment is another important research direction. The team proposed Inner Monologue, which lets the language model review past decisions based on changes in the environment and recover from wrong instructions or accidents caused by the environment.

For example, when a person gets home and finds that the chosen key cannot open the door, they will try another key or change the direction of rotation. This is correcting errors and updating actions based on feedback from the environment. Inner Monologue works the same way. For example, if the cola falls while the robot is grasping it, the subsequent tasks cannot be completed. Inner Monologue is needed to detect whether the task was completed successfully, feed that result back into the decision-making process, and make new decisions based on the feedback.

As shown in the figure, Inner Monologue includes an active scene description module (Active Scene Description) and a task success detector (Success Detector). When a human gives an instruction, the model can execute it and invoke scene descriptions to assist the robot's decisions. Training still uses few-shot prompting, so the model can generalize from examples. For example, when the robot is asked to get a drink, it will ask the human whether they want a Coke or a soda.
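
The closed loop described here can be sketched roughly as follows, with the planner, the robot policy, the success detector, and the scene describer left as placeholder callables; this is an illustration of the idea, not the actual Inner Monologue implementation.

```python
def inner_monologue_loop(instruction: str, planner, execute, success_detector,
                         describe_scene, max_steps: int = 20) -> None:
    """Closed-loop execution sketch: plan, act, and feed textual feedback
    (scene descriptions and success/failure signals) back into the planner.

    planner, execute, success_detector, and describe_scene are placeholders
    for the language model, the robot policy, and the perception modules.
    """
    dialogue = [f"Human: {instruction}", f"Scene: {describe_scene()}"]
    for _ in range(max_steps):
        step = planner("\n".join(dialogue))          # e.g. "Robot: pick up the coke"
        if "done" in step:
            break
        dialogue.append(step)
        execute(step)
        # Report whether the step succeeded; on failure the planner can retry or replan.
        dialogue.append("Success: yes" if success_detector(step) else "Success: no")
        dialogue.append(f"Scene: {describe_scene()}")
```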

Another case is reasoning over historical information. Humans often change their minds, or, after changing instructions several times, ask the robot to complete "the task from just now". Here "the task from just now" is not specified explicitly, so the model must look back through the history to see what the previous tasks were. In addition to English, Inner Monologue currently works in Chinese and other languages. Through experiments in other domains, the team found that this environmental-feedback approach can complete some very complex, closed-loop planning tasks.

04 Q&A

Q: Was the large language model in PaLM-SayCan trained from scratch, or was an existing model used as-is?

A: The large pre-trained model does not need to be fine-tuned; it already contains a lot of decision-making information. For example, you can use GPT-3 with 175 billion parameters, or PaLM, which already contains enough task-planning and sequencing information.

Q: In the Inner Monologue work, will the agent also proactively ask questions? How is that achieved?

A: We use the language model with a prompting method. When the robot completes a task step, two options appear: "and ask" and "and continue". Whether to ask a question or to continue depends on whether there is ambiguity in the contextual semantics.

Q: How does the robot know where an item is (such as potato chips in the drawer)? If the capabilities of robots gradually increase in the future, will the search space be too large during exploration?

A: The robot's knowledge of where items are stored is currently hard-coded rather than an automatic process. But the large language model also contains some semantic knowledge, such as where items tend to be, and this semantic knowledge can reduce the search space. You can also explore based on the probability of finding items. Xia Fei's team has recently published new work addressing this problem; the core idea is to build a scene representation indexed by natural language. See nlmap-saycan.github.io.
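
As a hedged illustration of what a natural-language-indexed scene representation could look like (not the actual NLMap implementation), objects detected during exploration might be stored with a text embedding and a location and then retrieved with a free-form query; `embed` below is a placeholder text encoder.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class SceneMap:
    """Hypothetical natural-language-indexed scene map: detections are stored
    with a text embedding and a 3D location, then queried with language."""

    def __init__(self, embed):
        self.embed = embed           # placeholder text encoder
        self.entries = []            # (description, location, embedding)

    def add(self, description: str, location: tuple[float, float, float]) -> None:
        self.entries.append((description, location, self.embed(description)))

    def query(self, text: str):
        q = self.embed(text)
        return max(self.entries, key=lambda e: cosine(q, e[2]))

# scene.add("bag of potato chips", (1.2, 0.4, 0.9))
# description, location, _ = scene.query("where are the chips?")
```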

Q: Does hierarchical reinforcement learning, which has emerged in recent years, provide any inspiration for complex task planning?

A: PaLM-SayCan is similar to hierarchical reinforcement learning in having low-level skills and high-level task planning, so it can be called a hierarchical method, but it is not hierarchical reinforcement learning. I personally prefer this layered approach, because when planning tasks you do not necessarily have to reason about every detailed step, which would waste time. Task planning can be trained on massive Internet data, but the low-level skills require physical data, so they have to be learned by interacting with the environment.

Q: When PaLM-SayCan is actually used on robots, are there any fundamental issues that remain unresolved? If it is to serve as a household helper, how long will that take to realize?

A: There are still some fundamental unresolved issues, and they are not simple engineering problems. In terms of principle, the robot's low-level motion control and grasping are a big challenge; we still cannot achieve a 100% grasping success rate, which is a big problem.

Of course, it can already provide some value to people with limited mobility. But as a true commercial product it is not yet feasible: the task success rate is around 90%, which does not meet commercial requirements.

Q: Is the success rate of robot planning limited by the training dataset?

A: The robot's planning ability is limited by the training corpus. Instructions like "throw away the garbage" are easy to find in the corpus, but instructions like "move the robot's two-finger gripper 10 centimeters to the right" are almost absent, because people do not leave such information on the Internet. This is a question of granularity: limited by the corpus, robots can currently only complete coarse-grained tasks.

On the other hand, fine-grained planning should not be done by the language model in the first place, because it involves too much physical information that is hard to describe in human language. One idea is to implement fine-grained operations with imitation learning (see the BC-Z work) or code generation (see the team's latest work, https://code-as-policies.github.io/). The larger role of the large model is to serve as the user's interface: interpreting the instructions humans give the robot and decomposing them into steps the machine can execute.
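
As a rough sketch of the code-generation idea (not the actual Code-as-Policies prompt or API), the language model can be asked to continue a prompt with a short program over low-level primitives; the primitives and sign conventions below are hypothetical.

```python
# Hypothetical sketch: the language model emits a short program over low-level
# primitives instead of planning each motion in natural language.
# The primitives below are placeholders, not a real robot API.

CODEGEN_PROMPT = """\
# move_gripper(dx, dy, dz): translate the gripper in meters
# close_gripper(), open_gripper()

# Instruction: move the gripper 10 centimeters to the right
move_gripper(dx=0.0, dy=-0.10, dz=0.0)  # sign convention is illustrative

# Instruction: {instruction}
"""

def generate_policy_code(instruction: str, llm_complete) -> str:
    """Ask the language model to continue the prompt with code for the new instruction."""
    return llm_complete(CODEGEN_PROMPT.format(instruction=instruction))
```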

In addition, language can handle high-level semantic planning without going into detailed physical planning. To achieve fine-grained planning, you still have to rely on imitation learning or reinforcement learning.
