How many steps does it take to put an elephant in the refrigerator? NVIDIA releases ProgPrompt, letting language models write plans for robots

WBOY
Release: 2023-04-14 12:49:08

For robots, Task Planning is an unavoidable problem.

To complete a real-world task, you must first know how many steps it takes to put an elephant in the refrigerator.

Even the relatively simple task of throwing away an apple contains multiple sub-steps. The robot must locate the apple, keep looking if it does not see it, approach the apple, grab it, and then find and approach the trash can.

If the trash can is closed, the robot must open it first, throw the apple in, and then close the trash can.

But humans cannot hand-design the implementation details of every task. The problem is how to generate an action sequence from a single command.

Generating a sequence from a command? Isn't that exactly the job of a language model?

In the past, researchers have used large language models (LLMs) to score the potential next action space based on input task instructions and then generate action sequences.

Instructions are described in natural language and do not contain additional domain information.

But such methods either need to enumerate and score every possible next action, or they generate free-form text with no constraints, which may contain actions that a specific robot cannot perform in its current environment.

Recently, the University of Southern California and NVIDIA jointly introduced ProgPrompt, which also uses a language model to plan tasks from input instructions. Its programmatic prompt structure lets the generated plans work across different environments, robots with different capabilities, and different tasks.

To keep tasks in a standard form, the researchers use Python-style code as the prompt, telling the language model which actions are available, what objects are in the environment, and which programs are executable.

For example, given the command "throw apple", the model generates a Python-style program for the task.

ProgPrompt achieved state-of-the-art performance on VirtualHome tasks, and the researchers also deployed it on a physical robot arm for tabletop tasks.

Magical Language Model

Completing daily household tasks requires both a common sense understanding of the world and situational knowledge of the current environment.

To create a task plan for "cooking dinner", the minimum knowledge the agent needs includes:

The functions of objects, for example that stoves and microwave ovens can be used for heating; the logical order of actions, for example that the oven must be preheated before food is added; and the task relevance of objects and actions, for example that heating and finding ingredients are the actions most relevant to "dinner".

But without state feedback, this kind of reasoning cannot be carried out.

The agent needs to know where food is in the current environment, such as whether there is fish or chicken in the refrigerator.

Autoregressive large language models trained on large corpora can generate text sequences conditioned on an input prompt, and they generalize well across tasks.

For example, if you enter "make dinner", the language model can generate subsequent sequences, such as opening the refrigerator, picking up the chicken, picking up the soda, closing the refrigerator, turning on the light switch, etc.

The generated text must then be mapped to the agent's action space. For example, if the generated instruction is "reach out and pick up a jar of pickles", the corresponding executable action may be "pick up jar". The model then computes a probability score for each candidate action.
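The mapping from free-form text to an executable action can be sketched as follows. This is only a toy illustration: the action vocabulary is hypothetical, and the word-overlap score is a simple stand-in for the probability score a language model would assign, which the article does not specify.

```python
# Hypothetical action vocabulary; a real agent would expose its own list.
ACTIONS = ["pick up jar", "open fridge", "close fridge", "switch on light"]


def overlap_score(generated: str, action: str) -> float:
    """Stand-in score: Jaccard overlap of word sets, NOT a language-model probability."""
    g = set(generated.lower().split())
    a = set(action.lower().split())
    return len(g & a) / len(g | a)


def map_to_action(generated: str, actions=ACTIONS):
    """Return the (action, score) pair with the highest score."""
    return max(((a, overlap_score(generated, a)) for a in actions), key=lambda p: p[1])


best_action, best_score = map_to_action("reach out and pick up a jar of pickles")
print(best_action, round(best_score, 2))
```

Note that every candidate action has to be scored, which is the enumeration cost that earlier scoring-based methods pay.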

But without environmental feedback, if there is no chicken in the refrigerator and the model still chooses "pick up the chicken", the task will fail, because "make dinner" carries no information about the state of the world.

ProgPrompt makes clever use of programming language structure in task planning, since existing large language models are typically pre-trained on corpora that include programming tutorials and code documentation.

ProgPrompt provides the language model with a Pythonic program header as a prompt, importing the available action space, expected parameters, and available objects in the environment.

The prompt then defines functions such as make_dinner and throw_away_banana, whose bodies are sequences of actions that operate on objects. Environment state feedback is incorporated by asserting the preconditions of the plan, such as being close to the refrigerator before trying to open it, and by responding to assertion failures with recovery actions.

Most importantly, ProgPrompt programs also include natural language comments that explain the goals of the actions, which improves the task success rate of the generated plans.

ProgPrompt

With the idea in place, the overall ProgPrompt workflow is clear. It has three parts: constructing Pythonic functions, constructing the programming language prompt, and generating and executing the task plan.

1. Express the robot plan as a Pythonic function

A plan function includes API calls to action primitives, comments that summarize actions, and assertions that track execution.

Each action primitive takes an object as a parameter. For example, the "put salmon in the microwave" task includes a call to find(salmon), where find is an action primitive.

Comments in the code provide natural language summaries of the action sequences that follow. They help break a high-level task into logical subtasks, such as "grab the salmon" and "put the salmon in the microwave".

Comments also let the language model keep track of the current goal and reduce the chance of incoherent, inconsistent, or repeated output, much as chain-of-thought prompting generates intermediate results.

Assertions provide an environment feedback mechanism: they check that preconditions hold and trigger error recovery when they do not. For example, before a grab action the plan asserts that the agent is close to the salmon; otherwise the agent must first perform a find action.
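The three ingredients of a plan function (primitive calls, comments, and assertions with recovery) might look like the sketch below. It is not the paper's exact syntax: the toy world state, the primitives open_object and close_object, and the ensure helper are all hypothetical names introduced so that the example runs as ordinary Python.

```python
# Toy world state and primitives; all names are hypothetical stand-ins.
state = {"close_to": None, "holding": None, "open": set(), "inside": {}}
log = []  # every primitive that actually runs, in order


def find(obj):
    state["close_to"] = obj
    log.append(f"find({obj})")


def grab(obj):
    state["holding"] = obj
    log.append(f"grab({obj})")


def open_object(obj):
    state["open"].add(obj)
    log.append(f"open_object({obj})")


def close_object(obj):
    state["open"].discard(obj)
    log.append(f"close_object({obj})")


def putin(obj, container):
    state["inside"][obj] = container
    state["holding"] = None
    log.append(f"putin({obj}, {container})")


def ensure(condition, recover):
    """Assertion with recovery: run `recover` only if the precondition fails."""
    if not condition():
        recover()


def microwave_salmon():
    # 1: grab the salmon
    find("salmon")
    ensure(lambda: state["close_to"] == "salmon", lambda: find("salmon"))
    grab("salmon")
    # 2: put the salmon in the microwave
    find("microwave")
    ensure(lambda: "microwave" in state["open"], lambda: open_object("microwave"))
    putin("salmon", "microwave")
    close_object("microwave")


microwave_salmon()
print(log)
```

In this run the first check passes silently, while the second one fails because the microwave starts closed, so the recovery action opens it before the salmon goes in.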

2. Construct the programming language prompt

The prompt must give the language model information about the environment and the main actions, including observations, action primitives, and examples, producing a Pythonic prompt for the language model to complete.

The language model then completes the prompt with an executable function, here microwave_salmon().

In the microwave salmon task, a reasonable first step for an LLM to generate is "take out the salmon", but the agent responsible for executing the plan may not have such an action primitive.

To let the language model know the agent's action primitives, they are imported through an import statement in the prompt, which also restricts the output to functions available in the current environment.

To change the agent's action space, you only need to update the list of imported functions.

The variable objects provides all available objects in the environment as a list of strings.

The prompt also includes several fully executable program plans as examples. Each example, such as throw_away_lime, demonstrates how to complete a given task using the actions and objects available in a given environment.
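Putting the pieces of this step together, a prompt could be assembled roughly as follows. This is a sketch under assumptions: the module name actions, the exact layout of the prompt, and the body of the throw_away_lime example are illustrative guesses, not the prompt used by the authors.

```python
# Hypothetical example plan shown to the model; the body is illustrative only.
EXAMPLE_PLAN = """def throw_away_lime():
    # 1: grab the lime
    find('lime')
    grab('lime')
    # 2: put the lime in the garbage can
    find('garbagecan')
    putin('lime', 'garbagecan')"""


def build_prompt(actions, objects, examples, task_name):
    """Assemble import line, object list, example plans, and the open function header."""
    import_line = "from actions import " + ", ".join(actions)
    objects_line = f"objects = {objects!r}"
    header = f"def {task_name}():"
    return "\n\n".join([import_line, objects_line, *examples, header])


prompt = build_prompt(
    actions=["find", "grab", "putin", "open", "close"],
    objects=["salmon", "microwave", "lime", "garbagecan"],
    examples=[EXAMPLE_PLAN],
    task_name="microwave_salmon",
)
print(prompt)
```

The prompt ends with an unfinished function header, so the natural completion for the language model is the body of that function. Changing the agent or the scene only means passing a different actions or objects list.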

3. Generate and execute the task plan

Given a task, the plan is inferred entirely by the language model from the ProgPrompt prompt. The generated plan can then be executed on a virtual agent or a physical robot system, which requires an interpreter to execute each action command against the environment.

During execution, assertion checks are performed in a closed-loop manner and feedback is provided based on the current environment state.
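A very small interpreter for such plans might look like this. It is a sketch, not the authors' implementation: it assumes the plan consists only of comment lines and simple calls with string arguments, supports just two hypothetical primitives, and folds the closed-loop check for grab into the interpreter itself.

```python
import ast


def parse_step(line):
    """Return (action, args) for a line like grab('salmon'); None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    call = ast.parse(line, mode="eval").body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not an action call: {line}")
    return call.func.id, [ast.literal_eval(arg) for arg in call.args]


def execute(plan_text, state):
    """Run each step against `state` and return the trace of executed actions."""
    trace = []
    for line in plan_text.splitlines():
        step = parse_step(line)
        if step is None:
            continue
        action, args = step
        if action == "find":
            state["close_to"] = args[0]
        elif action == "grab":
            # Closed-loop check: if the agent is not near the object, find it first.
            if state["close_to"] != args[0]:
                state["close_to"] = args[0]
                trace.append(("find", args[0]))
            state["holding"] = args[0]
        else:
            raise ValueError(f"action not available to this agent: {action}")
        trace.append((action, *args))
    return trace


state = {"close_to": None, "holding": None}
plan = "# 1: grab the salmon\ngrab('salmon')"
trace = execute(plan, state)
print(trace)
```

Parsing with the ast module rather than calling exec on the generated text keeps the interpreter in control of which actions can run, so a step the agent does not support is rejected instead of executed.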

In the experiments, the researchers evaluated the method on the VirtualHome (VH) simulation platform.

A VH state consists of a set of objects and their properties, such as the salmon being inside the microwave (in) or the agent being close to an object (agent_close_to).

The action space includes grab, putin, putback, walk, find, open, close, and so on.

The experiments used 3 VH environments, each containing 115 distinct objects. The researchers created a dataset of 70 household tasks with high-level, abstract instructions such as "microwave salmon", and wrote a ground-truth action sequence for each.

The generated programs were evaluated in VirtualHome using success rate (SR), goal condition recall (GCR), and executability (Exec). The results show that ProgPrompt significantly outperforms the baselines and LangPrompt, and they also show how each feature improves performance.
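The article names the three metrics without defining them. The sketch below shows one plausible reading, which should be treated as an assumption: success means every goal condition holds in the final state, GCR is the fraction of goal conditions achieved, and Exec is the fraction of plan actions the agent can actually execute. The goal conditions and action names are made up for illustration.

```python
def success(goal: set, final: set) -> bool:
    """Assumed definition: the episode succeeds if all goal conditions hold at the end."""
    return goal <= final


def goal_condition_recall(goal: set, final: set) -> float:
    """Assumed definition: fraction of goal conditions present in the final state."""
    return len(goal & final) / len(goal)


def executability(plan_actions: list, available: set) -> float:
    """Assumed definition: fraction of planned actions that the agent supports."""
    return sum(action in available for action in plan_actions) / len(plan_actions)


# Hypothetical episode: the salmon ends up in the microwave, but the door stays open.
goal = {("salmon", "in", "microwave"), ("microwave", "closed")}
final = {("salmon", "in", "microwave")}
plan_actions = ["find", "grab", "takeout", "putin"]
available = {"find", "grab", "putin", "open", "close"}

sr = success(goal, final)
gcr = goal_condition_recall(goal, final)
exec_rate = executability(plan_actions, available)
print(sr, gcr, exec_rate)
```

Averaging these per-episode values over all tasks would give table-level numbers; SR in particular becomes the fraction of successful episodes.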

The researchers also ran real-world experiments using a Franka Emika Panda robot with a parallel-jaw gripper, assuming that a pick-and-place policy is available.

This policy takes two point clouds, of the target object and the target container, as input and performs a pick-and-place operation to put the object on or inside the container.

The system uses the open-vocabulary object detection model ViLD to identify and segment objects in the scene and to build the list of available objects in the prompt.

Unlike in the virtual environment, the object list here is a local variable of each planning function, which allows more flexibility to adapt to new objects.

The plan output by the language model contains function calls in the form of grab and putin.

Due to real-world uncertainties, the assertion-based closed-loop option was not implemented in the experimental setup.
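A plan for the physical setup could then look roughly like the sketch below. Everything here is hypothetical: the function name, the detected object names, and the recording stubs for grab and putin only illustrate a plan with a local object list, open-loop execution, and no assertions.

```python
calls = []  # records primitive calls instead of moving a real arm


def grab(obj):
    calls.append(("grab", obj))


def putin(obj, container):
    calls.append(("putin", obj, container))


def sort_fruits_on_plate_and_bottles_in_box(detected):
    # The object list is local to this plan, built from the detector's output.
    objects = list(detected)
    # 1: put each fruit on the plate
    for fruit in ("banana", "strawberry"):
        if fruit in objects:
            grab(fruit)
            putin(fruit, "plate")
    # 2: put the bottle in the box
    if "bottle" in objects:
        grab("bottle")
        putin("bottle", "box")


sort_fruits_on_plate_and_bottles_in_box(["banana", "strawberry", "bottle", "plate", "box"])
print(calls)
```

Because the object list is passed in per call rather than fixed in a shared prompt header, the same plan function adapts when the detector reports a different set of objects.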

In the sorting task, the robot identified the banana and strawberry as fruits and generated plan steps to place them on the plate and put the bottle in the box.
