Almost at the same time as Stanford's shrimp-frying, dishwashing robot, Google DeepMind also released its latest embodied intelligence results.
And it came as three releases in a row:
First, a new model that focuses on improving decision-making speed: the robot's operating speed (compared with the original Robotics Transformer) increased by 14%, and the speedup did not cost quality, since accuracy also rose by 10.6%.
Then there is a new framework specializing in generalization, which creates motion trajectory prompts for the robot and lets it face 41 never-before-seen tasks, achieving a 63% success rate.
Don't underestimate this figure: compared with the previous 29%, the improvement is substantial.
Finally, a robot data collection system that can manage 20 robots at a time and has so far collected 77,000 trials from their activities, which will help Google do a better job of subsequent training.
So, what are these three results specifically? Let’s look at them one by one.
Google pointed out that to realize a robot that can truly enter the real world, two basic challenges need to be solved.
1. Generalize to new tasks
2. Improve decision-making speed
The first two of these three results are improvements in these two areas, and both are built on Google's base robot model, Robotics Transformer (RT for short).
Let’s first look at the first one: RT-Trajectory that helps robots generalize.
For humans, tasks such as cleaning a table are easy to understand, but robots do not understand them very well.
But fortunately, we can convey this instruction to it in a variety of possible ways, so that it can take actual physical actions.
Generally speaking, the traditional way is to map the task onto specific actions and then let the robot arm carry them out. For example, wiping the table can be broken down into "close the gripper, move left, move right".
Obviously, the generalization ability of this method is very poor.
Here, Google’s newly proposed RT-Trajectory teaches the robot to complete tasks by providing visual cues.
Specifically, robots controlled by RT-Trajectory are trained on data augmented with 2D trajectories.
These trajectories are presented as RGB images that include routes and key points, providing low-level but very useful hints as the robot learns to perform tasks.
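To make the idea of a trajectory "presented as an RGB image" concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not RT-Trajectory's actual rendering code: the function name render_trajectory, the 8x8 image size, and the color choices (red for the route, green for key points) are all hypothetical.

```python
# Hypothetical sketch: rasterize a 2D trajectory into a small RGB image.
# The image size, colors, and key point convention are assumptions.

def render_trajectory(waypoints, keypoints, height=8, width=8):
    """Return a height x width image as nested lists of (r, g, b) tuples."""
    black, red, green = (0, 0, 0), (255, 0, 0), (0, 255, 0)
    image = [[black for _ in range(width)] for _ in range(height)]

    # Draw the route by interpolating pixels between consecutive waypoints.
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for i in range(steps + 1):
            x = round(x0 + (x1 - x0) * i / steps)
            y = round(y0 + (y1 - y0) * i / steps)
            image[y][x] = red

    # Mark key points (for example, where the gripper closes) on top.
    for x, y in keypoints:
        image[y][x] = green
    return image


image = render_trajectory(
    waypoints=[(1, 1), (5, 1), (5, 6)],
    keypoints=[(1, 1), (5, 6)],
)
```

An image like this could then be given to the policy alongside the camera frame, so that the hint travels in the same visual form as the rest of the input.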
With this model, the success rate of robots performing never-before-seen tasks has more than doubled (from 29% to 63%, compared with Google's base robot model RT-2).
What is also worth mentioning is that RT-Trajectory can create trajectories in a variety of ways: by watching human demonstrations, by accepting hand-drawn sketches, or by generating them with a VLM (vision-language model).
With generalization covered, let's turn to decision-making speed.
Google’s RT model uses the Transformer architecture. Although the Transformer is powerful, it relies heavily on the attention module with quadratic complexity.
Therefore, once the input to the RT model is doubled (for example, by equipping the robot with a higher-resolution sensor), the computational resources required to process it roughly quadruple. This severely slows down decision-making.
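The arithmetic behind "double the input, quadruple the cost" can be checked with a few lines of Python. This is a simplified sketch that counts only the query-key pairs scored by standard self-attention; the token count of 256 is a made-up value for illustration.

```python
def attention_pair_count(num_tokens):
    """Number of query-key pairs scored by standard self-attention."""
    return num_tokens * num_tokens


base_tokens = 256  # hypothetical input length, chosen only for illustration
base_cost = attention_pair_count(base_tokens)
doubled_cost = attention_pair_count(2 * base_tokens)

# Doubling the input multiplies the number of pairs by four.
ratio = doubled_cost / base_cost
print(ratio)
```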
To improve the speed of robots, Google developed SARA-RT on top of the base Robotics Transformer model.
SARA-RT uses a new model fine-tuning method to make the original RT model more efficient.
Google calls this method "up-training". Its main function is to convert the original quadratic complexity into linear complexity while maintaining processing quality.
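Why can attention be made linear at all? The sketch below shows generic kernelized linear attention, which is an assumption on our part and not necessarily the exact mechanism inside SARA-RT. The feature map elu(x) + 1 and all function names are illustrative choices. Once softmax is replaced by a feature map, the keys and values can be summarized a single time instead of being compared with every query.

```python
import math


def feature_map(vector):
    """Positive feature map elu(x) + 1 (an illustrative choice)."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vector]


def quadratic_attention(queries, keys, values):
    """Kernel attention scored pair by pair: work grows as n * n."""
    phi_q = [feature_map(q) for q in queries]
    phi_k = [feature_map(k) for k in keys]
    dim_v = len(values[0])
    outputs = []
    for q in phi_q:
        weights = [sum(a * b for a, b in zip(q, k)) for k in phi_k]
        total = sum(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim_v)
        ])
    return outputs


def linear_attention(queries, keys, values):
    """Same output, but keys and values are summarized once: work grows as n."""
    phi_q = [feature_map(q) for q in queries]
    phi_k = [feature_map(k) for k in keys]
    dim_k, dim_v = len(phi_k[0]), len(values[0])
    # kv_summary[i][j] = sum over tokens of phi(k)[i] * v[j]
    kv_summary = [
        [sum(k[i] * v[j] for k, v in zip(phi_k, values)) for j in range(dim_v)]
        for i in range(dim_k)
    ]
    k_sum = [sum(k[i] for k in phi_k) for i in range(dim_k)]
    outputs = []
    for q in phi_q:
        normalizer = sum(q[i] * k_sum[i] for i in range(dim_k))
        outputs.append([
            sum(q[i] * kv_summary[i][j] for i in range(dim_k)) / normalizer
            for j in range(dim_v)
        ])
    return outputs


queries = [[0.5, -1.0], [1.0, 2.0], [-0.3, 0.2]]
keys = [[0.1, 0.4], [-0.7, 0.9], [1.2, -0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

slow = quadratic_attention(queries, keys, values)
fast = linear_attention(queries, keys, values)
max_diff = max(abs(a - b) for row_s, row_f in zip(slow, fast)
               for a, b in zip(row_s, row_f))
```

The two functions agree up to floating-point error, but neither reproduces softmax attention exactly. That gap is presumably why a fine-tuning step such as up-training is needed to keep quality after the switch.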
When SARA-RT is applied to the RT-2 model with billions of parameters, the latter can achieve faster operation speeds and higher accuracy on a variety of tasks.
It is also worth mentioning that SARA-RT provides a general method for accelerating Transformers without expensive pre-training, so it can be widely adopted.
Finally, to help robots better understand the tasks assigned by humans, Google also started from the data side and built a collection system directly: AutoRT.
This system combines large models (including an LLM and a VLM) with the robot control model (RT) to continuously direct robots to perform various tasks in the real world, generating and collecting data along the way.
The specific process is as follows:
Let the robot "freely" contact the environment and get close to the target.
Then the camera and a VLM are used to describe the scene in front of the robot, including the specific items in it.
Then, LLM uses this information to generate several different tasks.
Note that the robot does not execute tasks immediately after they are generated. Instead, the LLM is used to sort them into tasks that can be completed independently, tasks that require human remote control, and tasks that simply cannot be completed.
One example of a task that cannot be completed is "opening the bag of potato chips", because it requires two robotic arms (the robot has only one by default).
Then, once this screening is done, the robot actually executes the selected task.
Finally, the AutoRT system completes data collection and conducts diversity assessment.
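The steps above can be sketched as a small Python loop. Everything here is a hypothetical stand-in: describe_scene plays the role of the VLM, propose_tasks and screen_task play the role of the LLM, and the screening rules are invented for illustration. None of this is AutoRT's real interface.

```python
# Hypothetical stand-ins for the VLM, the LLM, and the screening step.

def describe_scene(camera_frame):
    """Stand-in for the VLM: return the objects visible in the frame."""
    return camera_frame["objects"]


def propose_tasks(objects):
    """Stand-in for the LLM task generator."""
    tasks = [f"pick up the {name}" for name in objects]
    if "bag of chips" in objects:
        tasks.append("open the bag of chips")
    return tasks


def screen_task(task, num_arms=1):
    """Stand-in for the LLM filter: 'autonomous', 'teleoperate', or 'reject'."""
    if task.startswith("open") and num_arms < 2:
        return "reject"  # opening a bag needs two arms
    if "sponge" in task:
        return "teleoperate"  # arbitrary rule, for illustration only
    return "autonomous"


def collect_episodes(camera_frame):
    """Describe the scene, propose tasks, screen them, keep the feasible ones."""
    objects = describe_scene(camera_frame)
    episodes = []
    for task in propose_tasks(objects):
        decision = screen_task(task)
        if decision == "reject":
            continue
        episodes.append({"task": task, "mode": decision})
    return episodes


episodes = collect_episodes({"objects": ["sponge", "bag of chips"]})
# A very crude diversity measure: the number of distinct tasks collected.
unique_task_count = len({episode["task"] for episode in episodes})
```

In this toy run, "open the bag of chips" is filtered out because the robot has a single arm, which mirrors the example in the text.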
According to reports, AutoRT can coordinate up to 20 robots at a time. Within 7 months, it collected a total of 77,000 trials covering 6,650 unique tasks.
Finally, for this system, Google also emphasizes safety.
After all, AutoRT’s collection tasks affect the real world, and “safety guardrails” are indispensable.
Specifically, a set of basic safety rules is provided to the LLM that performs task screening for the robots. The rules are partly inspired by Isaac Asimov's Three Laws of Robotics: first and foremost, a robot must not harm humans.
The second requirement is that the robot must not attempt tasks involving humans, animals, sharp objects or electrical appliances.
But this is not enough.
So AutoRT is also equipped with multiple layers of practical safety measures found in conventional robotics.
For example, the robot automatically stops when the force on its joints exceeds a given threshold, all actions can be halted by a physical switch that stays within a human's line of sight, and so on.
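The two layers of protection can be illustrated with a short Python sketch. It rests on strong assumptions: in AutoRT the rules are given to an LLM, whereas the keyword check below is only a crude stand-in, and the keyword list, function names, and threshold value are all hypothetical.

```python
# Hypothetical sketch of the two safety layers described above.
# In the real system an LLM applies the rules; keyword matching is a stand-in.

FORBIDDEN_KEYWORDS = ("human", "animal", "sharp", "electrical")


def violates_rules(task):
    """Rule layer: reject tasks that mention a forbidden category."""
    text = task.lower()
    return any(keyword in text for keyword in FORBIDDEN_KEYWORDS)


def should_stop(joint_forces, force_threshold, stop_switch_pressed):
    """Hardware layer: stop on excessive joint force or a pressed switch."""
    over_limit = any(abs(force) > force_threshold for force in joint_forces)
    return stop_switch_pressed or over_limit


rejected = violates_rules("pick up the sharp knife")
allowed = not violates_rules("wipe the table")
halted = should_stop([1.0, -6.0], force_threshold=5.0, stop_switch_pressed=False)
```

The point of keeping both layers is that the second one does not depend on the language model being right: even if a risky task slips through screening, the force limit and the stop switch still apply.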
Want to know more about these latest results from Google?
Good news: except for RT-Trajectory, which only has an online paper, the rest have been released with both code and paper, and everyone is welcome to check them out.
Speaking of Google robots, we have to mention RT-2 (which all the results in this article are based on).
This model was built by 54 Google researchers over 7 months and came out at the end of July this year.
It embeds a vision-language multimodal large model (VLM), so it can not only understand human language but also reason about it and carry out tasks that cannot be done in one step, such as accurately picking up the "extinct animal" from three plastic toys: a lion, a whale, and a dinosaur. That is impressive.
Now, in just over 5 months, its generalization ability and decision-making speed have both improved rapidly, which makes us marvel: it is hard to imagine how soon robots will really make their way into thousands of households.