


Let the robot sense your 'here you are': the Tsinghua team uses millions of scenarios to build generalizable human-to-robot handover
Researchers from the Institute for Interdisciplinary Information Sciences at Tsinghua University have proposed a framework called "GenH2R" that lets robots learn a generalizable, vision-based human-to-robot handover policy. With it, a robot can more reliably receive objects of diverse shapes handed over along complex motion trajectories, opening new possibilities for human-robot interaction. The work is an important step forward for the field and brings greater flexibility and adaptability to robots deployed in real-life scenarios.
With the advent of the era of embodied intelligence (Embodied AI), we expect agents to actively interact with their environment. Integrating robots into human living spaces and having them interact with people (Human-Robot Interaction) has therefore become crucial. We need to consider how a robot can understand human behavior and intentions and meet human needs in the way people expect, putting humans at the center of embodied intelligence (Human-Centered Embodied AI). One key skill is generalizable human-to-robot handover, which enables robots to cooperate with people on a variety of common daily tasks such as cooking, tidying the home, and assembling furniture.
The explosive development of large models suggests that large-scale learning from massive amounts of high-quality data is a plausible path toward general intelligence. Can generalizable human-to-robot handover skills likewise be obtained from massive robot data and large-scale policy imitation? The problem is that large-scale interactive learning between robots and humans in the real world is dangerous and expensive, and the robot could easily injure a person. A safer alternative is to train in a simulation environment, where character simulation and dynamic grasping motion planning automatically provide a large amount of diverse robot learning data, and then transfer the learned policy to a real robot. This learning-based paradigm, known as sim-to-real transfer, can significantly improve the robot's ability to collaborate and interact with humans while being far more reliable.
Therefore, the "GenH2R" framework was proposed, starting from three perspectives: Simulation, Demonstration, and Imitation. ,Let the robot learn universal handover for any grasping method, any handover trajectory, and any object geometry for the first time based on an end-to-end approach: 1) Provides millions of levels in the "GenH2R-Sim" environment Various complex simulation handover scenarios that are easy to generate, 2) introduce a set of automated expert demonstrations (Expert Demonstrations) generation process based on vision-action collaboration, 3) use imitation learning based on 4D information and prediction assistance (point cloud time) (Imitation Learning) method.
Compared with the SOTA method (a CVPR 2023 Highlight), GenH2R improves the average success rate on various test sets by 14%, shortens handover time by 13%, and is more robust in real-robot experiments.
- Paper address: https://arxiv.org/abs/2401.00929
- Paper homepage: https://GenH2R.github.io
- Paper video: https://youtu.be/BbphK5QlS1Y
Method introduction
A. Simulation environment (GenH2R-Sim)
To generate high-quality, large-scale human hand-object datasets, the GenH2R-Sim environment models the scene in terms of both grasping poses and motion trajectories.
For grasping poses, GenH2R-Sim draws on the rich 3D object models in ShapeNet, selects 3,266 everyday objects suitable for handover, and uses a dexterous grasp generation method (DexGraspNet) to produce a total of 1 million scenes of a human hand grasping an object. For motion trajectories, GenH2R-Sim fits multiple smooth Bézier curves through several control points and adds rotation of the hand and object, simulating the various complex trajectories along which objects are handed over.
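To make the trajectory generation concrete, here is a minimal sketch (not the released GenH2R-Sim code) of how a smooth handover trajectory could be synthesized from a few control points with a Bézier curve; the function names, step counts, and workspace bounds are illustrative assumptions:

```python
import numpy as np
from math import comb

def bezier_trajectory(control_points, num_steps=200):
    """Evaluate a Bezier curve defined by (K+1, 3) control points at
    num_steps evenly spaced parameter values, via the Bernstein basis."""
    control_points = np.asarray(control_points, dtype=float)
    k = len(control_points) - 1                                 # curve degree
    t = np.linspace(0.0, 1.0, num_steps)                        # (T,)
    basis = np.stack([comb(k, i) * t**i * (1 - t)**(k - i)      # Bernstein polynomials
                      for i in range(k + 1)], axis=1)           # (T, K+1)
    return basis @ control_points                               # (T, 3) positions

# Example: a curved approach from a random start position toward a
# hypothetical handover region in front of the robot.
rng = np.random.default_rng(0)
start = rng.uniform([-0.6, -0.6, 0.2], [0.6, 0.6, 0.8])    # random start position (m)
goal = np.array([0.4, 0.0, 0.4])                           # assumed handover region
mid = (start + goal) / 2 + rng.normal(scale=0.15, size=3)  # perturbed control point
positions = bezier_trajectory([start, mid, goal])

# A slowly varying hand/object rotation can be layered on top, e.g. by
# interpolating a yaw angle along the curve (a simplification of the paper's setup).
yaw = np.linspace(0.0, rng.uniform(-1.0, 1.0), len(positions))
```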
With its 1 million scenes, GenH2R-Sim far exceeds the latest prior work both in the number of motion trajectories (1 million vs. 1,000) and in the number of objects (3,266 vs. 20). It also introduces interaction behavior close to the real situation (for example, once the robot arm is close enough to the object, the human stops moving and waits for the handover to finish) rather than simple trajectory playback. Although simulated data is not perfectly realistic, the experiments show that large-scale simulated data is more conducive to learning than small-scale real data.
B. Large-scale generation of distillation-friendly expert demonstrations
Based on the large-scale hand-object motion trajectory data, GenH2R automatically generates a large number of expert demonstrations. The "experts" GenH2R relies on are improved motion planners (such as OMG Planner). These methods are non-learning, optimization-based controllers that do not rely on visual point clouds and usually require privileged scene state (such as the target grasp pose on the object). To ensure that the downstream visual policy network can distill useful information, the key is that the expert demonstrations exhibit vision-action correlation. If the final grasp pose is known at planning time, the arm can ignore vision, plan directly to the final position, and simply wait there; the robot's camera may then never see the object, and such a demonstration is useless for the visual policy network. Conversely, if the arm replans too frequently according to the object's position, its motion becomes discontinuous and contorted, and a reasonable grasp cannot be completed.
To generate distillation-friendly expert demonstrations, GenH2R introduces Landmark Planning. The hand's trajectory is divided into multiple segments according to smoothness and distance, with landmarks as segment boundaries. Within each segment the hand trajectory is smooth and the expert plans toward the landmark point. This approach ensures both vision-action correlation and continuity of motion.
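The segmentation idea can be sketched as follows. This is a hedged illustration rather than the paper's exact criterion: a landmark is placed whenever the path length accumulated in the current segment grows too long or the hand's direction of motion changes sharply; the thresholds and function names are assumptions.

```python
import numpy as np

def segment_by_landmarks(hand_positions, max_seg_length=0.25, angle_thresh=0.5):
    """Split a (T, 3) hand trajectory into segments and return landmark indices.

    A new landmark is placed when the path length accumulated in the current
    segment exceeds max_seg_length (metres) or the motion direction turns by
    more than angle_thresh (radians), i.e. the trajectory stops being smooth.
    """
    hand_positions = np.asarray(hand_positions, dtype=float)
    landmarks, seg_length = [], 0.0
    for i in range(1, len(hand_positions)):
        step = hand_positions[i] - hand_positions[i - 1]
        seg_length += np.linalg.norm(step)
        turned = False
        if i >= 2:
            prev = hand_positions[i - 1] - hand_positions[i - 2]
            denom = np.linalg.norm(step) * np.linalg.norm(prev)
            if denom > 1e-8:
                cos_a = np.clip(step @ prev / denom, -1.0, 1.0)
                turned = np.arccos(cos_a) > angle_thresh
        if seg_length > max_seg_length or turned:
            landmarks.append(i)                     # this frame becomes a landmark
            seg_length = 0.0
    landmarks.append(len(hand_positions) - 1)       # final frame is always a landmark
    return landmarks

# Within each segment, the expert planner (an OMG-Planner-style optimizer) is
# asked to plan toward the grasp pose associated with the next landmark, so the
# arm's target changes only at landmarks rather than at every frame.
```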
C. Prediction-assisted 4D imitation learning network
Based on the large-scale expert demonstrations, GenH2R uses imitation learning to build a 4D policy network that decomposes the observed point cloud sequence into geometry and motion. For each frame, the pose transformation with respect to the previous frame's point cloud is estimated with the iterative closest point (ICP) algorithm, yielding per-point flow so that every frame carries motion features. Each frame is then encoded with PointNet, and the network decodes not only the required 6D egocentric action but also a prediction of the object's future pose, strengthening the policy network's ability to anticipate hand and object motion.
Unlike more complex 4D backbones (such as Transformer-based ones), this architecture is fast at inference time and better suited to the low-latency requirements of a human-robot handover scenario, while still exploiting temporal information effectively, striking a balance between simplicity and effectiveness.
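As a rough illustration of this kind of architecture, the PyTorch-style sketch below encodes each frame's points (xyz plus ICP-estimated flow) with a minimal PointNet and decodes a 6D egocentric action together with a future object-pose prediction; the module structure, feature dimensions, and prediction horizon are assumptions for readability, not the released GenH2R implementation.

```python
import torch
import torch.nn as nn

class PointNetEncoder(nn.Module):
    """Shared per-point MLP followed by max pooling -- a minimal PointNet."""
    def __init__(self, in_dim=6, feat_dim=256):    # per point: xyz + 3D flow
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, pts):                         # pts: (B, N, in_dim)
        return self.mlp(pts).max(dim=1).values      # (B, feat_dim)

class HandoverPolicy(nn.Module):
    """Decodes a 6D egocentric action and a future object-pose prediction."""
    def __init__(self, feat_dim=256, horizon=3):
        super().__init__()
        self.encoder = PointNetEncoder(feat_dim=feat_dim)
        self.action_head = nn.Linear(feat_dim, 6)             # translation + rotation
        self.future_head = nn.Linear(feat_dim, horizon * 6)   # predicted future poses

    def forward(self, pts_with_flow):
        feat = self.encoder(pts_with_flow)
        return self.action_head(feat), self.future_head(feat)

# During training, per-point flow would be obtained by registering consecutive
# frames with ICP (e.g. open3d.pipelines.registration.registration_icp) and
# applying the resulting transform to the previous frame; the imitation loss
# combines the action error against the expert with the future-pose error.
policy = HandoverPolicy()
action, future = policy(torch.randn(2, 1024, 6))   # 2 frames, 1024 points each
```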
Experiment
A. Simulation environment experiment
GenH2R was compared with the SOTA method under various settings. Relative to training on small-scale real data, training on large-scale simulated data from GenH2R-Sim yields significant gains (the success rate increases by 14% on average across test sets and handover time is shortened by 13%).
On the real-data test set s0, GenH2R successfully hands over more complex objects and adjusts the gripper's pose in advance, avoiding frequent pose adjustments once the gripper is already close to the object:
On the simulated test set t0 (introduced by GenH2R-Sim), GenH2R's method predicts the object's future pose and thus approaches it along a more reasonable trajectory:
On the real-data test set t1 (introduced by GenH2R-Sim from HOI4D, about 7 times larger than the s0 test set of prior work), GenH2R's method generalizes to unseen real-world objects with different geometries.
B. Real machine experiment
GenH2R also deploys the learned policy on a real robot arm, completing the sim-to-real transfer.
For more complex motion trajectories (such as rotation), GenH2R's policy shows stronger adaptability; for more complex geometries, its method shows stronger generalizability:
GenH2R completed real-robot tests and a user study on a variety of handover objects, demonstrating strong robustness.
For more information on experiments and methods, please refer to the paper homepage.
Team introduction
The paper comes from the Tsinghua University 3DVICI Lab, the Shanghai Artificial Intelligence Laboratory, and the Shanghai Qi Zhi Institute. The authors are Tsinghua University students Wang Zifan (co-first author), Chen Junyu (co-first author), Chen Ziqing, and Xie Pengwei; the advisors are Yi Li and Chen Rui.
Tsinghua University's 3D Vision Computing and Machine Intelligence Laboratory (3DVICI Lab) is an artificial intelligence laboratory under the Institute for Interdisciplinary Information Sciences at Tsinghua University, established and directed by Professor Yi Li. 3DVICI Lab targets the most cutting-edge problems in general 3D vision and intelligent robot interaction; its research directions cover embodied perception, interaction planning and generation, and human-robot collaboration, and are closely tied to application areas such as robotics, virtual reality, and autonomous driving. The team's goal is to enable intelligent agents to understand and interact with the three-dimensional world, and its results have been published at major top computer science conferences and journals.
