The AIxiv column is where this site publishes academic and technical content. In the past few years, the AIxiv column has received more than 2,000 reports covering top laboratories at major universities and companies worldwide, effectively promoting academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
Embodied intelligence is regarded as the only path to general artificial intelligence. Its core idea is that intelligent agents complete complex tasks by interacting with digital space and the physical world. In recent years, multimodal large models and robotics have made great progress, and embodied intelligence has become a new focus of global technological and industrial competition. However, there has been no review that comprehensively analyzes the current state of embodied intelligence research. Therefore, Pengcheng Laboratory's Institute of Multi-Agent and Embodied Intelligence, together with researchers from Sun Yat-sen University's HCP Laboratory, conducted a comprehensive analysis of the latest progress in embodied intelligence and released the world's first review of embodied intelligence in the era of multimodal large models. The review surveys nearly 400 papers and analyzes embodied intelligence research from multiple dimensions. It first introduces representative embodied robots and embodied simulation platforms, with an in-depth analysis of their research focus and limitations. It then thoroughly analyzes four main research areas: 1) embodied perception, 2) embodied interaction, 3) embodied agents, and 4) virtual-to-reality transfer, covering state-of-the-art methods, basic paradigms, and comprehensive datasets. Furthermore, the review explores the challenges embodied agents face in digital space and the physical world, emphasizing the importance of active interaction in dynamic digital and physical environments. Finally, it summarizes the challenges and limitations of embodied intelligence and discusses potential future directions. The review aims to provide a basic reference for embodied intelligence research and promote related technological innovation.
In addition, the review maintains an embodied intelligence paper list on GitHub; related papers and code repositories will be continuously updated, so stay tuned. Paper address: https://arxiv.org/pdf/2407.06886
1. The Past and Present of Embodied Intelligence
The concept of embodied intelligence was first proposed by Alan Turing in the Embodied Turing Test he laid out in 1950, which asks whether an intelligent agent can show not only the intelligence to solve abstract problems in a virtual environment (digital space), but also the ability to cope with the complexity and unpredictability of the physical world. (Intelligent agents are the foundation of embodied intelligence; they exist in digital space and the physical world and take the form of various entities, including not only robots but also other devices.) The development of embodied intelligence is therefore regarded as a fundamental path to general artificial intelligence, making it particularly important to examine its complexity, assess its current state of development, and consider its future trajectory. Today, embodied intelligence spans multiple key technologies such as computer vision, natural language processing, and robotics. The most representative research areas are
embodied perception, embodied interaction, embodied agents, and virtual-to-reality transfer. In embodied tasks, agents must fully understand human intent in language instructions, proactively explore their surroundings, comprehensively perceive multimodal elements from virtual and physical environments, and perform appropriate operations to complete complex tasks. Rapidly advancing multimodal large models show greater diversity, flexibility, and generalization than traditional deep reinforcement learning methods in complex environments. Visual representations pretrained with state-of-the-art visual encoders provide precise estimates of object categories, poses, and geometries, enabling embodied models to perceive complex and dynamic environments comprehensively. Powerful large language models enable robots to better understand human language instructions and offer a feasible way to align visual and linguistic representations for embodied robots. World models demonstrate significant simulation capability and a good understanding of physical laws, enabling embodied models to fully understand physical and real environments. These advances enable embodied intelligence to comprehensively perceive complex environments, interact naturally with humans, and perform tasks reliably. The figure below shows the typical architecture of an embodied agent.
Embodied Agent Framework
In this review, we provide a comprehensive overview of the current progress of embodied intelligence, including: (1) Embodied robots: the hardware carriers of embodied intelligence in the physical world; (2) Embodied simulation platforms: digital spaces for training embodied intelligence efficiently and safely; (3) Embodied perception: actively perceiving and synthesizing multiple sensory modalities in 3D space; (4) Embodied interaction: interacting with the environment effectively and reasonably, even changing the environment, to complete specified tasks; (5) Embodied agents: using multimodal large models to understand abstract instructions, decompose them into a series of subtasks, and complete them step by step; (6) Virtual-to-reality transfer: transferring and generalizing skills learned in digital space to the physical world. The figure below shows the system framework of embodied intelligence from digital space to the physical world. This review aims to provide comprehensive background knowledge, research trends, and technical insights on embodied intelligence.
The overall architecture of this review
2. Embodied Robots
Embodied intelligence takes many physical forms, including robots, smart home appliances, smart glasses, and self-driving vehicles, among others. Among these, robots, as one of the most prominent embodied forms, have attracted much attention. Depending on the application scenario, robots are designed in various forms to make full use of their hardware features for specific tasks. As shown in the figure below, embodied robots can generally be divided into: (1) fixed-base robots, such as robotic arms, often used in laboratory automated synthesis, education, industry, and other fields; (2) wheeled robots, known for efficient mobility and widely used in logistics, warehousing, and security inspection; (3) tracked robots, with strong off-road capability and mobility, showing potential in agriculture, construction, and disaster response; (4) quadruped robots, known for stability and adaptability, ideal for inspection in complex terrain, rescue missions, and military applications; (5) humanoid robots, with dexterous hands as their key feature, widely used in the service industry, healthcare, and collaborative environments; (6) bionic robots, which perform tasks in complex and dynamic environments by mimicking the effective movements and functions of natural organisms.
Different forms of embodied robots
3. Embodied Simulation Platforms
Embodied simulation platforms are crucial to embodied intelligence: they provide a cost-effective means of experimentation; ensure safety by simulating potentially dangerous scenarios; scale to testing in a variety of environments; enable rapid prototyping; make research accessible to a wider community; offer controlled environments for precise study; generate data for training and evaluation; and provide standardized benchmarks for algorithm comparison. For an agent to interact with its environment, a realistic simulated environment must be constructed, which requires considering the physical characteristics of the environment, the properties of objects, and their interactions. As shown in the figure below, this review analyzes two kinds of simulation platforms: general platforms based on underlying simulation and simulation platforms based on real scenes.
Simulation platform based on real scenes
4. Embodied Perception
The "North Star" of future visual perception is embodiment-centered visual reasoning and social intelligence. As shown in the figure below, rather than merely recognizing objects in images, an agent with embodied perception must move through the physical world and interact with the environment, which requires a deeper understanding of three-dimensional space and dynamic environments. Embodied perception requires visual perception and reasoning capabilities: understanding three-dimensional relationships in a scene, and predicting and performing complex tasks based on visual information. This review covers active visual perception, 3D visual grounding, vision-language navigation, and non-visual perception (tactile sensing), among others.
Active visual perception framework
5. Embodied Interaction
Embodied interaction refers to scenarios in which agents interact with humans and the environment in physical or digital space. Typical embodied interaction tasks include embodied question answering and embodied grasping. As shown in the figure below, in embodied question answering the agent explores the environment from a first-person perspective to gather the information needed to answer a question. An agent with autonomous exploration and decision-making capabilities must decide not only which actions to take to explore the environment, but also when to stop exploring and answer the question.
Embodied question answering framework
Beyond question answering, interaction also involves performing operations based on human instructions, such as grasping and placing objects, thereby completing interactions among the agent, humans, and objects. As shown in the figure, embodied grasping requires comprehensive semantic understanding, scene awareness, decision-making, and robust control planning.
Embodied grasping methods combine traditional robotic kinematic grasping with large models (such as large language models and vision-language foundation models), enabling agents to perform grasping tasks under multi-sensory perception, including active visual perception, language understanding, and reasoning.
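As a rough illustration of such a pipeline, the sketch below wires an open-vocabulary detector stub and an LLM-style instruction grounder into a grasp selector. All class and function names here are illustrative placeholders, not APIs from the survey or any particular library:

```python
# Minimal sketch of a language-guided grasping pipeline. The perception and
# language components are stand-ins (hypothetical), so the control flow can
# be demonstrated without a robot or a pretrained model.

from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    position: tuple      # (x, y, z) in the robot's base frame
    grasp_score: float   # confidence that a stable grasp exists


def detect_objects(rgb_image, depth_image):
    """Stand-in for a vision-language detector (e.g. an open-vocabulary model)."""
    # A real system would run a pretrained visual encoder here; we return
    # fixed detections for illustration.
    return [
        Detection("mug", (0.4, 0.1, 0.02), 0.91),
        Detection("apple", (0.3, -0.2, 0.03), 0.85),
    ]


def parse_instruction(instruction, labels):
    """Stand-in for LLM-based grounding: pick the referenced object label."""
    for label in labels:
        if label in instruction.lower():
            return label
    return None


def plan_grasp(instruction, rgb_image=None, depth_image=None):
    detections = detect_objects(rgb_image, depth_image)
    target = parse_instruction(instruction, [d.label for d in detections])
    if target is None:
        return None  # instruction refers to nothing in view: ask or explore
    # Choose the highest-confidence grasp on the referenced object.
    candidates = [d for d in detections if d.label == target]
    return max(candidates, key=lambda d: d.grasp_score)


grasp = plan_grasp("please pick up the mug")
print(grasp.label, grasp.position)
```

In an actual system, the detector and grounder would be learned models, and the selected grasp would be handed to a motion planner rather than printed.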
Language-guided interactive grasping framework
6. Embodied Agents
An agent is defined as an autonomous entity capable of sensing its environment and taking actions to achieve specific goals. Recent advances in multimodal large models have further expanded the application of agents in real-world scenarios. When these multimodal-large-model-based agents are embodied in physical entities, they can effectively transfer their capabilities from virtual space to the physical world, becoming embodied agents. To operate in the information-rich and complex real world, embodied agents have been developed with powerful multimodal perception, interaction, and planning capabilities. As shown in the figure below, to complete a task, an embodied agent usually involves the following processes:
(1) Decomposing abstract and complex tasks into concrete subtasks, i.e., high-level embodied task planning. (2) Gradually implementing these subtasks by effectively utilizing embodied perception and embodied interaction models, or by utilizing the policy functions of foundation models, which is called low-level embodied action planning.
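The two-level process above can be sketched as a minimal loop, with stand-ins for the LLM task planner and the action policy (all names here are hypothetical, for illustration only):

```python
# Minimal sketch of hierarchical embodied planning: a high-level planner
# decomposes the task, a low-level executor carries out each subtask, and
# execution feedback is collected for potential replanning.

def high_level_plan(task):
    """Stand-in for LLM-based task decomposition."""
    plans = {
        "clean the table": ["pick up cup", "place cup in sink", "wipe table"],
    }
    return plans.get(task, [task])  # unknown tasks pass through unchanged


def low_level_execute(subtask, env_state):
    """Stand-in for an action policy; returns feedback for the planner."""
    env_state.append(subtask)  # record the executed primitive
    return {"subtask": subtask, "success": True}


def run_agent(task):
    env_state = []   # trace of executed primitives
    feedback = []    # per-subtask results fed back to the planner
    for subtask in high_level_plan(task):
        result = low_level_execute(subtask, env_state)
        feedback.append(result)
        if not result["success"]:
            break    # on failure, the planner would replan from here
    return env_state, feedback


history, feedback = run_agent("clean the table")
print(history)
```

The key design point mirrored here is the feedback path: the low-level executor reports success or failure upward, so the high-level planner can adjust the remaining plan rather than executing it open-loop.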
It is worth noting that task planning involves thinking before acting and is therefore usually considered in digital space. In contrast, action planning must account for effective interaction with the environment and feed this information back to the task planner to adjust the plan. It is therefore crucial for embodied agents to align and generalize their capabilities from digital space to the physical world.
Embodied agent framework based on multimodal large models
7. Virtual-to-Reality Transfer
Sim-to-real adaptation refers to transferring capabilities or behaviors learned in a simulated environment (digital space) to the real world (physical world). The process includes validating and improving the effectiveness of algorithms, models, and control strategies developed in simulation, to ensure they perform stably and reliably in physical environments. Embodied world models, data collection and training methods, and embodied control algorithms are three key elements of sim-to-real adaptation. The figure below shows five different sim-to-real paradigms.
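One widely used sim-to-real training technique is domain randomization: physical parameters of the simulator are resampled each episode so that a policy trained in simulation learns to tolerate real-world variation. The sketch below illustrates the idea; the parameter names and ranges are illustrative, not taken from the survey:

```python
# Minimal sketch of domain randomization for sim-to-real transfer.
# Each training episode samples a fresh set of simulator parameters.

import random


def randomize_sim_params(rng):
    """Sample one episode's simulator configuration (illustrative ranges)."""
    return {
        "friction": rng.uniform(0.5, 1.5),
        "object_mass": rng.uniform(0.1, 0.5),        # kg
        "camera_noise_std": rng.uniform(0.0, 0.02),
        "light_intensity": rng.uniform(0.6, 1.4),
    }


def collect_training_episodes(n_episodes, seed=0):
    rng = random.Random(seed)  # seeded for reproducible experiments
    episodes = []
    for _ in range(n_episodes):
        params = randomize_sim_params(rng)
        # A real pipeline would reset the simulator with `params` and roll
        # out the current policy; here we just record the sampled settings.
        episodes.append(params)
    return episodes


for ep in collect_training_episodes(3):
    print(ep)
```

A policy that performs well across this distribution of simulated worlds is more likely to treat the real world as just another sample from the distribution, narrowing the sim-to-real gap.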
Five virtual-to-reality transfer paradigms
8. Challenges and Future Directions
Despite rapid progress, embodied intelligence still faces several challenges and presents exciting future directions:
(1) High-quality robot datasets. Obtaining sufficient real-world robot data remains a significant challenge: collecting such data is time-consuming and resource-intensive, while relying solely on simulated data exacerbates the sim-to-real gap. Creating diverse real-world robotics datasets requires close and extensive collaboration across institutions, and developing more realistic and efficient simulators is crucial to improving the quality of simulated data. To build general embodied models that work across scenarios and tasks in robotics, it is necessary to construct large-scale datasets that use high-quality simulated data to supplement real-world data.
(2) Effective use of human demonstration data. This means leveraging actions and behaviors demonstrated by humans to train and improve robotic systems, which involves collecting, processing, and learning from large-scale, high-quality datasets in which humans perform the tasks the robot needs to learn. It is therefore important to effectively combine large amounts of unstructured, multi-label, multimodal human demonstration data with action-labeled data to train embodied models that can learn a variety of tasks in a relatively short time. By efficiently leveraging human demonstrations, robotic systems can achieve higher performance and adaptability, and thus better perform complex tasks in dynamic environments.
(3) Complex environment cognition. This refers to the ability of embodied agents to perceive, understand, and navigate complex real-world environments, whether physical or virtual. For unstructured open environments, current work usually relies on the task-decomposition ability of pretrained LLMs, leveraging extensive common-sense knowledge for simple task planning, but lacks specific scene understanding. Enhancing knowledge transfer and generalization in complex environments is critical: a truly versatile robotic system should understand and execute natural language instructions across diverse, unseen scenarios, which requires developing adaptable and scalable embodied agent architectures.
(4) Long-horizon task execution. A single command often implies a long-horizon task: an instruction like "clean the kitchen" involves rearranging items, sweeping the floor, wiping the table, and so on. Completing such tasks requires the robot to plan and execute a sequence of low-level actions over an extended period. Although current high-level task planners have shown initial success, they often fall short in diverse scenarios because they lack adaptation to embodied tasks. Addressing this challenge requires efficient planners with strong perceptual capabilities and extensive common-sense knowledge.
(5) Causal relationship discovery. Existing data-driven embodied agents make decisions based on correlations within the data. This kind of modeling cannot give a model a true understanding of the causal relationships among knowledge, behavior, and environment, leading to biased strategies that are difficult to operate in an interpretable, robust, and reliable manner in real-world environments. Embodied intelligence therefore needs to be driven by world knowledge and to possess autonomous causal reasoning capabilities.
(6) Continual learning. In robotics applications, continual learning is crucial for deploying robot learning policies in diverse environments, yet this area remains underexplored. While some recent research has explored subtopics such as incremental learning, rapid motor adaptation, and human-in-the-loop learning, these solutions are usually designed for a single task or platform and have not yet considered foundation models. Open research questions and possible approaches include: 1) mixing different proportions of earlier data distributions when fine-tuning on the latest data, to mitigate catastrophic forgetting; 2) developing efficient prototypes from earlier distributions or curricula for inferring new tasks; 3) improving the training stability and sample efficiency of online learning algorithms; 4) identifying principled ways to seamlessly integrate large-capacity models into control frameworks, possibly through hierarchical learning or slow-fast control, to achieve real-time inference.
(7) Unified evaluation benchmarks. Although many benchmarks exist for evaluating low-level control policies, they often differ significantly in the skills they assess, and the objects and scenes they include are often limited by the simulator. To fully evaluate embodied models, benchmarks covering multiple skills in realistic simulators are needed. For high-level task planning, many benchmarks assess planning ability through question-answering tasks; however, especially for long-horizon tasks, a more ideal approach would be to comprehensively evaluate the execution capability of the high-level task planner together with the low-level control policies and measure success rates, rather than relying on evaluating the planner alone. Such a comprehensive approach gives a fuller assessment of an embodied intelligent system's capabilities.
In short, embodied intelligence enables agents to perceive and interact with diverse objects in digital space and the physical world, demonstrating its importance for realizing general artificial intelligence. This review provides a comprehensive survey of embodied robots, embodied simulation platforms, embodied perception, embodied interaction, embodied agents, virtual-to-reality robot control, and future research directions, aiming to advance the development of embodied intelligence.
About Pengcheng Laboratory's Institute of Multi-Agent and Embodied Intelligence
The Institute of Multi-Agent and Embodied Intelligence under Pengcheng Laboratory has gathered dozens of experts in intelligence science and robotics.
Its leading young scientists, relying on independently controllable AI infrastructure such as the Pengcheng Cloud Brain and the China Computing Network, are committed to building general foundational platforms such as multi-agent collaboration and simulation training platforms and cloud-collaborative embodied multimodal large models, to serve major application needs in the industrial internet, social governance, services, and beyond.