


How far has the "embodied intelligence" that Fei-Fei Li focuses on come?
In 2009, Fei-Fei Li, then a computer scientist at Princeton University, led the construction of a dataset that changed the history of artificial intelligence: ImageNet. It contains millions of labeled images that can be used to train complex machine learning models to recognize objects in images.
In 2015, machines' recognition capabilities surpassed those of humans. Fei-Fei Li soon turned to a new goal: finding what she calls another "North Star" (here, a "North Star" is a key scientific problem that researchers focus on solving, one that can inspire their research enthusiasm and lead to breakthrough progress).
She found inspiration by looking back 530 million years to the Cambrian explosion, when many animal lineages first appeared. One influential theory suggests that this burst of new species was driven in part by the emergence of eyes, which allowed creatures to see the world around them for the first time. Fei-Fei Li believes that animal vision did not arise in isolation, but is "deeply embedded in a whole that needs to move, navigate, survive, manipulate and change in a rapidly changing environment," she said. "So it was natural for me to turn to a more active field of AI."
Today, Fei-Fei Li's work focuses on AI agents that do more than take in static images as data: they can move around in simulated three-dimensional virtual worlds and interact with their surroundings.
This is the broad goal of a new field called "embodied AI." It overlaps with robotics, in that robots can be viewed as the physical equivalent of embodied AI agents, and with reinforcement learning in the real world. Fei-Fei Li and others believe that embodied AI may bring a major transformation: from simple machine-learning abilities such as recognizing images to learning how to perform complex, human-like tasks that take multiple steps, such as making an omelet.
Today, embodied AI work covers any agent that can sense and modify its own environment. In robotics, the AI agent always inhabits a robot body, while an agent in a simulation may have a virtual body, or may perceive the world through a movable camera and interact with its surroundings that way. "The meaning of embodiment is not the body itself, but the overall needs and functions of interacting with the environment and doing things in the environment," Fei-Fei Li explained.
This interactivity gives agents a new, and in many cases better, way to understand the world. It is the difference between merely observing a possible relationship between two objects and being able to experiment and bring that relationship about yourself. With this new understanding, ideas can be put into practice, and greater intelligence follows. With a new set of virtual worlds up and running, embodied AI agents have begun to realize this potential, making significant progress in their new environments.
"Right now, we don't have any evidence for the existence of intelligence that doesn't learn by interacting with the world," said Viviane Clay, an embodied AI researcher at Osnabrück University in Germany.
Towards Perfect Simulation
Although researchers have long wanted to create realistic virtual worlds for AI agents to explore, such worlds have only existed for about five years. The capability came from improvements in graphics in the film and video game industries. By 2017, AI agents could inhabit realistic depictions of interior spaces: virtual, but literal, "homes." Computer scientists at the Allen Institute for Artificial Intelligence built a simulator called AI2-THOR that lets agents roam naturalistic kitchens, bathrooms, living rooms, and bedrooms. Agents can learn from three-dimensional views that change as they move, with the simulator showing new angles when they decide to take a closer look.
This new kind of world also gives agents the opportunity to reason about change along a new dimension: time. "That's a big change," said Manolis Savva, a computer graphics researcher at Simon Fraser University. "In an embodied AI setting, you have these temporally coherent streams of information that you can control."
These simulated worlds are now good enough to train agents on entirely new tasks. Rather than merely recognizing an object, agents can interact with it, pick it up, and navigate around it. These seemingly small steps are necessary for any agent to understand its environment. In 2020, virtual agents gained the ability to go beyond vision and hear the sounds that virtual objects make, offering a new perspective on understanding objects and how they operate in the world.
Embodied AI agents running in a virtual world (here, the ManipulaTHOR environment) learn in a different way and may be better suited to more complex, human-like tasks.
However, simulators have their own limitations. "Even the best simulators are far less realistic than the real world," said Daniel Yamins, a computer scientist at Stanford University. With colleagues at MIT and IBM, Yamins co-developed ThreeDWorld, a project focused on simulating real-life physics in virtual worlds, such as the behavior of liquids and the way some objects are rigid in one area and flexible in another.
This is a very challenging task that requires AI to learn in new ways.
Comparison with Neural Networks
So far, a simple way to measure the progress of embodied AI is to compare the performance of embodied agents with algorithms trained on simpler, static-image tasks. The researchers note that these comparisons are not perfect, but early results do suggest that embodied AI agents learn differently, and sometimes better, than their predecessors.
In a recent paper ("Interactron: Embodied Adaptive Object Detection"), researchers found that an embodied AI agent was more accurate at detecting specific objects, nearly 12% better than traditional methods. "It took the object detection field more than three years to achieve this level of improvement," said study co-author Roozbeh Mottaghi, a computer scientist at the Allen Institute for Artificial Intelligence. "And we achieved that much progress just by interacting with the world."
Other papers have shown that algorithms improve when they take an embodied form and explore a virtual space, or when they move around an object to collect multiple views of it.
Researchers have also found that embodied and traditional algorithms learn in completely different ways. To see this, consider neural networks, the fundamental ingredient behind the learning capabilities of every embodied algorithm and many disembodied ones. A neural network is made up of many layers of connected artificial-neuron nodes, loosely modeled on networks in the human brain. In two separate papers, researchers found that in the neural networks of embodied agents, fewer neurons respond to any given visual input, meaning each individual neuron is more selective in its responses. Disembodied networks are much less efficient, requiring many more neurons to remain active most of the time. One research team (led by Grace Lindsay, an incoming professor at NYU) even compared embodied and disembodied neural networks with neuronal activity in a living brain (the visual cortex of mice) and found that the embodied networks were the closest match.
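The selectivity contrast described above can be made concrete with a toy sketch. This is not the analysis from either paper; the activation matrices below are synthetic, and "response sparsity" is used here only as a crude proxy for per-neuron selectivity:

```python
import numpy as np

rng = np.random.default_rng(0)

def response_sparsity(acts: np.ndarray) -> float:
    """Fraction of (unit, stimulus) responses that are inactive (<= 0).

    A higher value means fewer neurons respond to any given input,
    i.e. individual units are more selective.
    """
    return float(np.mean(acts <= 0.0))

# Hypothetical "embodied-like" layer: each of 64 units responds
# to only ~10% of 200 stimuli.
selective = np.where(rng.random((64, 200)) < 0.1, 1.0, 0.0)

# Hypothetical "disembodied-like" layer: most units are active
# on most stimuli.
dense = np.where(rng.random((64, 200)) < 0.8, 1.0, 0.0)

print(response_sparsity(selective) > response_sparsity(dense))  # True
```

With these synthetic layers, the "embodied-like" network leaves roughly 90% of unit-stimulus responses silent, while the "disembodied-like" network keeps most units firing, mirroring the qualitative finding in the text.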
Lindsay is quick to point out that this doesn't necessarily mean the embodied versions are better; they're just different. Unlike the object detection paper, Lindsay and colleagues' study compared the underlying differences of the same neural network architecture when the agents complete completely different tasks, so they may simply need networks that work differently to accomplish their different goals.
While comparing embodied with disembodied neural networks is one way to measure improvement, what researchers really want is not to improve embodied agents' performance on existing tasks; their real goal is to learn more complex, more human-like tasks. That is what excites researchers most, and they are seeing impressive progress, especially on navigation tasks. In these tasks, an agent must remember the long-term goal of reaching its destination while formulating a plan to get there without getting lost or bumping into objects.
In just a few years, a team led by Dhruv Batra, a research director at Meta AI and a computer scientist at the Georgia Institute of Technology, has made rapid progress on a specific navigation task called "point-goal navigation." In this task, the agent is placed in a completely new environment and must reach a given coordinate (such as "go to the point 5 meters north and 10 meters east") without a map.
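The point-goal task can be sketched in a few lines. This is a toy illustration, not the AI Habitat API: the world here is assumed empty and the sensors noiseless, which is what makes the GPS-and-compass version of the task nearly trivial for a greedy policy; the real benchmark adds obstacles, realistic scenes, and raw pixel input:

```python
import math

def point_goal_navigate(goal, step=0.25, success_radius=0.2, max_steps=200):
    """Toy point-goal navigation in an empty 2-D world.

    With perfect GPS (own position) and a compass (heading), the policy
    is simply: step along the remaining displacement vector to the goal.
    Succeeds if the agent stops within `success_radius` meters.
    """
    x, y = 0.0, 0.0
    for _ in range(max_steps):
        dx, dy = goal[0] - x, goal[1] - y   # GPS + compass reading
        dist = math.hypot(dx, dy)
        if dist <= success_radius:
            return True
        move = min(step, dist)              # take one bounded step toward goal
        x += move * dx / dist
        y += move * dy / dist
    return False

# "Go to the point 5 meters north and 10 meters east" (east = +x, north = +y).
print(point_goal_navigate((10.0, 5.0)))  # True
```

Stripping away the GPS and compass, as in the harder scenario described below, forces the agent to estimate its own position from the pixel stream, which is where the learning problem actually lives.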
Batra said they trained the agent in a Meta virtual world called "AI Habitat" and gave it GPS and a compass; it achieved more than 99.9% accuracy on a standard dataset. More recently, they successfully extended the result to a harder, more realistic scenario without a compass or GPS: the agent achieved 94% accuracy by estimating its position using only the stream of pixels it sees while moving.
The "AI Habitat" virtual world created by Dhruv Batra's team at Meta AI. They hope to increase simulation speed until an embodied AI can accumulate 20 years of simulated experience in just 20 minutes of wall-clock time.
"This is a great improvement, but it does not mean the navigation problem is completely solved," Mottaghi said. Many other navigation tasks use more complex language instructions, such as "go past the kitchen and fetch the glasses from the bedside table in the bedroom," and on those, accuracy is still only about 30% to 40%.
But navigation remains one of the simplest tasks in embodied AI, since the agent does not need to manipulate anything as it moves through the environment. So far, embodied AI agents are far from mastering any object-related tasks. Part of the challenge is that when an agent interacts with new objects, it can make many errors, and the errors can pile up. Currently, most researchers address this problem by choosing tasks with only a few steps, but most human-like activities, such as baking or washing dishes, require long sequences of actions on multiple objects. To achieve this goal, AI agents will need to make even greater advances.
In this regard, Fei-Fei Li may again be at the forefront: her team has developed a simulated dataset, BEHAVIOR, that she hopes will do for embodied AI what her ImageNet project did for object recognition.
The dataset contains more than 100 human activities for agents to complete, and the tests can be run in any virtual environment. By creating metrics that compare agents performing these tasks with real videos of humans performing the same tasks, the new dataset will let the community better assess the progress of virtual AI agents.
Once agents succeed at these complex tasks, Fei-Fei Li believes, the purpose of simulation becomes training for the ultimate operating space: the real world.
"In my opinion, simulation is one of the most important and exciting areas of robotics research," Fei-Fei Li said.
The New Frontier of Robotics Research
Robots are essentially embodied intelligence. They inhabit a physical body in the real world and represent the most extreme form of embodied AI agent. But many researchers have found that even these agents can benefit from training in virtual worlds.
The most advanced algorithms in robotics, such as reinforcement learning, often require millions of iterations to learn something meaningful, Mottaghi said. As a result, training real robots on difficult tasks can take years.
Robots can navigate uncertain terrain in the real world. New research shows that training in virtual environments can help robots master these and other skills.
But training them first in the virtual world is much faster: thousands of agents can train simultaneously in thousands of different rooms. Virtual training is also safer, for both robots and humans.
In 2018, OpenAI researchers demonstrated that skills an agent learns in the virtual world can transfer to the real world, after which many roboticists began paying closer attention to simulators: they trained a robotic hand to manipulate a cube it had only ever seen in simulation. Recent work includes teaching drones to avoid collisions in the air, deploying self-driving cars in urban environments on two different continents, and having a four-legged robot dog complete an hour-long hike in the Swiss Alps in roughly the time it would take a human.
In the future, researchers may also bring humans into virtual spaces through virtual reality headsets, bridging the gap between simulation and the real world. Dieter Fox, senior director of robotics research at Nvidia and a professor at the University of Washington, pointed out that a key goal of robotics research is to build robots that are helpful to humans in the real world. To do that, they must first be exposed to humans and learn how to interact with them.
Using virtual reality technology to put humans into these simulated environments, where they can demonstrate tasks and interact with robots, would be a very powerful approach, Fox said.
Whether in simulation or the real world, embodied AI agents are learning to be more like us and to complete tasks that are more like ours. The field is advancing on every front: new worlds, new tasks, and new learning algorithms.
"I see the fusion of deep learning, robot learning, vision, and even language," Fei-Fei Li said. "Now I think that through this 'moonshot,' this 'North Star' of embodied AI, we will learn the foundational technologies of intelligence that can truly lead to major breakthroughs."
Fei-Fei Li's article discussing the "North Star" problems of computer vision: https://www.amacad.org/publication/searching-computer-vision-north-stars