Recently, the Fudan University Natural Language Processing Team (FudanNLP) released a survey paper on LLM-based Agents. The full text is 86 pages long and cites more than 600 references. Starting from the history of AI agents, the authors comprehensively review the current state of intelligent agents based on large language models, including the background, composition, and application scenarios of LLM-based Agents, as well as the much-discussed agent society. The authors also discuss forward-looking open problems related to agents, which are valuable for the future development of related fields.
- Paper link: https://arxiv.org/pdf/2309.07864.pdf
- LLM-based Agent paper list: https://github.com/WooooDyy/LLM-Agent-Paper-List

Team members will also add a one-sentence summary to each relevant paper. You are welcome to star the repository.

For a long time, researchers have pursued Artificial General Intelligence (AGI) that matches or even exceeds human-level ability. As early as the 1950s, Alan Turing extended the concept of "intelligence" to artificial entities and proposed the famous Turing test. Such artificial entities are commonly called agents (Agent*). The concept of an "agent" originates in philosophy, where it describes an entity that has desires, beliefs, intentions, and the ability to act. In artificial intelligence the term has taken on a new meaning: an intelligent entity characterized by autonomy, reactivity, proactiveness, and social ability.

*There is no consensus on the Chinese translation of the term Agent. Some scholars translate it as intelligent agent, actor, or agent. In this article, "agent" and "intelligent agent" both refer to Agent.
Since then, the design of agents has been a focus of the artificial intelligence community. However, past work has mainly focused on enhancing specific abilities of agents, such as symbolic reasoning or mastery of specific tasks (chess, Go, etc.). These studies focus more on algorithm design and training strategies, while ignoring the development of the inherent general capabilities of the model, such as knowledge memory, long-term planning, effective generalization, and efficient interaction. It turns out that enhancing the inherent capabilities of the model is a key factor in promoting the further development of intelligent agents.
The emergence of large language models (LLMs) brings new hope for the further development of intelligent agents. If the route from NLP to AGI is divided into five levels (corpus, Internet, perception, embodiment, and social attributes), then current large language models have reached the second level, with Internet-scale text input and output. On this basis, giving LLM-based Agents a perception space and an action space would take them to the third and fourth levels. Furthermore, when multiple agents interact and cooperate to solve more complex tasks, or reflect real-world social behaviors, they have the potential to reach the fifth level: the agent society.

Figure: The authors envision a harmonious society composed of intelligent agents in which humans can also participate. The scene is taken from the Sea Lantern Festival in "Genshin Impact".

## What would an intelligent agent powered by a large model look like?

Inspired by Darwin's law of "survival of the fittest", the authors propose a general framework for intelligent agents based on large models. To survive in society, a person must learn to adapt to the environment, which requires cognitive abilities and the capacity to perceive and respond to changes in the outside world. Similarly, the agent framework consists of three parts: the control end (Brain), the perception end (Perception), and the action end (Action).
- Control end: Usually built from LLMs, this is the core of the intelligent agent. It not only stores memory and knowledge but also handles indispensable functions such as information processing and decision-making. It can expose its reasoning and planning process and copes well with unseen tasks, reflecting the agent's generalization and transferability.
- Perception end: Expands the agent's perception space from pure text to multimodal inputs such as text, vision, and hearing, so that the agent can obtain and use information from its surroundings more effectively.
- Action end: Beyond regular text output, the agent is given embodiment and the ability to use tools, so that it can better adapt to environmental changes, interact with the environment through feedback, and even shape the environment.
Figure: The conceptual framework of an LLM-based Agent contains three components: the control end (Brain), the perception end (Perception), and the action end (Action).

The authors use an example to illustrate the workflow of an LLM-based Agent: when a human asks whether it will rain, the perception end converts the instruction into a representation that LLMs can understand. The control end then reasons and plans actions based on the current weather and the weather forecast on the Internet. Finally, the action end responds and hands an umbrella to the human.
By repeating the above process, the intelligent agent can continuously obtain feedback and interact with the environment.
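
The perceive, reason, act cycle described above can be sketched in a few lines. This is only an illustrative toy, not the paper's implementation: the `Agent` class, its method names, and the rule that stands in for the LLM are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Toy agent with the three parts described above (all names are illustrative)."""
    memory: list = field(default_factory=list)

    def perceive(self, raw_input: str) -> dict:
        # Perception end: turn a raw instruction into a structured observation.
        return {"text": raw_input.strip().lower()}

    def think(self, observation: dict, forecast: str) -> str:
        # Control end: a real agent would call an LLM here; a simple rule stands in.
        self.memory.append(observation)
        if "rain" in observation["text"] and forecast == "rain":
            return "hand_umbrella"
        return "reply_no_rain"

    def act(self, decision: str) -> str:
        # Action end: map the decision to an effect on the environment.
        actions = {
            "hand_umbrella": "It will likely rain. Here is an umbrella.",
            "reply_no_rain": "No rain is expected.",
        }
        return actions[decision]

    def step(self, raw_input: str, forecast: str) -> str:
        # One pass through perception, reasoning, and action.
        return self.act(self.think(self.perceive(raw_input), forecast))


agent = Agent()
print(agent.step("Will it rain today?", forecast="rain"))
```

Calling `step` repeatedly with new inputs corresponds to the feedback loop mentioned above; each observation is also appended to `memory` so that later steps could consult it.
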
The control end is the core component of the intelligent agent. The authors introduce its capabilities from five aspects:
Natural language interaction: Language is the medium of communication and carries rich information. Thanks to the strong natural language generation and understanding capabilities of LLMs, intelligent agents can interact with the outside world over multiple rounds in natural language to achieve their goals. Specifically, this can be divided into two aspects:
- High-quality text generation: A large number of evaluation experiments show that LLMs can generate fluent, diverse, novel, and controllable text. Although performance is poor in some individual languages, their overall multilingual ability is good.
- Understanding implied meaning: Beyond its literal content, language may also convey information such as the speaker's intentions and preferences. Understanding what is implied helps agents communicate and cooperate more efficiently, and large models have already shown potential in this regard.
Knowledge: LLMs trained on large corpora can store massive amounts of knowledge. In addition to linguistic knowledge, commonsense knowledge and professional domain knowledge are important components of LLM-based Agents. Although LLMs themselves still suffer from problems such as outdated knowledge and hallucination, existing research shows that these can be alleviated to some extent through knowledge editing or by calling external knowledge bases.
Memory: In the framework of this article, the memory module (Memory) stores the agent's past sequences of observations, thoughts, and actions. Through specific memory mechanisms, agents can effectively reflect on and apply previous strategies, drawing on past experience to adapt to unfamiliar environments. Three methods are commonly used to improve memory capabilities:
- Extending the length limit of the backbone architecture: Improvements that address the inherent sequence length limit of Transformers.
- Summarizing memory (Summarizing): Summarizing the memory to strengthen the agent's ability to extract key details from it.
- Compressing memory (Compressing): Compressing memory with vectors or appropriate data structures to improve retrieval efficiency.
In addition, the memory retrieval method is also very important. Only by retrieving the appropriate content can the agent access the most relevant and accurate information.
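
To make the idea of vector-based memory and retrieval concrete, here is a minimal sketch. It assumes a bag-of-words count vector as a stand-in for a learned embedding and ranks memories by cosine similarity; the `MemoryStore` class and its methods are hypothetical names for this example, not something defined in the paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for a learned embedding: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MemoryStore:
    def __init__(self):
        self.records = []  # list of (text, vector) pairs

    def add(self, text: str) -> None:
        self.records.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 1) -> list:
        # Return the k stored memories most similar to the query.
        q = embed(query)
        ranked = sorted(self.records, key=lambda r: cosine(q, r[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = MemoryStore()
memory.add("the user prefers green tea in the morning")
memory.add("the kitchen light switch is broken")
memory.add("the meeting was moved to friday")
print(memory.retrieve("which tea does the user like"))
```

A practical system would likely swap `embed` for a neural sentence encoder and the linear scan for an approximate nearest-neighbor index, but the retrieve-by-similarity structure stays the same.
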
Reasoning & Planning: Reasoning ability (Reasoning) is crucial for intelligent agents performing complex tasks such as decision-making and analysis. For LLMs specifically, it is elicited by a family of prompting methods represented by Chain-of-Thought (CoT). Planning is a commonly used strategy when facing large challenges. It helps agents organize their thinking, set goals, and identify the steps needed to achieve those goals. In practice, planning can include two steps:
- Plan Formulation: The agent breaks a complex task down into more manageable subtasks. Examples include decomposing once and then executing in sequence, planning and executing step by step, and planning multiple paths and selecting the best one. In scenarios that require professional knowledge, agents can be integrated with domain-specific Planner modules to enhance their capabilities.
- Plan Reflection: After making a plan, the agent can reflect on it and evaluate its strengths and weaknesses. This reflection generally draws on three sources: internal feedback mechanisms, feedback from interacting with humans, and feedback from the environment.
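
The two planning steps above can be sketched as a small loop: decompose once, execute in order, and revise the plan when the environment reports a failure. Everything here is a simplifying assumption for illustration: the lookup table stands in for an LLM planner, and `kettle_environment` is a hypothetical environment that returns feedback.

```python
def formulate_plan(task: str) -> list:
    # Plan formulation: one-shot decomposition. An LLM call would go here;
    # a fixed lookup table stands in for it.
    plans = {"make tea": ["boil water", "steep tea", "pour into cup"]}
    return list(plans.get(task, [task]))


def reflect(plan: list, failed_step: str, feedback: str) -> list:
    # Plan reflection: use environment feedback to insert a repair step
    # before the step that failed.
    index = plan.index(failed_step)
    return plan[:index] + [f"fix: {feedback}"] + plan[index:]


def run(task: str, environment, max_revisions: int = 3) -> list:
    plan = formulate_plan(task)
    done = []
    revisions = 0
    while plan:
        step = plan[0]
        ok, feedback = environment(step, done)
        if ok:
            done.append(plan.pop(0))
        elif revisions < max_revisions:
            plan = reflect(plan, step, feedback)
            revisions += 1
        else:
            raise RuntimeError(f"could not complete step: {step}")
    return done


def kettle_environment(step: str, done: list):
    # Hypothetical environment: boiling fails until the kettle has been filled.
    if step == "boil water" and "fix: fill kettle" not in done:
        return False, "fill kettle"
    return True, ""


print(run("make tea", kettle_environment))
```

The `max_revisions` cap is a design choice of this sketch: it keeps the reflect-and-retry loop from running forever when feedback never resolves the failure.
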
Transferability & Generalization: LLMs with world knowledge endow intelligent agents with strong transfer and generalization capabilities. A good agent is not a static knowledge base; it should also be able to learn dynamically:
- Generalization to unseen tasks: As model scale and training data increase, LLMs have shown remarkable emergent ability to solve unseen tasks. Large models fine-tuned with instructions perform well in zero-shot tests, matching expert models on many tasks.
- In-context Learning: Large models can not only learn by analogy from a small number of in-context examples; this ability also extends to multimodal scenarios beyond text, opening up more possibilities for real-world agent applications.
- Continual Learning: The main challenge of continual learning is catastrophic forgetting, that is, when a model learns a new task it easily loses knowledge of past tasks. Intelligent agents in specialized domains should try to avoid losing general-domain knowledge.
## Perception end: Perception

Humans perceive the world in a multimodal way, so researchers have the same expectation for LLM-based Agents. Multimodal perception can deepen the agent's understanding of its working environment and significantly improve its versatility.

Text input: As this is the most basic ability of LLMs, it is not discussed further here.

Visual input: LLMs themselves have no visual perception and can only understand discrete text. Visual input, however, usually contains a wealth of information about the world, including object properties, spatial relationships, and scene layout. Common methods are:
- Converting visual input into a corresponding text description (Image Captioning): The result can be understood directly by LLMs and is highly interpretable.
- Encoding visual information into representations: A perception module is built following the "visual foundation model + LLMs" paradigm, and alignment operations allow the model to understand content from different modalities. This can be trained end to end.
Auditory input: Hearing is also an important part of human perception. Since LLMs have excellent tool-calling capabilities, an intuitive idea is for the agent to use the LLM as a control hub that calls existing tool sets or expert models in a cascade to perceive audio information. In addition, audio can be represented visually as a spectrogram. Because a spectrogram presents 2D information as a flat image, some visual processing methods can be transferred to the speech domain.

Other input: Real-world information goes well beyond text, vision, and hearing. The authors hope that future intelligent agents will be equipped with richer perception modules, such as touch and smell, to capture richer attributes of target objects. Agents could also sense the temperature, humidity, and brightness of their surroundings and take more environment-aware actions. In addition, agents can be given perception of the broader overall environment through mature perception modules such as lidar, GPS, and inertial measurement units.

## Action end: Action

After the brain has analyzed the situation and made a decision, the agent also needs to act in order to adapt to or change the environment:
Text output: As this is the most basic ability of LLMs, it is not discussed further here.

Tool usage: Although LLMs have excellent knowledge reserves and professional capabilities, they may still face challenges such as robustness problems and hallucinations when dealing with specific problems. Tools, as an extension of the user's capabilities, can help with expertise, factuality, and interpretability. For example, an agent can use a calculator to solve math problems and a search engine to look up real-time information. Tools can also expand the agent's action space. For example, multimodal actions can be obtained by calling expert models for speech generation or image generation. Therefore, how to make agents excellent tool users, that is, how they learn to use tools effectively, is a very important and promising direction.

Currently, the main methods of tool learning are learning from demonstrations and learning from feedback. In addition, meta-learning and curriculum learning can be used to help agents generalize across a variety of tools. Going one step further, intelligent agents can learn to make their own tools, thereby increasing their autonomy and independence.
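
As a rough illustration of tool use, the sketch below routes a structured tool call to a calculator. The call format `{"tool": ..., "input": ...}`, the `TOOLS` registry, and the `dispatch` function are assumptions of this example; in a real agent the LLM would produce the call, and the tool set would be much larger.

```python
import ast
import operator

# Arithmetic operators the calculator tool is allowed to apply.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str):
    # Evaluate +, -, *, / on numeric literals by walking the parsed syntax tree,
    # which avoids calling eval() on model-generated text.
    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)


# Hypothetical tool registry: tool name -> callable.
TOOLS = {"calculator": calculator}


def dispatch(call: dict):
    # Execute a tool call of the form {"tool": name, "input": argument}.
    tool = TOOLS.get(call["tool"])
    if tool is None:
        raise KeyError(f"unknown tool: {call['tool']}")
    return tool(call["input"])


print(dispatch({"tool": "calculator", "input": "12 * (3 + 4)"}))
```

Adding a new tool then amounts to registering another callable in `TOOLS`, which is one simple way to expand the agent's action space.
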
Embodied action: Embodiment refers to an agent's ability to understand and transform the environment and to update its own state while interacting with that environment. Embodied Action is regarded as the bridge between virtual intelligence and physical reality. Traditional reinforcement-learning-based Agents are limited in sample efficiency, generalization, and complex reasoning, whereas LLM-based Agents draw on the rich intrinsic knowledge of large models, enabling Embodied Agents to actively perceive and influence the physical environment as humans do. Depending on the agent's degree of autonomy in the task or the complexity of the action, there are the following atomic Actions:
- Observation helps the agent locate its own position in the environment, perceive objects, and obtain other environmental information;
- Manipulation completes specific operations such as grasping and pushing;
- Navigation requires the agent to change its position according to the task goal and update its state according to environmental information.
By combining these atomic actions, agents can complete more complex tasks. Consider an embodied QA task such as "Is the watermelon in the kitchen bigger than the bowl?" To answer it, the agent needs to navigate to the kitchen and derive the answer after observing the size of both objects.

Limited by the high cost of physical hardware and the lack of embodied datasets, current research on embodied action is still mainly concentrated in virtual sandbox environments such as the game "Minecraft". Therefore, on the one hand, the authors look forward to task paradigms and evaluation standards that are closer to reality. On the other hand, more exploration is needed on how to build relevant datasets efficiently.

## Agent in Practice: Diverse application scenarios

LLM-based Agents have already demonstrated impressive diversity and strong performance. Familiar applications such as AutoGPT, MetaGPT, CAMEL, and GPT Engineer are developing at an unprecedented rate. Before introducing specific applications, the authors discuss the design principles of Agent in Practice:

1. Help users free themselves from daily tasks and repetitive labor, reducing work pressure and improving task-solving efficiency;
2. Users no longer need to issue explicit low-level instructions; the agent can analyze, plan, and solve problems fully autonomously;
3. After freeing the user's hands, try to free the user's mind: realize the agent's full potential in cutting-edge scientific fields and carry out innovative and exploratory work.

On this basis, the application of agents can follow three paradigms:
Figure: Three application paradigms of LLM-based Agents: single agent, multi-agent, and human-agent interaction.

## Single agent scenario

Intelligent agents that can accept natural language commands from humans and perform daily tasks are currently favored by users and have high practical value. The authors first describe the diverse application scenarios of a single intelligent agent and the corresponding capabilities.

Figure: Single-agent application scenarios: task-oriented, innovation-oriented, and lifecycle-oriented.

In this article, the application of a single intelligent agent is divided into the following three levels:
- In task-oriented deployment, agents help human users handle basic daily tasks. They need basic instruction understanding, task decomposition, and the ability to interact with the environment. Specifically, according to existing task types, practical applications can be divided into simulated web environments and simulated life scenarios.
- In innovation-oriented deployment, agents can demonstrate the potential for independent inquiry in cutting-edge scientific fields. Although the inherent complexity of specialized fields and their lack of training data hinder the construction of intelligent agents, a lot of work is already making progress in fields such as chemistry, materials science, and computer science.
- In lifecycle-oriented deployment, agents can continuously explore, learn, and use new skills in an open world and survive over the long term. In this section, the authors take the game "Minecraft" as an example. Since the survival challenge in the game can be considered a microcosm of the real world, many researchers have used it as a unique platform to develop and test the comprehensive capabilities of agents.
## Multi-agent scenario

Back in 1986, Marvin Minsky made a forward-looking prediction. In The Society of Mind, he proposed a novel theory of intelligence, arguing that intelligence arises from the interaction of many smaller, function-specific agents. For example, some agents may be responsible for identifying patterns, while others may be responsible for making decisions or generating solutions. This idea took concrete form with the rise of distributed artificial intelligence. The Multi-Agent System, one of its main research topics, focuses on how agents can coordinate and collaborate effectively to solve problems. The authors divide the interaction between multiple agents into the following two forms:

Figure: Two forms of interaction in multi-agent scenarios: cooperative interaction and adversarial interaction.

Cooperative interaction: As the most widely deployed type in practical applications, cooperative agent systems can effectively improve task efficiency and jointly improve decision-making. Specifically, according to the form of cooperation, the authors subdivide cooperative interaction into disordered cooperation and ordered cooperation.
- When all agents freely express their views and opinions and cooperate in no fixed sequence, this is called disordered cooperation.
- When all agents follow certain rules, such as expressing their opinions one by one as in an assembly line, the entire cooperation process is orderly, and this is called ordered cooperation.
Adversarial interaction: Intelligent agents interact in a tit-for-tat manner. Through competition, negotiation, and debate, agents abandon beliefs that may have been wrong and reflect meaningfully on their own behavior or reasoning process, which ultimately improves the response quality of the whole system.

## Human-agent interaction scenario

Human-Agent Interaction, as the name suggests, means that intelligent agents cooperate with humans to complete tasks. On the one hand, the agent's dynamic learning ability needs to be supported by communication; on the other hand, current agent systems still lack interpretability and may raise safety and legality issues, so human regulation and supervision are required.
Figure: Two modes of human-agent interaction: Instructor-Executor mode vs. Equal Partnership mode.

In the paper, the authors divide human-agent interaction into the following two modes:
- Instructor-Executor mode: Humans act as instructors, giving instructions and feedback; agents act as executors, adjusting and optimizing step by step according to the instructions. This mode has been widely used in education, medicine, business, and other fields.
- Equal Partnership mode: Some studies have observed that agents can show empathy when communicating with humans, or take part in task execution as equals. Intelligent agents show potential for application in daily life and are expected to be integrated into human society in the future.
## Agent Society: From Individuality to Sociality

For a long time, researchers have dreamed of building an "interactive artificial society". From the sandbox game "The Sims" to the "Metaverse", people's definition of a simulated society can be summarized as: individuals living and interacting in an environment.

Figure: The conceptual framework of the agent society is divided into two key parts: agents and environment.

In the article, the authors use a diagram to describe the conceptual framework of the agent society. In this framework, we can see:
- Left side part: At the individual level, agents exhibit a variety of internalized behaviors, such as planning, reasoning, and reflection. In addition, agents exhibit intrinsic personality traits that span cognitive, emotional, and personality dimensions.
- Middle part: A single agent can form a group with other individual agents and jointly exhibit group behaviors such as collaboration.
- Right part: The environment can be in the form of a virtual sandbox environment or a real physical world. Elements of the environment include human actors and various available resources. For a single agent, other agents are also part of the environment.
- Overall interaction: Agents actively participate in the entire interaction process by sensing the external environment and taking actions.
## Social Behavior and Personality of Agents

The article examines agents' performance in society from the perspectives of external behavior and internal personality.

Social behavior: From a social perspective, behavior can be divided into two levels, individual and collective:
- Individual behavior forms the basis for the operation and development of the agent itself. It includes input represented by perception, output represented by action, and the agent's own internalized behavior.
- Collective behavior refers to the behavior that arises when two or more agents interact spontaneously. It includes positive behaviors represented by collaboration, negative behaviors represented by conflict, and neutral behaviors such as conformity and bystanding.
Personality: This includes cognition, emotion, and character. Just as humans gradually develop their traits through socialization, agents also exhibit so-called "human-like intelligence", in which personality is gradually shaped through interaction with groups and environments.
- Cognitive abilities: These cover the process by which agents acquire and understand knowledge. Research shows that LLM-based agents can, in some respects, demonstrate human-like levels of deliberation and intelligence.
- Emotional intelligence: This involves subjective feelings and emotional states, such as joy, anger, and sorrow, as well as the ability to show sympathy and empathy.
- Character portrayal: To understand and analyze the personality characteristics of LLMs, researchers have used mature assessment methods, such as the Big Five personality test and the MBTI test, to explore the diversity and complexity of personality.
## Operating environment of the simulated society

An agent society is composed not only of independent individuals but also of the environment in which they interact. The environment influences how agents perceive, act, and interact. In turn, agents change the state of the environment through their actions and decisions. For an individual agent, the environment includes other autonomous agents, humans, and available resources. Here, the authors explore three types of environments:

Text-based environment: Because LLMs rely primarily on language as their input and output format, text-based environments are the most natural operating platform for agents. Social phenomena and interactions are described in words, and the text environment provides semantic and background knowledge. Agents exist in such textual worlds and rely on textual resources to perceive, reason, and act.
Virtual sandbox environment: In computing, a sandbox is a controlled and isolated environment, often used for software testing and virus analysis. The virtual sandbox environment of the agent society serves as a platform for simulating social interaction and behavior. Its main features include:
- Visualization: The world can be displayed with simple 2D graphical interfaces or even complex 3D modeling, depicting all aspects of the simulated society in an intuitive way.
- Scalability: A variety of scenarios (Web, games, etc.) can be built and deployed for different experiments, giving agents a broad space to explore.
Real physical environment: The physical environment is a tangible environment consisting of actual objects and spaces, in which the agent observes and acts. This environment introduces rich sensory input (visual, auditory, and spatial). Unlike virtual environments, physical spaces place more demands on agent behavior: the agent must be adaptable in the physical environment and generate executable motion control.

The authors give an example to explain the complexity of the physical environment: imagine an intelligent agent operating a robotic arm in a factory. Operating the arm requires precise control of force to avoid damaging objects made of different materials. In addition, the agent needs to navigate the physical workspace and adjust its path in time to avoid obstacles and optimize the arm's trajectory. These requirements increase the complexity and challenge of operating agents in the physical environment.

In the article, the authors argue that a simulated society should be open, persistent, situated, and organized. Openness allows agents to enter and leave the simulated society autonomously; persistence means that the society has a coherent trajectory that develops over time; situatedness emphasizes that subjects exist and operate in a specific environment; organization ensures that the simulated society has rules and constraints like those of the physical world.

As for the significance of simulated societies, Stanford University's Generative Agents town provides a vivid example. An agent society can be used to explore the boundaries of group intelligence, as when the agents jointly organized a Valentine's Day party; it can also be used to accelerate social science research, for example by simulating social networks to observe communication phenomena. In addition, some studies explore the values behind agents by simulating ethical decision-making scenarios, and others assist decision-making by simulating the impact of policies on society.

Further, the authors point out that these simulations may also carry certain risks, including but not limited to: harmful social phenomena; stereotypes and prejudice; privacy and security issues; over-reliance and addiction.
## Prospective open questions

At the end of the paper, the authors also discuss some forward-looking open questions to prompt readers' thinking:

- How can research on intelligent agents and on large language models promote each other and develop together? Large models have shown strong potential in language understanding, decision-making, and generalization, and have become a key part of building agents. Progress in agents in turn places higher demands on large models.
- What challenges and concerns will LLM-based Agents bring? Whether intelligent agents can truly be deployed requires rigorous security assessment to avoid harm to the real world. The authors summarize further potential threats, such as illegal abuse, the risk of unemployment, and the impact on human well-being.
- What opportunities and challenges will scaling up bring? In a simulated society, increasing the number of individuals can significantly improve the credibility and authenticity of the simulation. However, as the number of agents increases, communication and message propagation become quite complex, and information distortion, misunderstanding, or hallucination can significantly reduce the efficiency of the entire simulation system.
- Is an LLM-based Agent the appropriate path to AGI? This is debated online. Some researchers believe that large models represented by GPT-4 have been trained on sufficient corpora, and that agents built on them have the potential to become the key that opens the door to AGI. Other researchers believe that auto-regressive language modeling does not show real intelligence because such models only respond, and that a more complete modeling approach, such as a World Model, is what can lead to AGI.
- The evolution of swarm intelligence. Swarm intelligence is a process of gathering the opinions of many and converting them into decisions. However, will true "intelligence" emerge simply by increasing the number of agents? And how can individual agents be coordinated so that a society of intelligent agents overcomes "groupthink" and individual cognitive biases?
- Agent as a Service (AaaS). Since LLM-based Agents are more complex than the large models themselves and are harder for small and medium-sized enterprises or individuals to build locally, cloud vendors can consider offering intelligent agents as a service, that is, Agent-as-a-Service. Like other cloud services, AaaS has the potential to provide users with high flexibility and on-demand self-service.