Everything starts with the birth of ChatGPT...
The once-peaceful NLP community was startled by the sudden arrival of this "monster"! Overnight, the entire NLP field changed dramatically: industry quickly followed suit, capital poured in, and the race to replicate ChatGPT began, while academia fell into a state of confusion... Little by little, everyone started to believe that "NLP is solved!"
However, judging from the still-active NLP academic community and the stream of excellent work that keeps emerging, this is not the case. One could even say that "NLP just got real!"
In the past few months, researchers from Beihang University, Mila, the Hong Kong University of Science and Technology, ETH Zurich, the University of Waterloo, Dartmouth College, the University of Sheffield, the Chinese Academy of Sciences, and other institutions conducted a systematic and comprehensive survey, producing a 110-page paper that lays out the technology chain of the post-ChatGPT era: interaction.
Unlike traditional notions of interaction such as "Human in the Loop (HITL)" or "writing assistants", the interaction discussed in this paper takes a higher and more comprehensive perspective:
Allowing language models (LMs) to interact with external entities, and with themselves, can not only help compensate for the inherent shortcomings of large models, but may also be an important milestone on the road toward AGI!
## What is interaction?
In fact, the concept of "interaction" was not invented by the authors out of thin air. Since the advent of ChatGPT, many papers have been published on new problems in the NLP world, such as:
It is clear that the focus of the NLP research community has gradually shifted from "how to build a model" to "how to build a framework", that is, how to incorporate more entities into the training and inference process of a language model. The most typical example is the well-known Reinforcement Learning from Human Feedback (RLHF), whose basic principle is to let the language model learn from its interaction with humans (feedback) [7]. This idea became the finishing touch of ChatGPT.
Therefore, it is fair to say that "interaction" is one of the most mainstream technical directions for NLP after ChatGPT! The paper defines and systematically deconstructs "Interactive NLP" for the first time and, organized mainly by the dimension of the interaction object, discusses the advantages, disadvantages, and application considerations of the various technical approaches as comprehensively as possible, including:
Therefore, within the interactive framework the language model is no longer just a language model, but a language-based agent that can "observe", "act", and "receive feedback".
When the language model interacts with some object, the authors call it "XXX-in-the-loop", meaning that this object participates in the language model's training or inference process, in the form of a cascade, a loop, feedback, or iteration.
## Letting language models interact with people
Interaction with people can be broken down into three ways:
In addition, to enable scalable deployment, models or programs are often used to simulate human behavior or preferences, that is, learning from human simulation.
In general, the core problem that human interaction aims to solve is alignment: how to make the language model's responses better match users' needs, that is, more helpful, harmless, and well-grounded, so that users have a better experience.
"Use Prompts to Communicate" mainly focuses on the real-time and continuous nature of interaction, that is, it emphasizes the continuous nature of multiple rounds of dialogue. This is consistent with the idea of Conversational AI [8]. That is, through multiple rounds of dialogue, let the user continue to ask questions, so that the response of the language model slowly aligns with the user's preference during the dialogue. This approach usually does not require adjustment of model parameters during the interaction.
"Learning using feedback" is the main way of alignment currently, which is to allow users to give feedback to the language model's response. This feedback can be "good/bad" that describes preferences. ” annotation can also be more detailed feedback in the form of natural language. The model needs to be trained to make these feedbacks as high as possible. A typical example is RLHF [7] used by InstructGPT. It first uses user-labeled preference feedback data for model responses to train a reward model, and then uses this reward model to train a language model with a certain RL algorithm to maximize the reward (as shown below) ).
Training language models to follow instructions with human feedback [7]
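The reward-model step of RLHF is driven by a pairwise preference loss (Bradley-Terry style, as described for InstructGPT [7]). A minimal sketch, where `r_chosen` and `r_rejected` are the scalar scores the reward model assigns to the preferred and dispreferred responses:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """-log sigma(r_chosen - r_rejected): small when the preferred
    response already scores higher than the dispreferred one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# A correctly ordered pair costs little...
low = preference_loss(2.0, -1.0)
# ...while an inverted pair is penalized heavily.
high = preference_loss(-1.0, 2.0)
```

Minimizing this loss over many labeled pairs pushes the reward model to rank responses the way users do; that reward then drives the RL stage.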
"Use configuration to adjust" is a special interaction method that allows users to directly adjust the hyperparameters of the language model (such as temperature), or the cascade mode of the language model, etc. A typical example is Google's AI Chains [9]. Language models with different preset prompts are connected to each other to form a reasoning chain for processing streamlined tasks. Users can drag and drop through a UI to adjust the node connection method of this chain. .
"Learning from human simulation" can promote large-scale deployment of the above three methods, because especially in the training process, using real users is unrealistic. For example, RLHF usually needs to use a reward model to simulate user preferences. Another example is Microsoft Research's ITG [10], which uses an oracle model to simulate user editing behavior.
Recently, Stanford's Professor Percy Liang and colleagues constructed a very systematic evaluation scheme for human-LM interaction: Evaluating Human-Language Model Interaction [11]; interested readers can refer to that paper or the original survey.
## Interacting with knowledge bases
Interaction between a language model and a knowledge base involves three steps:
MineDojo [16]: when a language-model agent encounters a task it does not know how to do, it can look up study materials in a knowledge base and then complete the task with their help.
"Knowledge sources" come in two types: closed corpus knowledge (Corpus Knowledge), such as WikiText [15], and open internet knowledge (Internet Knowledge), such as knowledge obtainable through search engines [14].
"Knowledge Retrieval" is divided into four methods:
## Letting language models interact with models or tools
The main purpose here is to decompose complex tasks, for example breaking a complex reasoning task into several sub-tasks, which is also the core idea of Chain of Thought [17]. Different sub-tasks can be solved by models or tools with different capabilities: computation can be handled by a calculator, retrieval by a retrieval model, and so on. This type of interaction can therefore not only improve the language model's reasoning, planning, and decision-making abilities, but also alleviate limitations such as "hallucination" and inaccurate output. In particular, when a tool is used to perform a specific sub-task, it may affect the external world, for example using the WeChat API to post to Moments; this is called "Tool-Oriented Learning" [2].
In addition, a complex task is sometimes hard to decompose explicitly. In that case, different roles or skills can be assigned to different language models, which then implicitly and automatically form a division of labor as they collaborate and communicate with one another to decompose the task. This type of interaction can not only simplify the solving of complex tasks, but also simulate human society and construct some form of agent society.
The authors group models and tools together mainly because the two are not necessarily distinct categories; for example, a search-engine tool and a retriever model are not essentially different. The distinction the authors actually draw is: "after task decomposition, which kind of object takes on which kind of sub-task".
When a language model interacts with a model or tool, there are three types of operations:
Note: "Thinking" here mainly refers to multi-stage Chain-of-Thought, in which different reasoning steps correspond to different calls to the language model (multiple model runs), rather than a single run that outputs both the thought and the answer at once (single model run), as in vanilla CoT [17].
This partly inherits the formulation of ReAct [18].
Typical work on Thinking includes ReAct [18], Least-to-Most Prompting [19], Self-Ask [20], etc. For example, Least-to-Most Prompting [19] first decomposes a complex problem into several simpler sub-problems, then iteratively calls the language model to solve them one by one.
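The Least-to-Most [19] control flow, decompose first, then one model call per sub-problem with earlier answers fed forward, can be sketched as below. `lm_decompose` and `lm_solve` are toy deterministic stand-ins for two differently prompted calls to the same language model.

```python
# Sketch of Least-to-Most Prompting: multiple model runs, one per reasoning
# step, with each step conditioned on the answers accumulated so far.

def lm_decompose(problem):
    """Toy decomposition prompt: one sub-problem per clause."""
    return [p.strip() for p in problem.split(",")]

def lm_solve(subproblem, context):
    """Toy solving prompt: 'answers' a sub-problem given prior answers."""
    return f"answer({subproblem} | {len(context)} prior answers)"

def least_to_most(problem):
    answers = []
    for sub in lm_decompose(problem):  # each iteration is a separate LM call
        answers.append(lm_solve(sub, answers))
    return answers

steps = least_to_most("count the apples, then divide by two")
```

The key contrast with vanilla CoT is visible in the loop: each reasoning step is its own model invocation rather than one long generation.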
Typical work on Acting includes ReAct [18], HuggingGPT [21], Toolformer [22], etc. For example, Toolformer [22] processes the language model's pre-training corpus into a form annotated with tool-use prompts, so that the trained model can automatically call the right external tool (such as a search engine, translation tool, time tool, or calculator) at the right time while generating text, to solve specific sub-problems.
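The execution side of this pattern, detecting API-call markers in generated text, running the tool, and splicing the result back in, can be sketched as follows. The `[Tool(arg)]` marker syntax and the `tools` registry are illustrative assumptions, not Toolformer's exact format.

```python
import re

# Sketch of Toolformer-style tool execution at decoding time: scan generated
# text for call markers, run the named tool, and substitute its output.

tools = {
    # Toy calculator: arithmetic expressions only, builtins disabled.
    "Calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "Upper": lambda s: s.upper(),
}

CALL = re.compile(r"\[(\w+)\(([^)]*)\)\]")

def execute_tool_calls(generated_text):
    """Replace every [Tool(arg)] marker with that tool's output."""
    def run(match):
        name, arg = match.group(1), match.group(2)
        return tools[name](arg)
    return CALL.sub(run, generated_text)

out = execute_tool_calls("The total is [Calculator(2+3)] items, says [Upper(bob)].")
```

In Toolformer the model learns *where* to emit such calls from its annotated pre-training corpus; this sketch only shows the surrounding plumbing.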
Collaborating mainly includes:
Generative Agents: Interactive Simulacra of Human Behavior, https://arxiv.org/pdf/2304.03442.pdf
## Interacting with the environment
Language models and environments belong to two different quadrants: the language model is built on abstract text symbols and is good at high-level tasks such as reasoning, planning, and decision-making, while the environment is built on concrete sensory signals (such as visual and auditory information) and naturally hosts low-level happenings such as observations, feedback, and state transitions (for example, an apple falls to the ground in the real world, or a "creeper" appears in front of you in a simulation engine).
Therefore, enabling the language model to interact with the environment effectively and efficiently involves two main lines of effort:
The most typical case of Modality Grounding is the vision-language model. Generally speaking, it can be carried out with a single-tower model such as OFA [28], a two-tower model such as BridgeTower [29], or interaction between a language model and a visual model, as in BLIP-2 [30]. We will not go into more detail here; readers can refer to the paper.
Affordance Grounding involves two main considerations: given a task, how to perform (1) scene-scale perception and (2) possible actions. For example:
In the scene above, for the given task "please turn off the lights in the living room", scene-scale perception requires us to find all the lights framed in red, rather than selecting the green-framed light that is in the kitchen, not the living room; possible actions require us to determine feasible ways to turn each light off: a cord-pull lamp requires a "pull" action, while a switched light requires a "toggle switch" action.
Generally speaking, Affordance Grounding can be solved with an environment-dependent value function, as in SayCan [31], or with a dedicated grounding model such as Grounded Decoding [32]. It can even be solved by interacting with people, models, tools, and so on (as shown below).
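The SayCan [31] combination, ranking candidate skills by the product of the language model's task-relevance score and a value function's feasibility estimate, can be sketched in a few lines. Both score tables below are invented toy numbers, not outputs of real models.

```python
# Sketch of SayCan-style affordance grounding: the LM says what is *useful*,
# the value function says what is *possible* in the current scene, and the
# chosen skill maximizes the product of the two.

# p(skill | instruction) from the language model (toy, fixed numbers).
lm_score = {"pull cord": 0.5, "toggle switch": 0.4, "open fridge": 0.1}

# Value function: feasibility of each skill in the current scene (toy).
affordance = {"pull cord": 0.1, "toggle switch": 0.9, "open fridge": 0.8}

def select_skill(lm_score, affordance):
    """Pick the skill maximizing lm_score * affordance."""
    return max(lm_score, key=lambda s: lm_score[s] * affordance[s])

chosen = select_skill(lm_score, affordance)
```

Note how the LM alone would pick "pull cord", but grounding in the scene's affordances (no cord is reachable here) shifts the choice to "toggle switch".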
Inner Monologue [33]
In the Interaction Interface chapter of the paper, the authors systematically discuss the usage, advantages, and disadvantages of different interaction languages and interaction media, including:
The paper also discusses a variety of interaction methods comprehensively, in detail, and systematically, including:
Due to space limitations, this article does not detail other aspects such as evaluation, applications, ethics, safety, and future directions. These topics nonetheless occupy 15 pages of the paper, so readers are encouraged to consult the original for more detail. An outline of these contents follows:
## Evaluation of interaction
The discussion of evaluation in the paper mainly involves the following keywords:
## Main applications of interactive NLP
## Ethics and safety
The paper discusses the impact of interactive language models on education, as well as ethical and safety issues such as social bias and privacy.
## Future directions and challenges
The above is the detailed content of "What else can NLP do? Beihang University, ETH, the Hong Kong University of Science and Technology, the Chinese Academy of Sciences, and other institutions jointly released a hundred-page paper to systematically explain the post-ChatGPT technology chain."