


Why does GPT-driven In-Context Learning work? The model secretly performs gradient descent
Following BERT, researchers recognized the potential of large-scale pre-trained models, and a variety of pre-training tasks, model architectures, and training strategies have since been proposed. However, BERT-style models typically suffer from two major shortcomings: over-reliance on labeled data, and a tendency to overfit.
Specifically, current language models usually follow a two-stage framework: pre-training followed by fine-tuning on downstream tasks. Fine-tuning requires a large number of labeled samples to work well, yet labeling data is expensive. With only limited labeled data, the model can merely fit the training distribution, and too little data easily leads to overfitting, which reduces the model's generalization ability.
Large-scale pre-trained language models, above all GPT-3, have shown surprising in-context learning (ICL) capabilities. Unlike fine-tuning, which requires additional parameter updates, ICL needs only a few demonstration "input-label" pairs, and the model can then predict labels even for unseen inputs. On many downstream tasks, a large GPT model achieves quite good performance, sometimes even surpassing smaller models fine-tuned with supervision.
Why does ICL perform so well? In the 70-plus-page paper "Language Models are Few-Shot Learners", OpenAI explored ICL with the goal of letting GPT-3 solve problems with less domain data and without any fine-tuning.
As shown in the figure below, ICL comes in three settings: few-shot learning, where the input contains several examples plus a task description; one-shot learning, where the input contains exactly one example plus a task description; and zero-shot learning, where no examples are allowed and only a task description is given. The results show that ICL requires no backpropagation: placing a handful of labeled samples in the context of the input text is enough to induce GPT-3 to output the answer.
Figure: GPT-3 in-context learning (few-shot, one-shot, zero-shot).
Experiments show that GPT-3 performs very well in the few-shot setting.
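To make the three prompt formats concrete, here is a minimal sketch of how few-shot, one-shot, and zero-shot prompts can be assembled from demonstration "input-label" pairs. The task description, example reviews, and labels below are invented for illustration and are not from the paper; no model parameters are touched, the demonstrations only change the context the model conditions on.

```python
# Minimal sketch: building ICL prompts from demonstration "input-label" pairs.
# The task, demonstrations, and query below are made up for illustration.

TASK_DESCRIPTION = "Classify the sentiment of the review as Positive or Negative."

demonstrations = [
    ("The plot was gripping from start to finish.", "Positive"),
    ("I walked out halfway through the film.", "Negative"),
]

query = "The soundtrack alone makes this movie worth watching."


def build_prompt(task_description, demos, query_text):
    """Concatenate a task description, zero or more demonstrations, and the query.

    len(demos) == 0 -> zero-shot, == 1 -> one-shot, >= 2 -> few-shot.
    """
    lines = [task_description, ""]
    for text, label in demos:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query_text}")
    lines.append("Sentiment:")  # the model is expected to continue with the label
    return "\n".join(lines)


print(build_prompt(TASK_DESCRIPTION, demonstrations, query))      # few-shot
print(build_prompt(TASK_DESCRIPTION, demonstrations[:1], query))  # one-shot
print(build_prompt(TASK_DESCRIPTION, [], query))                  # zero-shot
```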
Although ICL has achieved great success in terms of performance, its working mechanism is still an open research problem. To better understand how ICL works, we next look at how a study from Peking University, Tsinghua University, and other institutions explains it.
- Paper address: https://arxiv.org/pdf/2212.10559v2.pdf
- Project address: https://github.com/microsoft/LMOps
ICL performs implicit fine-tuning

To better understand how ICL works, the study interprets the language model as a meta-optimizer and ICL as a meta-optimization process, understanding ICL as a kind of implicit fine-tuning and attempting to establish a link between GPT-based ICL and explicit fine-tuning. Theoretically, the study finds that Transformer attention has a dual form with gradient-descent-based optimization.
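The duality can be sketched as follows; this is a simplified reconstruction under the relaxed (unnormalized) linear-attention assumption the paper uses, with W_K and W_V the key/value projections, X' the demonstration tokens, X the query-text tokens, and q the current query vector.

```latex
% Dual form of gradient descent on a linear layer: the updated layer equals the
% initial weights plus accumulated outer products of error signals e_i and
% training inputs x_i, which can be read as (linear) attention over the training data.
F(x) = (W_0 + \Delta W)\,x, \qquad
\Delta W = \sum_i e_i\, x_i^{\top}
\;\Longrightarrow\;
F(x) = W_0\,x + \sum_i e_i\,\bigl(x_i^{\top} x\bigr)

% Relaxed (unnormalized) linear attention over demonstrations X' and query text X:
% the demonstration part acts like a meta-gradient update \Delta W_{ICL}
% applied on top of the zero-shot weights W_{ZSL}.
F_{\mathrm{ICL}}(q) \approx W_V\,[X';X]\,\bigl(W_K\,[X';X]\bigr)^{\top} q
= \underbrace{W_V X\,(W_K X)^{\top}}_{W_{\mathrm{ZSL}}}\, q
\;+\; \underbrace{W_V X'\,(W_K X')^{\top}}_{\Delta W_{\mathrm{ICL}}}\, q
```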
Based on this, the study proposes a new perspective to explain ICL: GPT first generates meta-gradients from the demonstration examples, and then applies these meta-gradients to the original GPT to build the ICL model.
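A small numerical sketch of this decomposition, under the relaxed (unnormalized) linear-attention assumption above: attention over the concatenated demonstration and query tokens splits exactly into a zero-shot term plus an update built only from the demonstrations. The dimensions and random inputs are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                      # hidden size (arbitrary)
n_demo, n_query = 4, 3

W_K = rng.normal(size=(d, d))   # key projection
W_V = rng.normal(size=(d, d))   # value projection

X_demo = rng.normal(size=(d, n_demo))    # demonstration tokens X'
X_text = rng.normal(size=(d, n_query))   # query-text tokens X
q = rng.normal(size=(d,))                # current query vector

# Relaxed (unnormalized, linear) attention over the concatenation [X'; X]
X_all = np.concatenate([X_demo, X_text], axis=1)
attn_full = W_V @ X_all @ (W_K @ X_all).T @ q

# Zero-shot term: attention restricted to the query text only
W_zsl = W_V @ X_text @ (W_K @ X_text).T

# "Meta-gradient" update contributed by the demonstrations only
delta_W_icl = W_V @ X_demo @ (W_K @ X_demo).T

# The two views coincide: full attention = zero-shot weights + ICL update
assert np.allclose(attn_full, (W_zsl + delta_W_icl) @ q)
print("relaxed linear attention == (W_zsl + delta_W_icl) @ q")
```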
As shown in Figure 1, ICL and explicit fine-tuning share a dual optimization form based on gradient descent; the only difference is that ICL produces meta-gradients through forward computation, while fine-tuning computes gradients through backpropagation. It is therefore reasonable to understand ICL as a kind of implicit fine-tuning.

The study first qualitatively analyzes Transformer attention in a relaxed linear-attention form to expose its duality with gradient-descent-based optimization. It then compares ICL with explicit fine-tuning and establishes a link between these two forms of optimization. Based on these theoretical findings, it proposes to understand ICL as implicit fine-tuning.

First, the study treats Transformer attention as meta-optimization and interprets ICL as a meta-optimization process: (1) a Transformer-based pre-trained language model serves as the meta-optimizer; (2) it produces meta-gradients from the demonstration examples through forward computation; (3) the meta-gradients are applied to the original language model through attention to build the ICL model.

Next comes the comparison of ICL and fine-tuning. Across a range of settings, the study finds that ICL shares many properties with fine-tuning, organized along four aspects: both perform a form of gradient descent; both use the same training information; both see the training examples in the same causal order; and both act on attention. Given all these common properties, the study argues that it is reasonable to understand ICL as implicit fine-tuning, and the rest of the paper compares ICL and fine-tuning empirically from multiple angles to provide quantitative support for this understanding. In addition, inspired by the meta-optimization view, the study designs a momentum-based attention by analogy with the momentum-based gradient descent algorithm, which consistently outperforms vanilla attention (a rough sketch of this idea is given after the results below).

Experimental results

Table 2 shows the validation accuracy on six classification datasets under the ZSL (zero-shot learning), ICL, and fine-tuning (FT) settings. Both ICL and fine-tuning achieve considerable improvements over ZSL, which means the optimizations they perform do help on these downstream tasks. Moreover, the study finds that ICL performs better than fine-tuning in few-shot scenarios.
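Here is that rough sketch of the momentum-based attention idea. The formulation below is an assumption for illustration rather than the paper's exact definition: by analogy with momentum SGD, an exponentially decayed average of the value vectors is added to the vanilla attention output.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def vanilla_attention(q, K, V):
    """Standard scaled dot-product attention for a single query vector."""
    d = q.shape[-1]
    weights = softmax(K @ q / np.sqrt(d))   # (n,)
    return weights @ V                       # (d,)

def momentum_attention(q, K, V, eta=0.9):
    """Illustrative momentum-style attention (assumed formulation):
    an exponentially decayed average of earlier value vectors is added
    to the vanilla attention output, mimicking the moving average that
    momentum SGD keeps over past gradients."""
    n = V.shape[0]
    decay = eta ** np.arange(n - 1, -1, -1)          # older positions decay more
    ema_values = (decay[:, None] * V).sum(axis=0) / decay.sum()
    return vanilla_attention(q, K, V) + ema_values

# toy usage with random keys, values, and query
rng = np.random.default_rng(1)
n, d = 6, 8
K, V, q = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(d,))
print(momentum_attention(q, K, V).shape)  # (8,)
```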
Table 3 reports the Rec2FTP scores of the 2 GPT models on the 6 datasets. On average, ICL correctly predicts 87.64% of the examples that fine-tuning is able to correct relative to ZSL. These results indicate that, at the prediction level, ICL covers most of the correct behaviors of fine-tuning.

Table 3 also shows the SimAOU scores, averaged over examples and layers, for the 2 GPT models on the 6 datasets. For comparison, the study provides a baseline metric (Random SimAOU) that measures the similarity between ICL updates and randomly generated updates. As the table shows, ICL updates are far more similar to fine-tuning updates than to random updates, which means that, at the representation level, ICL tends to change the attention outputs in the same direction as fine-tuning does (a rough sketch of how such a score can be computed is given at the end of this section).

Finally, Table 3 shows the SimAM scores, averaged over examples and layers, for the 2 GPT models on the 6 datasets. As the baseline for SimAM, ZSL SimAM measures the similarity between ICL attention weights and ZSL attention weights. Comparing the two metrics, the study finds that ICL is more inclined to generate attention weights similar to those of fine-tuning than to those of ZSL. Thus, at the level of attention behavior as well, ICL behaves like fine-tuning.

To explore the similarity between ICL and fine-tuning more thoroughly, the study compares SimAOU and SimAM scores across layers. Randomly sampling 50 validation examples from each dataset yields the SimAOU and SimAM box plots shown in Figures 2 and 3 below. The scores fluctuate at the lower layers and become more stable at the higher layers, which suggests that the meta-optimization performed by ICL has a forward accumulation effect: as the accumulation grows, ICL behaves more like fine-tuning at the higher layers.
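As a rough illustration of the representation-level comparison above, here is a sketch of a SimAOU-style score: the similarity between how ICL changes a layer's attention output (relative to zero-shot) and how fine-tuning changes it, with a random-update baseline. The function names and the use of plain cosine similarity are assumptions for illustration, and the random vectors stand in for real model activations.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def simaou(h_zsl, h_icl, h_ft):
    """SimAOU-style score (illustrative): similarity between the attention-output
    update induced by ICL and the update induced by fine-tuning, both measured
    against the zero-shot (ZSL) hidden state of the same layer and example."""
    return cosine(h_icl - h_zsl, h_ft - h_zsl)

# toy usage with random hidden states in place of real activations
rng = np.random.default_rng(2)
d = 16
h_zsl = rng.normal(size=d)
update = rng.normal(size=d)
h_ft = h_zsl + update
h_icl = h_zsl + 0.8 * update + 0.1 * rng.normal(size=d)  # ICL update correlated with FT
h_rand = h_zsl + rng.normal(size=d)                       # random-update baseline

print("SimAOU(ICL, FT):", round(simaou(h_zsl, h_icl, h_ft), 3))
print("Random SimAOU:  ", round(simaou(h_zsl, h_rand, h_ft), 3))
```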
Summary

In conclusion, this article aims to explain the working mechanism of GPT-based ICL. Theoretically, the study derives the dual form of ICL and proposes to understand ICL as a meta-optimization process. Furthermore, it establishes a link between ICL and a specific fine-tuning setting, finding it reasonable to regard ICL as implicit fine-tuning. To support this understanding, the study comprehensively compares the behavior of ICL with that of explicit fine-tuning on real tasks, and the results show that ICL does behave similarly to explicit fine-tuning. In addition, inspired by the meta-optimization view, the study designs a momentum-based attention that achieves consistent performance improvements. The authors hope this work helps more people gain insight into ICL applications and model design.
