
How does the chain of thought unleash the hidden capabilities of language models? The latest theoretical research reveals the mystery behind it

Jun 03, 2023 04:49 PM

One of the most intriguing emergent phenomena of large models is Chain-of-Thought prompting (CoT), which has produced striking results, especially on mathematical reasoning and decision-making problems. How important is CoT, and what mechanism lies behind its success? In this work, several researchers from Peking University prove that CoT is indispensable for large language model (LLM) reasoning, and reveal from both theoretical and experimental perspectives how CoT unleashes the enormous potential of LLMs.

Recent research has found that Chain-of-Thought prompting (CoT) can significantly improve the performance of large language models (LLMs), especially on complex tasks involving mathematics or reasoning. Yet despite these successes, the mechanism behind CoT and how it unlocks the potential of LLMs remain elusive.

Recently, a new study from Peking University revealed the mystery behind CoT from a theoretical perspective.


Paper link: https://arxiv.org/abs/2305.15408

Transformer-based large language models have become the standard model in natural language processing and are widely used across tasks. Mainstream large models are usually built on the autoregressive paradigm: diverse tasks (such as text translation, text generation, and question answering) are uniformly cast as sequence-generation problems, in which the question and its description are encoded together into a token sequence called a prompt, and answering the question is reduced to conditionally generating the subsequent tokens given that prompt.
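As a rough illustration of this paradigm, the sketch below implements greedy autoregressive decoding; the `model` object and its `next_token_logits` method are hypothetical stand-ins for an arbitrary Transformer language model rather than any actual API.

```python
def generate(model, prompt_tokens, max_new_tokens, eos_id):
    """Greedy autoregressive decoding: the task is cast as
    'continue the token sequence that encodes the prompt'."""
    tokens = list(prompt_tokens)                    # prompt: question + task description
    for _ in range(max_new_tokens):
        logits = model.next_token_logits(tokens)    # condition on everything generated so far
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        tokens.append(next_id)                      # feed the new token back as input
        if next_id == eos_id:                       # stop once the answer is complete
            break
    return tokens
```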


A large body of research on large models has shown that carefully designed prompts play a crucial role in model performance. In particular, for arithmetic- or reasoning-related tasks, CoT has been shown to significantly improve the correctness of the generated answers. As shown in the figure below, for a task requiring mathematical reasoning, the answers directly generated by a large model are often wrong (panels a and b below). However, if the prompt is modified so that the large model outputs the entire chain of thought (the intermediate derivation steps), it eventually arrives at the correct answer (panels c and d below).

[Figure: answers generated directly by large models are wrong (panels a and b), while chain-of-thought prompts lead to the correct answer (panels c and d)]

In practice, there are two mainstream ways to realize chain-of-thought prompting: one is to append a specific trigger phrase such as "Let's think step by step" to the prompt (as in panel c above); the other is to provide a small number of chain-of-thought demonstrations so that the large model imitates the corresponding derivation process (as in panel d above).
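The two prompting styles can be sketched as plain prompt construction. The strings below are illustrative examples, not the exact prompts used in the paper.

```python
question = "If a train travels 60 km in 40 minutes, what is its speed in km/h?"

# Style 1: zero-shot CoT, triggered by appending a fixed phrase to the question.
zero_shot_cot_prompt = question + "\nLet's think step by step."

# Style 2: few-shot CoT, where demonstrations show a full derivation
# before the new question, so the model imitates the format.
demonstration = (
    "Q: A car travels 30 km in 20 minutes. What is its speed in km/h?\n"
    "A: 20 minutes is 1/3 of an hour. 30 km / (1/3 h) = 90 km/h. The answer is 90.\n\n"
)
few_shot_cot_prompt = demonstration + "Q: " + question + "\nA:"
```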

However, although CoT has achieved remarkable performance in a large number of experiments, the theoretical mechanism behind it remains a mystery. On the one hand, do large models really have an inherent theoretical limitation when directly answering mathematical or reasoning questions? On the other hand, why does CoT improve the capabilities of large models on these tasks? This paper answers both questions from a theoretical perspective.

Specifically, the researchers study CoT from the perspective of model expressiveness: for mathematical tasks and general decision-making tasks, the paper analyzes the expressive power of autoregressive Transformers in two settings: (1) directly generating the answer, and (2) generating the complete solution steps with CoT.

CoT is the key to solving mathematical problems

Large models, represented by GPT-4, have demonstrated astonishing mathematical capabilities. For example, GPT-4 can correctly solve most high-school math problems and has even become a research assistant for mathematicians.

To study the mathematical capabilities of large models, the paper selects two basic yet core mathematical tasks: arithmetic and linear equations (the figure below shows input/output examples for both tasks). Because they are fundamental building blocks of more complex mathematical problems, studying these two core problems yields a deeper understanding of the capabilities of large models on general mathematical problems.

[Figure: input/output examples for the arithmetic and linear-equation tasks]
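To make the two answer formats concrete, here are hypothetical target sequences for one arithmetic instance and one equation instance; the exact tokenization and notation used in the paper may differ.

```python
# Arithmetic instance: the prompt encodes the expression to evaluate.
prompt = "3 * (4 + 5) ="
direct_target = "27"                       # (1) direct answer: emit the result in one shot
cot_target = "3 * (4 + 5) = 3 * 9 = 27"    # (2) CoT: emit every intermediate simplification

# Linear-equation instance in the same spirit.
eq_prompt = "solve 2x + 3 = 11"
eq_direct_target = "x = 4"
eq_cot_target = "2x = 11 - 3 = 8, x = 8 / 2 = 4"
```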

The researchers first explored whether a Transformer could output answers to the above problems without outputting intermediate steps. They adopted an assumption that closely matches practice: a log-precision Transformer, in which each neuron can only represent a floating-point number of limited precision (O(log n) bits, where n is the maximum sentence length). This assumption is close to reality; in GPT-3, for example, the machine precision (16 or 32 bits) is much smaller than the maximum output length (2048).

Under this assumption, the researchers proved a core impossibility result: for a constant-depth autoregressive Transformer of width d to directly output answers to the two mathematical problems above, the model width d must be extremely large. Specifically, d must grow super-polynomially as the input length n grows.

The essential reason is that neither problem admits an efficient parallel algorithm, so the Transformer, as a quintessentially parallel model, cannot solve them directly. The paper proves this theorem rigorously using circuit complexity theory from theoretical computer science.

So what if, instead of outputting the answer directly, the model outputs the intermediate derivation steps in the form shown in the figure above? Through an explicit construction, the researchers further proved that when the model is allowed to output intermediate steps, a fixed-size autoregressive Transformer (whose size does not depend on the input length n) can solve both mathematical problems.

Comparing this with the previous result shows that adding CoT greatly improves the expressive power of large models. The researchers also give an intuitive explanation: CoT continuously feeds the generated output tokens back to the input layer, which greatly increases the effective depth of the model, making it proportional to the length of the CoT output and thereby lifting the Transformer beyond its parallel-complexity limitations.
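A toy way to see this intuition: during autoregressive decoding, the whole layer stack is re-applied for every generated token, so the computation leading to the final answer can pass through roughly (number of layers) × (number of CoT tokens) sequential layer applications. The sketch below merely counts these applications; it illustrates the intuition and is not the paper's formal construction.

```python
def sequential_layer_applications(num_layers, num_cot_tokens):
    """Count the layer applications on the sequential path to the final answer
    when every generated CoT token is fed back as input."""
    depth = 0
    for _ in range(num_cot_tokens):   # one decoding step per generated token
        depth += num_layers           # the whole stack runs again on the extended input
    return depth

# A 3-layer model emitting a 40-token chain of thought behaves, along this path,
# like a network of effective depth 120, versus depth 3 when answering directly.
assert sequential_layer_applications(3, 40) == 120
```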

CoT is the key to solving general decision-making problems

Beyond mathematical problems, the researchers further considered CoT's ability to solve general tasks. Starting from decision-making problems, they considered a general framework for solving such problems: dynamic programming.

The basic idea of dynamic programming (DP) is to decompose a complex problem into a sequence of small-scale sub-problems that can be solved in order. The decomposition ensures substantial overlap between sub-problems, so that each sub-problem can be solved efficiently using the answers to the sub-problems solved before it.

The longest increasing subsequence (LIS) and edit distance (ED) are two classic DP problems covered in the book "Introduction to Algorithms". The table below lists the state space, transition function, and aggregation function for these two problems.

[Table: state space, transition function, and aggregation function for the LIS and ED problems]
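As a concrete instance of this state / transition / aggregation decomposition, the sketch below solves LIS with the standard quadratic dynamic program and records the answer to every sub-problem, which is the kind of step-by-step trace a chain of thought would serialize. The function and variable names are illustrative, not taken from the paper.

```python
def lis_with_trace(a):
    """Longest increasing subsequence via dynamic programming.
    State:       dp[i] = length of the longest increasing subsequence ending at a[i]
    Transition:  dp[i] = 1 + max(dp[j] for j < i with a[j] < a[i]), or 1 if no such j
    Aggregation: answer = max(dp)"""
    dp = [1] * len(a)
    trace = []                                   # sub-problem answers, in solving order
    for i in range(len(a)):
        for j in range(i):
            if a[j] < a[i]:
                dp[i] = max(dp[i], dp[j] + 1)
        trace.append(f"dp[{i}] = {dp[i]}")       # one CoT step per sub-problem
    return max(dp), trace

length, steps = lis_with_trace([5, 2, 8, 6, 3, 6, 9, 7])
# length == 4 (e.g. 2, 3, 6, 9); `steps` is the serialized chain of sub-problem results.
```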

The researchers proved that an autoregressive Transformer can output a complete dynamic-programming chain of thought, solving the sub-problems in sequence, and can therefore produce correct answers for every task solvable by dynamic programming. They further showed that generating the chain of thought is necessary: for many hard dynamic-programming problems, a constant-depth, polynomial-size Transformer cannot directly output the correct answer. As a counterexample, the paper uses the membership-testing problem for context-free grammars.

Experiment

Finally, the researchers designed extensive experiments to verify the theory above, considering four tasks: evaluating arithmetic expressions, solving linear equations, finding the longest increasing subsequence, and computing the edit distance.

Experimental results show that when trained on CoT data, a 3-layer autoregressive Transformer can already achieve near-perfect performance on all four tasks, whereas models trained to output the answer directly perform poorly on all tasks (even with deeper models). This result clearly demonstrates both the ability of the autoregressive Transformer to solve a variety of complex tasks and the importance of CoT in solving them.

[Figure: experimental results of CoT-trained models versus direct-answer models on the four tasks]

The researchers also explored whether the learned autoregressive model can extrapolate to longer inputs. They constructed a CoT training dataset for the arithmetic task with the number of operators ranging from 1 to 15, and tested the model on expressions with n ∈ {16, 17, 18} operators. The results are shown in Figure 3 below: the three-layer Transformer still performs well on longer sequences, indicating that the model has, to some extent, learned the underlying mechanism. The researchers therefore believe that a model trained on data of more varied lengths could eventually learn the complete rules of arithmetic.

[Figure 3: length extrapolation results on expressions with 16 to 18 operators]
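As a minimal sketch of how such a length split could be constructed, the code below samples fully parenthesized expressions with a prescribed number of operators; the operator set, operand range, and sampling scheme are assumptions, and the paper's actual data pipeline may differ.

```python
import random

def random_expression(num_operators, max_operand=9):
    """Build a random fully parenthesized arithmetic expression
    containing exactly `num_operators` binary operators."""
    expr = str(random.randint(0, max_operand))
    for _ in range(num_operators):
        op = random.choice(["+", "-", "*"])
        expr = f"({expr} {op} {random.randint(0, max_operand)})"
    return expr

# Training split uses 1 to 15 operators; the test split probes extrapolation to 16-18.
train_exprs = [random_expression(random.randint(1, 15)) for _ in range(10)]
test_exprs = [random_expression(n) for n in (16, 17, 18)]
```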
