Chain of Thought (CoT) has been around for some time. It is technically a form of advanced prompt engineering, but it remains highly relevant even a few years after its introduction. All forms of CoT are generally intended to force large language models to reason.
After OpenAI released its o1-preview model in September this year, we have seen the hype around CoT increase.
No one knows exactly how o1 works (except OpenAI): whether it is a combined system, what data it was fine-tuned on, whether reinforcement learning is used, or whether several models work together.
Perhaps one model is responsible for planning, another for thinking, and a third for evaluation. But we do know they are taking some kind of step-by-step reasoning approach.
A lot of public research has been done around this topic, and it is worth digging into. So in this post I will cover the existing methods so you know which ones you can use. Of course, I'll also test different techniques to see whether we can achieve any real improvement.
Then, if you are keen on doing some technical work, I will help you build a system that looks at the model's internal confidence level when generating answers.
Many papers have been published over the past two years, and I have collected the ones I found here.
You will see the reasoning techniques they discuss in the picture below.
Most of the work comes directly from DeepMind or Princeton University. Kudos to them for open-sourcing so much of it.
The term CoT comes from DeepMind's 2022 paper, which applied it only in prompting; the most recent papers explore Tree of Thoughts (ToT) with Monte Carlo Search and CoT without prompting.
In this article, we will cover simple Chain of Thought (CoT), CoT chains, greedy decoding, CoT-SC, CoT decoding, and Tree of Thoughts (ToT) with Monte Carlo Tree Search.
We will also use our own dataset to understand the improvements we can make by using these reasoning techniques.
To understand how to improve the results of large language models, we first need to establish some kind of benchmark score.
When a model is released, it usually comes with evaluation metrics. There are some popular benchmarks, such as MMLU (language understanding), BigBench (reasoning), HellaSwag (commonsense reasoning), and so on.
However, you should know that some of these datasets are outdated and may be a little contaminated.
Hugging Face launched a new LLM leaderboard in December, evaluated on newer datasets, and you can clearly see that most models score much lower than they did on the original datasets.
It is worth doing some research here to understand how you should think about model evaluation, and on what grounds you and your organization should evaluate. Having an internal private dataset for testing may not be the worst idea.
But anyway, I pulled about 350 questions from various datasets, plus some popular questions I found online, to evaluate up to 11 different models.
I also wanted to understand what these questions and the answers the large language models generate for them actually look like.
So I built my own script to loop through the questions and score the large language model with a 0 or 1 for each question.
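For illustration, a minimal version of such a loop might look like the sketch below; the ask_llm helper and the structure of the questions list are hypothetical, and the naive substring check stands in for however you choose to grade each answer.

<code># Minimal sketch of an evaluation loop (ask_llm and the question format are placeholders)
def evaluate(questions, ask_llm):
    """questions: list of dicts with 'question' and 'expected_answer' keys."""
    results = []
    for q in questions:
        answer = ask_llm(q["question"])  # call the model under test
        # naive scoring: 1 if the expected answer shows up in the response, else 0
        score = 1 if str(q["expected_answer"]).lower() in answer.lower() else 0
        results.append({"question": q["question"], "score": score})
    accuracy = sum(r["score"] for r in results) / len(results)
    return accuracy, results</code>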
You can call me a perfectionist. You can see what I found below.
What does this tell us? Well, not much.
I used questions from BigBench, MMLU, Putnam, and popular questions like "How many r's are there in strawberry", but we have no way of knowing whether the models have been contaminated with these questions. Furthermore, this is a fairly small dataset.
However, we can clearly see that larger models perform better.
The interesting question is whether we can improve these scores by applying methods that make the model reason and "think" before answering.
Chain of Thought (CoT) prompting was introduced by the Brain Team (now part of Google DeepMind) in the 2022 paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models."
The idea of CoT has been around for quite some time.
However, this first paper was a study of how to force the model to reason about a problem by activating its inherent reasoning ability through prompting strategies.
At the time, people were simply prompting in the right way by asking the model to "think it through", either zero-shot (providing no examples) or few-shot (providing a few examples).
Today, you can do this with Claude, ChatGPT, or other models by simply adding "Let's think step by step" at the end of the prompt. If you want to try few-shot learning, you can provide a few examples in the prompt.
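As a minimal sketch of zero-shot CoT (using the OpenAI Python client and a model name purely as examples; any chat-style API works the same way), it really is just a matter of appending the trigger phrase:

<code># Zero-shot CoT sketch: append the reasoning trigger to the prompt
# (the OpenAI client and model name are just examples; any chat API works the same way)
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

question = "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": question + "\n\nLet's think step by step."}],
)
print(response.choices[0].message.content)</code>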
DeepMind reported that they could verify significant improvements from using CoT techniques, provided the prompting was done correctly.
Since then, many papers have built on these techniques, extending them along increasingly advanced paths.
Many people in the prompt-engineering community experiment with CoT-style techniques. I've collected most of the repositories I've found here, so they're easy to find.
Not long ago, Benjamin Klieger got attention for a prompting-style application he built that uses Groq and Llama 3.1 70B to induce chain-of-thought reasoning by further breaking down the thinking process.
You can find his app here.
The idea is to ask the large language model to break its thinking down into chains, and to keep thinking until it is confident in the answer.
The system then makes a separate LLM call for each part of the chain, rather than keeping the entire thinking process in a single response.
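A stripped-down version of that loop could look something like the sketch below. The call_llm helper and the exact prompt are hypothetical; the point is the structure, one LLM call per reasoning step until the model declares a final answer.

<code># Sketch of a chained reasoning loop (call_llm is a hypothetical helper that
# sends a list of chat messages to any model and returns the text of the reply)
import json

SYSTEM_PROMPT = (
    "You are solving a problem step by step. Reply with JSON for each step: "
    '{"title": "...", "content": "...", "next_action": "continue" or "final_answer"}'
)

def reason_in_steps(question, call_llm, max_steps=10):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    steps = []
    for _ in range(max_steps):
        step = json.loads(call_llm(messages))  # one LLM call per step of the chain
        steps.append(step)
        messages.append({"role": "assistant", "content": json.dumps(step)})
        if step["next_action"] == "final_answer":
            break
    return steps</code>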
See the example below of applying this to Grok-Beta, with the question "How many R's are there in strawberry?"
The model itself sets up each section, names it, and decides whether another "thought" is needed and it should continue, or whether it has reached the final answer.
This is still a CoT-style technique, since it is linear, but it is slightly more advanced than simply asking the model to "think step by step".
I used some of his code to build a script that loops through some of the benchmark questions for the large language models, to see how much improvement such a system actually produces. I also adapted the scripts for Claude and Grok to evaluate this strategy.
You will see the percentage improvement below.
Llama 3.1 70B achieved the best improvement in the first three categories. Grok did worse on the popular questions (as did Haiku).
The Putnam dataset is advanced mathematics, and few large language models do well on it, so I was surprised when Claude Sonnet 3.5 with these CoT chains was able to outperform o1-preview, scoring 68.75% to o1-preview's 63%.
Overall, using CoT improved Sonnet by 81% on advanced mathematics.
Remember, I'm using a very small dataset here, just to get a sense of what the models do well at and whether we can improve the scores. It doesn't tell us anything definitive without testing on a larger dataset.
However, I also observed that smaller models may produce worse results if they start over-analyzing simple problems. This was evident for Grok-Beta and Haiku on the popular "easier" questions.
Easier, non-mathematical questions may not get the same benefit from CoT.
We must also remember that we can push models to work within their capabilities, but rarely exceed their capabilities. If it doesn't know the answer, it doesn't know.
I would like to mention fine-tuning before continuing.
A very interesting area is fine-tuning smaller models on CoT datasets to improve their accuracy, bringing them up to the accuracy of models one to two sizes larger.
I have found multiple resources, but unfortunately I haven't found anything that showed a significant enough improvement over the base model to be worth a proper analysis.
You will see the open source model I found below.
You will see the open source CoT dataset I also found below.
This is not to say that fine-tuning for CoT won't work, just that it requires building better, well-documented models.
If you are keen to try fine-tuning yourself, check out these resources. I believe there are more resources.
So far, we have been looking at linear techniques, where the model produces its output in a single thread (or chain).
But shortly after the first CoT paper was published, DeepMind proposed a more advanced technique called Chain of Thought with Self-Consistency (CoT-SC).
This technique creates multiple reasoning paths and uses some method to select the most consistent answer (or path) at the end.
They reported roughly a 1-8% improvement on arithmetic reasoning using this method.
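Under the hood the idea is simple: sample several CoT completions at a non-zero temperature and take a majority vote over the final answers. A minimal sketch, with hypothetical sample_cot_answer and extract_final_answer helpers:

<code># CoT-SC sketch: sample several reasoning paths, then majority-vote the final answers
# (sample_cot_answer and extract_final_answer are hypothetical helpers)
from collections import Counter

def cot_self_consistency(question, sample_cot_answer, extract_final_answer, n_paths=5):
    answers = []
    for _ in range(n_paths):
        completion = sample_cot_answer(question)  # one independent CoT path (temperature > 0)
        answers.append(extract_final_answer(completion))
    # the most consistent (most common) final answer wins
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer</code>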
A method proposed just this year follows the same idea of using multiple paths, but without using any prompting.
Remember the idea of greedy decoding that I discussed in the previous section?
This approach is similar, except that rather than only forcing the most likely token, it also looks at the confidence scores of the entire response.
To do this, the system first branches out from a number k of initial top tokens and then generates a path from each of them. Once the answers have been generated, it calculates a confidence score by analyzing the probabilities (logits) of each token across the different paths.
The answer (or path) with the highest confidence is returned.
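As a simplified sketch of the procedure (assuming a Hugging Face causal LM; the real implementation we use later is more complete), it might look like this:

<code># Simplified CoT-decoding sketch: branch on the k most likely first tokens,
# greedily decode each branch, and keep the answer the model is most confident in.
import torch

def cot_decode_sketch(model, tokenizer, prompt, k=5, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        first_logits = model(**inputs).logits[0, -1, :]   # logits for the first new token
    top_k_tokens = torch.topk(first_logits, k).indices    # k alternative starting tokens

    best_answer, best_confidence = None, -1.0
    for token_id in top_k_tokens:
        # force a different first token, then decode the rest of the path greedily
        branch_ids = torch.cat([inputs["input_ids"], token_id.view(1, 1)], dim=-1)
        with torch.no_grad():
            out = model.generate(branch_ids, max_new_tokens=max_new_tokens, do_sample=False,
                                 output_scores=True, return_dict_in_generate=True)
        # confidence: average gap between the top-1 and top-2 token probabilities
        gaps = []
        for scores in out.scores:
            top2 = torch.topk(torch.softmax(scores[0], dim=-1), 2).values
            gaps.append((top2[0] - top2[1]).item())
        confidence = sum(gaps) / len(gaps)
        answer = tokenizer.decode(out.sequences[0, branch_ids.shape[-1]:], skip_special_tokens=True)
        if confidence > best_confidence:
            best_answer, best_confidence = answer, confidence
    return best_answer, best_confidence</code>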
This method is called CoT decoding and was proposed by DeepMind. The idea is to look at the model's internal confidence in the answer it returns.
But what happens if the model doesn't have the inherent knowledge to answer the question? Like CoT-SC, this approach depends heavily on whether the model has the right answer in it in the first place.
However, this does not mean we should not test it.
For all of these techniques there are various practical open-source implementations, and this one is no exception.
So it was easy for me to build a system to test these methods and compare which works best using the smaller open-source model Llama 3 8B.
Thanks to Codelion for open-sourcing his implementation, which made it easy for me to replicate.
Looking at the results above, you can see that CoT decoding clearly produces the best results for this particular model, compared with other methods such as entropy-based decoding or plain greedy decoding.
In the technical section, we will create an API that uses this CoT decoding system, so you can see how it works.
It's hard to keep up, but research has gone far beyond simple CoT for reasoning in higher-stakes domains.
I won't cover all of these strategies now, because that is another topic, but I do want to mention Tree of Thoughts (ToT), especially in combination with Monte Carlo search.
ToT was proposed by Princeton University and DeepMind in late 2023, and generally builds on earlier tree-based reasoning methods.
Tree of Thoughts (ToT) is somewhat different from Chain of Thought with Self-Consistency (CoT-SC). Rather than generating multiple paths and evaluating them only after they have been generated, ToT evaluates the thoughts dynamically as they emerge.
Think of it as 4 different people working together to solve a problem. At each step, they propose their ideas and jointly evaluate which seem the most promising. If one person's reasoning appears flawed, they leave, and the others continue working on the problem.
In the end, the people who reasoned correctly will be able to provide their answer.
This allows the model to dynamically prune paths that look bad and focus on more promising threads, potentially saving resources.
However, one might ask: how does the system decide which thread is right and which is wrong? The model itself decides.
This is why extensions like Monte Carlo Tree Search (MCTS) provide a more unbiased evaluation mechanism. MCTS allows backpropagation, meaning it can revisit and improve earlier steps based on new information, while simple ToT only moves forward.
In the four-person analogy, MCTS lets people with less-than-ideal ideas stay in the game longer, and their ideas are evaluated differently.
MCTS can simulate multiple future paths, evaluate their potential, and backtrack to improve earlier decisions. It introduces external metrics (rewards) rather than relying entirely on the model.
A statistic such as the Upper Confidence Bound (UCB) uses these rewards to decide which thoughts to explore further or to revisit.
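For reference, the UCB1 score that MCTS commonly uses to balance exploration and exploitation can be computed like this (a standard textbook formulation, not code from the papers above):

<code># Standard UCB1: average reward for a node plus an exploration bonus
import math

def ucb1(total_reward, visits, parent_visits, c=1.41):
    if visits == 0:
        return float("inf")  # always try an unvisited thought first
    exploitation = total_reward / visits                           # how good this thought has looked so far
    exploration = c * math.sqrt(math.log(parent_visits) / visits)  # bonus for rarely visited thoughts
    return exploitation + exploration</code>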
MCTS is a bit more complicated than simple ToT and probably deserves a separate article.
So by now you might be thinking: well, we see some improvements, so why not always use a more advanced form of chain of thought?
Well, first of all: cost (and thinking time).
For the chains I applied to the different models, I calculated the average number of reasoning steps.
Looking at this, you pay on average 8 times more per question. For Sonnet, which performed best on advanced math problems, you would pay up to $15 per 500 questions.
This may not seem like much, but once you use such a system to generate answers for customer service or for your team every day, you'll be spending hundreds or even thousands of dollars per month.
In some cases, it makes sense to use advanced reasoning methods, but not always.
There may now be a case for fine-tuning for CoT, essentially eliminating the need to generate multiple calls, but I haven't seen any well-executed open-source models doing this so far.
There is a trade-off here. We want to increase thinking time so the model has enough room to reason effectively, but doing so also increases user frustration and cost.
In September this year, a paper titled "To CoT or not to CoT?" was published, arguing that most of the improvement from applying CoT shows up mainly in mathematics and complex reasoning.
We see this here too, with limited improvements to simple questions.
When we apply these chains, we have to wait longer for a response. Is it worth it? It should also be noted that all these strategies can be overkill for simple tasks.
This is why you may feel frustrated using OpenAI's o1 for most questions, where a simple answer is usually good enough.
But if you are building a system that needs to make sure the answers are correct, it may well be worth adopting some form of CoT or CoT decoding.
It may be worth using one model to handle the first step based on the difficulty of the question, first analyzing whether it is confident it can answer at all. Then let the model reason (through a chain), and at the end have another model score the response.
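As a sketch of what that routing could look like (all four helper functions here are hypothetical placeholders for your own model calls):

<code># Sketch of a difficulty-routing pipeline (classify_difficulty, answer_directly,
# answer_with_cot and score_response are hypothetical model-call helpers)
def answer_question(question, classify_difficulty, answer_directly,
                    answer_with_cot, score_response):
    # Step 1: a cheap call that estimates how hard the question is
    difficulty = classify_difficulty(question)  # e.g. "easy" or "hard"

    # Step 2: only pay for chain-of-thought reasoning when it is likely to help
    answer = answer_directly(question) if difficulty == "easy" else answer_with_cot(question)

    # Step 3: let another model score the response; fall back to reasoning if it looks weak
    if score_response(question, answer) < 0.5:
        answer = answer_with_cot(question)
    return answer</code>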
Are there more frameworks besides the ones I've covered here? Absolutely, but I've only introduced the ones I found interesting to understand. This gives you an idea of how far we have come without overloading you with information.
Most AI engineers are well versed in these frameworks, but unfortunately this research hasn't reached the general public as quickly as you might expect.
Learning how to implement CoT should be part of the basics of building LLM applications, even if you decide not to use them.
Let us put it into practice.
We will implement a CoT decoding system using the open-source model Llama 3 8B.
The CoT decoding method comes from the paper "Chain-of-Thought Reasoning Without Prompting" released this year, and the implementation was adapted from Codelion, found here. I added some functionality so that the system checks the difficulty level of the question to determine the number of paths (k).
Since I used Modal last time, we can use Beam this time, which is also a serverless LLM serving platform. They offer a 15-hour free tier, so this will be free. The script we will use can be found here.
If you prefer to use Colab for testing, you can run this script here.
The result will be an API endpoint that lets us ask a question; it will assess the difficulty, run CoT decoding on the question, and return a response like the following.
You will see the number of LLM calls and how the system classified the question. You will also notice that the system is quite slow, since it generates multiple answers to evaluate.
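Based on the return value of the endpoint code we'll walk through below, the response has roughly this shape (the values here are made up for illustration):

<code># Illustrative response shape (values are invented for illustration)
{
    "output": "...the model's reasoned answer...",
    "confidence": 0.82,
    "complexity_info": {
        "k": 5,            # number of decoding paths used
        "total_calls": 6,  # generation calls + 1 classification call
        "classification": "difficult"
    }
}</code>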
However, if we try Groq with the same 8B model, we find that it does not answer the question correctly.
The correct answer is 27.3, with bonus points for accounting for the extra fuel.
As for the final answer, I'll note that a smaller model like this can only get us so far. Unfortunately, using a larger model means more work, because we need to host it somewhere, which can get expensive.
Setting up this system will take about 5 minutes of your time. You can follow the instructions below.
First, we need to get access to the model we will use. To use the Llama 3 8B model, you need to request access through Hugging Face.
If you already have a Hugging Face account, this process is usually very quick. If you don't, you can create an account for free and navigate to the model card.
Once we're on the model card, we might as well test the model and see what kind of question we can use to try out this new system.
This is a fairly standard question that I've used in my evaluations before, but the standard Llama 3 8B model struggles with it.
After you have access, navigate to Settings to get the access token.
Save this token somewhere because we need to set it in Beam.
If you don't have a Beam account, you will need to create an account (unless you choose to use Colab directly). Of course, you can also build your own system on different platforms.
If you decide to use Beam, get the API key from its dashboard.
### Setting up the environment
Now we can get started. Open a new terminal, create a new directory, and cd into it.
<code>mkdir my-testing-dir
cd my-testing-dir</code>
Clone the repository I set up.
<code>git clone https://github.com/ilsilfverskiold/decoding-cot-beam.git </code>
Create a virtual environment (you need to install python for this).
<code>python3 -m venv .venv && source .venv/bin/activate</code>
Install Beam and authenticate.
<code>pip install beam-client
beam configure default --token "your_token_here"</code>
Make sure to set the HF_TOKEN we got from Hugging Face earlier.
<code>beam secret create HF_TOKEN</code>
You could serve the app directly from here, but let's briefly walk through the code first.
If you are not interested, you can skip the next section.
There are three python files in the root folder.
<code>├── app.py
├── question_classifier.py
└── cot_decoder.py</code>
In app.py we have the Beam code that lets us download the model weights from Hugging Face (at startup) and cache them in a volume. This means the first run can be a bit clunky and slow.
Beam also allows us to load packages when the script runs remotely on Beam.
Below is the beginning of app.py with my comments:
<code>[...]

# This ensures these packages are only loaded when the script runs remotely on Beam
if env.is_remote():
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from cot_decoder import cot_decode
    from question_classifier import get_k_value

# Model parameters and where to cache it in the volume
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
CACHE_PATH = "./cached_models2"

# Load the model and tokenizer
def load_models():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir=CACHE_PATH)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        device_map="auto",
        torch_dtype=torch.float16,
        cache_dir=CACHE_PATH
    )
    return model, tokenizer

# Define the endpoint
# You can specify CPU/memory/GPU + the image
@endpoint(
    secrets=["HF_TOKEN"],
    on_start=load_models,  # load the model at startup so it gets cached
    name="meta-llama-3-8b-instruct",
    cpu=2,
    memory="32Gi",
    gpu="A100-40",
    image=Image(
        python_version="python3.9",
        python_packages=["torch", "transformers", "accelerate"],
    ),
    volumes=[Volume(name="cached_models2", mount_path=CACHE_PATH)],
)
[...]</code>
We define an @endpoint with the resources we want to use (an A100 GPU and 2 CPU cores). You can also see that we load the model at startup.
When an API call comes in, we run the generate_text() function.
<code>[...]

def generate_text(context: Dict[str, Any], **inputs: Dict[str, Any]) -> Dict[str, Any]:
    # Retrieve the model and tokenizer from on_start
    model, tokenizer = context.on_start_value

    # (assumed here) pull the chat messages and an optional k out of the request payload
    messages = inputs.pop("messages", [])
    k = inputs.pop("k", None)

    # Get an adaptive k value based on the complexity of the question
    classification_type = None
    if k is None:
        k, classification_type = get_k_value(messages, context)

    try:
        output_text, confidence, llm_calls = cot_decode(
            model=model,
            tokenizer=tokenizer,
            messages=messages,
            k=k,  # use the adaptive k value
            **inputs  # pass any other parameters straight through to cot_decode
        )
        # Return the output
        return {
            "output": output_text,
            "confidence": confidence,
            "complexity_info": {
                "k": k,
                "total_calls": llm_calls + 1,  # + the classification call
                "classification": classification_type
            }
        }
    except Exception as e:
        return {"error": f"Error during generation: {str(e)}"}</code>
We have a function that first uses get_k_value() to calculate k based on complexity. The key function here, though, is cot_decode(), which performs CoT decoding on our question.
This function receives the messages, model, and tokenizer, and makes an initial call to predict the k most likely next tokens based on their logits.
A logit is the raw score the model assigns to each possible next token, telling us how confident the model is in each option.
These serve as potential starting points for generating multiple answers. For each of these starting points, or starting tokens, we generate a complete answer and then score it as a whole.
Remember the greedy decoding we discussed, where we only ever generate the most probable next token? Here, instead of judging tokens one by one, we calculate a confidence score over the entire sentence, reflecting how certain the model is about the full answer.
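In code, the confidence for a single path boils down to something like the small sketch below: average, over the generated tokens, the gap between the top-two token probabilities at each step (the repository's implementation may differ in its details).

<code># Sketch of the per-answer confidence score: the average gap between the
# top-1 and top-2 token probabilities across the generated answer
import torch

def answer_confidence(step_logits):
    """step_logits: one logits tensor (shape [vocab_size]) per generated token."""
    gaps = []
    for logits in step_logits:
        probs = torch.softmax(logits, dim=-1)
        top2 = torch.topk(probs, 2).values
        gaps.append((top2[0] - top2[1]).item())  # how far ahead the chosen token is
    return sum(gaps) / len(gaps)                 # higher = more confident answer</code>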
After finding the path with the highest confidence score, it is returned along with the k value.
There are some other options, such as an aggregate_answers boolean for when the model returns multiple high-confidence answers, but we are not using it here.
Now that I've briefly explained the code, let's run it and see how it works.
If you have everything set up correctly, you should be able to simply call serve.
<code>beam serve app.py:generate_text</code>
If it times out, run serve again; it will have cached the model for you.
To see where the model is stored, you can go to Volumes on the Beam.Cloud platform.
Once it runs, you will see the following.
This means it is ready for testing.
You can fire up Postman or use cURL (which means you run the call to the endpoint from a terminal window).
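The cURL call would look roughly like this; the URL and token are placeholders for the endpoint address and auth token Beam prints when you run serve:

<code>curl -X POST 'https://your-beam-endpoint-url' \
  -H 'Authorization: Bearer YOUR_BEAM_AUTH_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "<your question here>"}]}'</code>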
The response should be similar to the following.
As you can see, it performs slightly better.
If you want to deploy the model, you can simply run deploy.
<code>beam deploy app.py:generate_text</code>
I only used this for testing, so I can shut it down now.
I hope this article was educational and interesting, and that you got something out of it.
If you want to see the results for the large language models and the CoT techniques, you can check this table, along with all the other resources, in this repository.
If this helped you, please leave a comment and give it a clap.
❤