Can two small models verify each other and rival a large model? Microsoft's rStar uses neither CoT nor fine-tuning

By checking each other's work, small models can solve big problems too.

LLMs are powerful, but their ability to perform complex reasoning is still limited.

For example, on the GSM8K dataset, Mistral-7B achieves only 36.5% accuracy even with techniques such as chain-of-thought (CoT) prompting. Fine-tuning can effectively improve reasoning ability, but most LLMs rely on fine-tuning data that was distilled from, or even synthesized by, more powerful models such as GPT-4.

At the same time, researchers are actively developing a complementary but more difficult approach: improving reasoning ability without a better teacher LLM.

To improve reasoning ability without a better model, a promising paradigm is to exploit the knowledge in the LLM itself. For example, a method called RAP adopts a self-exploration approach that iteratively improves the LLM's reasoning performance through self-rewarded feedback. Unfortunately, research shows that this paradigm has two fundamental problems.

First, LLMs often have difficulty exploring the solution space effectively when reasoning. Self-exploration often gets trapped in a region of the solution space filled with low-quality reasoning steps, even after multiple attempts.

Second, even if self-exploration finds high-quality reasoning steps, it is difficult for small language models (SLMs) to discern which reasoning steps are of higher quality and to determine whether the final answer is correct. This makes it difficult to guide self-exploration effectively. Research shows that self-exploration guided by basic conventional rewards yields results no better than random guessing.

What is more troublesome is that SLMs are more prone to both problems because they are less capable. For example, GPT-4 can improve its output through self-refinement, but SLMs struggle to do so, and self-refinement may even lower the quality of their output. This seriously hinders the wider adoption of neural language models.

To address these issues, a research team from Microsoft Research Asia and Harvard University proposed Self-play muTuAl Reasoning, or rStar for short. Put simply, the method is like asking two mediocre students to check each other's exam answers, which ultimately raises their scores to the point where they can compete with top students. The team claims that rStar “increases the reasoning capabilities of SLMs without the need for fine-tuning or better models.”

Paper title: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
Paper address: https://arxiv.org/pdf/2408.06195
Code address: https://github.com/zhentingqi/rStar (To be released)

Method

To solve the problems above, rStar splits the reasoning process into two parts: solution generation and mutual verification, as shown in Figure 2.

For the first problem, the team introduced a rich set of human-like reasoning actions that allow thorough exploration of the solution space across many different reasoning tasks.

For the second problem, they designed a reward function specifically for SLMs that evaluates intermediate steps, which avoids relying on the SLMs' often unreliable self-evaluation.

In addition, the team used another SLM as a discriminator to augment the Monte Carlo Tree Search (MCTS) process, mutually verifying the correctness of each trajectory with the discriminator SLM.

Self-generating reasoning trajectories with MCTS rollouts

A rich set of human-like reasoning actions. The core of MCTS generation lies in the action space, which defines the scope of tree exploration. Most MCTS-based methods use a single action type when building the tree. For example, the action in RAP is to ask the next sub-question, while the action in AlphaMath and MindStar is to generate the next reasoning step. However, relying on a single action type can easily lead to poor exploration of the solution space.

To solve this problem, the team looked at how humans reason. Different people solve problems in different ways: some break the problem into sub-problems, others solve it directly, and still others rephrase the problem from another perspective. People also adjust their approach according to the current state and choose different actions as needed.

Inspired by the human reasoning process, the team built a richer action set containing five types of actions to maximize the potential of SLMs to solve complex reasoning problems correctly.

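Action 1: Propose a one-step thought. For a given problem, this action has the LLM generate the next single step of thought based on the existing reasoning steps.

Action 2: Propose the remaining thought steps. Like standard CoT, this action enables "fast thinking" for simple problems that need only a few steps. Given the reasoning steps generated so far, it has the LLM directly generate the remaining steps until it reaches the final answer.

Action 3: Propose the next sub-question along with its answer.

Action 4: Answer the sub-question again. Because Action 3 may fail to answer the corresponding sub-question correctly, this action answers it again.

Action 5: Rephrase the question/sub-question. This new action restates the problem in a simpler way. Specifically, it has the LLM clearly list all the conditions in the problem statement.

These five actions define a highly diverse action space {A1, A2, A3, A4, A5}.

At each step i, MCTS selects an action a_i from this space. Then, based on the current state (that is, the previously generated trajectory x ⊕ s_1 ⊕ s_2 ⊕ ... ⊕ s_{i−1}), it uses action a_i to have the LLM generate the next reasoning step s_i. Note that some actions must be executed in a particular order. Figure 3 gives an example.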
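The action space and its ordering constraint can be sketched in a few lines of Python. This is only an illustration: the `Action` enum and `valid_actions` helper are hypothetical names, and the two ordering rules encoded here (A4 is only offered right after A3, A5 only at the root question) are assumptions made for the example, not rules quoted from this article.

```python
from enum import Enum
from typing import List, Optional


class Action(Enum):
    """The five action types; the descriptions paraphrase the article."""
    A1 = "propose a one-step thought"
    A2 = "propose the remaining thought steps"
    A3 = "propose the next sub-question with its answer"
    A4 = "answer the sub-question again"
    A5 = "rephrase the question/sub-question"


def valid_actions(previous: Optional[Action]) -> List[Action]:
    """Return the actions allowed after `previous` (None means the root question).

    Assumed ordering rules, for illustration only:
    - A4 re-answers a sub-question, so it needs an A3 step right before it.
    - A5 rephrases the question, so it is offered only at the root.
    """
    allowed = [Action.A1, Action.A2, Action.A3]
    if previous is Action.A3:
        allowed.append(Action.A4)
    if previous is None:
        allowed.append(Action.A5)
    return allowed


if __name__ == "__main__":
    for prev in (None, Action.A1, Action.A3):
        print(prev, "->", [a.name for a in valid_actions(prev)])
```

Restricting the candidate list per state is one simple way to express "some actions must be executed in order" without changing the search algorithm itself.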

As shown in Table 1, every action plays an important role in improving the final reasoning accuracy.

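Reward function

Another key component of MCTS is the reward function, which evaluates the value of each action and guides the expansion of the tree. The team designed a simple yet effective reward function for SLMs. Their approach is inspired by AlphaGo: each intermediate node is scored by its contribution to the final correct answer. As a result, actions that frequently lead to correct answers receive higher rewards and are more likely to be selected in future MCTS tree expansions.

Here, the reward value of the node s generated by taking action a is defined as Q(s, a). Initially, all unexplored nodes are assigned Q(s_i, a_i) = 0, which results in random tree expansion. On reaching the first terminal node n_d, a reward score Q(s_d, a_d) is computed according to whether it reaches the correct answer.

This score is then back-propagated to every intermediate node along the trajectory t = x ⊕ s_1 ⊕ s_2 ⊕ ... ⊕ s_d. Specifically, for each s_i, its Q value is updated as Q(s_i, a_i) = Q(s_i, a_i) + Q(s_d, a_d). To compute Q(s_d, a_d) for a terminal node, the reward used is the likelihood (confidence) of self-consistency majority voting.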
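A minimal sketch of this reward scheme follows, assuming the terminal reward is the share of sampled answers that agree with the majority answer. The `Node` class, the `visits` counter, and both function names are hypothetical; only the update Q(s_i, a_i) = Q(s_i, a_i) + Q(s_d, a_d) comes from the text.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    """One reasoning step in the search tree (hypothetical structure)."""
    step: str
    q: float = 0.0   # accumulated reward Q(s_i, a_i), starts at 0
    visits: int = 0  # visit count, an assumption needed later for UCT


def majority_confidence(answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def backpropagate(path: List[Node], terminal_reward: float) -> None:
    """Add the terminal reward to every node on the trajectory."""
    for node in path:
        node.q += terminal_reward
        node.visits += 1


if __name__ == "__main__":
    answer, confidence = majority_confidence(["18", "18", "20", "18"])
    path = [Node("s1"), Node("s2")]
    backpropagate(path, confidence)
    print(answer, confidence, [(n.step, n.q, n.visits) for n in path])
```

Because rewards accumulate, steps that sit on many successful trajectories end up with larger Q values than steps that rarely lead to a confident answer.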

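Generating solutions with MCTS rollouts

The following describes how MCTS generates candidate reasoning trajectories. Starting from the initial root node s_0, it performs multiple searches consisting of selection, expansion, simulation, and back-propagation. Specifically, simulation uses the default rollout policy. To obtain more accurate reward estimates, the team performs multiple rollouts. To balance exploration and exploitation, they use the well-known UCT (Upper Confidence bounds applied to Trees) to select each node. The selection is expressed mathematically as:

UCT(s, a) = Q(s, a) / N(s, a) + c · sqrt(ln N_parent(s) / N(s, a))

where N(s, a) is the number of times node s has been visited in previous iterations, and N_parent(s) is the number of visits to the parent node of s. Q(s, a) is the estimated reward value, which is updated during back-propagation. c is a constant that balances exploration and exploitation.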
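The selection rule can be sketched as follows, using the standard UCT form (average reward plus an exploration bonus) with the quantities defined above. The function names, the dictionary layout of a child node, the default c = 1.4, and the choice to return infinity for unvisited children are all assumptions made for this example.

```python
import math
from typing import Dict, List


def uct(q: float, n: int, n_parent: int, c: float = 1.4) -> float:
    """Standard UCT score: Q/N plus c * sqrt(ln(N_parent) / N)."""
    if n == 0:
        return math.inf  # assumption: unvisited children are tried first
    return q / n + c * math.sqrt(math.log(n_parent) / n)


def select_child(children: List[Dict], n_parent: int, c: float = 1.4) -> Dict:
    """Pick the child with the highest UCT score."""
    return max(children, key=lambda ch: uct(ch["q"], ch["n"], n_parent, c))


if __name__ == "__main__":
    children = [
        {"name": "a", "q": 3.6, "n": 4},  # well explored, good average reward
        {"name": "b", "q": 0.5, "n": 1},  # barely explored
    ]
    print(select_child(children, n_parent=5, c=0.0)["name"])
    print(select_child(children, n_parent=5, c=1.4)["name"])
```

With c = 0 the rule is purely greedy and picks the child with the best average reward; a larger c shifts the choice toward rarely visited children.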

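Once the search reaches a terminal node (either a terminal state or the predefined maximum tree depth d), it yields a trajectory from the root to that terminal node. All trajectories obtained from the rollout iterations are collected as candidate solutions. They then need to be verified.

Selecting reasoning trajectories with mutual consistency

Based on all the collected trajectories, the team proposes using reasoning mutual consistency to select the answer.

Achieving reasoning mutual consistency with a discriminator SLM

As shown in Figure 2, in addition to the target SLM, the team introduces a discriminator SLM whose role is to provide external unsupervised feedback on each candidate trajectory.

Specifically, for t = x ⊕ s_1 ⊕ s_2 ⊕ ... ⊕ s_d, the reasoning steps starting from a randomly sampled step i are masked. The preceding partial trajectory x ⊕ s_1 ⊕ s_2 ⊕ ... ⊕ s_{i-1} is then given to the discriminator SLM as a prompt, and the discriminator is asked to complete the remaining steps. Because the first i-1 reasoning steps are provided as hints, the task is easier, and the discriminator SLM is more likely to give the correct answer.

Figure 4 illustrates the comparison of whether the answer completed by the discriminator SLM matches the original trajectory t. If the two agree, t is considered a verified trajectory that is eligible for final selection.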
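The check can be sketched as below. Here `complete` is a hypothetical callable standing in for the discriminator SLM: it takes a prompt and returns only the final answer it reaches. Sampling the split point uniformly over all steps and comparing answers by exact string match are assumptions for the example.

```python
import random
from typing import Callable, List, Optional


def mutually_consistent(
    question: str,
    steps: List[str],
    answer: str,
    complete: Callable[[str], str],
    rng: Optional[random.Random] = None,
) -> bool:
    """Mask steps i..d, let the discriminator finish, and compare final answers."""
    if not steps:
        raise ValueError("trajectory must contain at least one step")
    rng = rng or random.Random()
    i = rng.randrange(1, len(steps) + 1)  # 1-based index of the first masked step
    prefix = steps[: i - 1]               # s_1 .. s_{i-1} stay visible as hints
    prompt = "\n".join([question] + prefix)
    return complete(prompt).strip() == answer.strip()


if __name__ == "__main__":
    # Stub discriminator that always answers "18", for demonstration only.
    stub = lambda prompt: "18"
    steps = ["16 - 3 - 4 = 9 eggs left", "9 * 2 = 18 dollars"]
    print(mutually_consistent("How much does she earn?", steps, "18", stub))
```

In practice the callable would wrap a second model; keeping it as a plain function makes the agreement test easy to exercise with a stub.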

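Final trajectory selection by the target SLM. After reasoning mutual consistency has been applied to all candidate trajectories, the process returns to the target SLM, which selects the final trajectory from the verified ones. To compute the final score of each trajectory, the team multiplies its reward by the confidence score of its terminal node obtained from the rollouts. The trajectory with the highest final score is chosen as the solution.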
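The final selection step amounts to a filter followed by an argmax. The sketch below assumes each candidate already carries a reward, a terminal-node confidence, and a verification flag; the `Candidate` class, its field names, and the example numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A candidate trajectory summary (hypothetical structure)."""
    answer: str
    reward: float      # reward of the trajectory
    confidence: float  # confidence of its terminal node from the rollouts
    verified: bool     # True if the discriminator agreed with it


def select_final(candidates: List[Candidate]) -> Optional[Candidate]:
    """Keep verified trajectories and return the one with the highest reward * confidence."""
    verified = [c for c in candidates if c.verified]
    if not verified:
        return None  # assumption: nothing is returned when no trajectory is verified
    return max(verified, key=lambda c: c.reward * c.confidence)


if __name__ == "__main__":
    candidates = [
        Candidate("18", reward=0.9, confidence=0.5, verified=True),
        Candidate("20", reward=0.6, confidence=0.9, verified=True),
        Candidate("21", reward=1.0, confidence=1.0, verified=False),
    ]
    best = select_final(candidates)
    print(best.answer if best else "no verified trajectory")
```

How to fall back when no trajectory is verified is not described in the text, so the sketch simply returns `None` in that case.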

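Experiments

Experimental setup

rStar is applicable to a variety of LLMs and reasoning tasks. The team evaluated five SLMs: Phi3-mini, LLaMA2-7B, Mistral-7B, LLaMA3-8B, and LLaMA3-8B-Instruct.

Five reasoning tasks were tested: four math tasks (GSM8K, GSM-Hard, MATH, SVAMP) and one commonsense task (StrategyQA).

See the original paper for experimental details.

Main results

The team first evaluated the effectiveness of rStar on general reasoning benchmarks. Table 2 compares the accuracy of rStar and other state-of-the-art methods across different SLMs and reasoning datasets. To demonstrate the effect of the new generator, the team also reports the accuracy of rStar (generator @maj), that is, the accuracy obtained without the discriminator, using only majority voting to verify answers.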

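The team highlights three key results:

1. SLMs powered by rStar are stronger problem solvers. For example, on the GSM8K dataset, LLaMA2-7B with few-shot CoT reaches an accuracy of only 12.51%. With rStar, its accuracy rises to 63.91%, which is close to the accuracy obtained with fine-tuning, as shown in Figure 1. Similarly, Mistral with rStar even outperforms the fine-tuned MetaMath by 4.18%. These gains suggest that SLMs already have strong reasoning capabilities but need guidance to generate and select correct solutions.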

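2. rStar consistently raises the reasoning accuracy of the evaluated SLMs on different tasks to state-of-the-art levels. In contrast, none of the other methods compared performs consistently well on all four benchmarks. For example, although SC (self-consistency) excels at the three math tasks, it cannot effectively solve the logical reasoning task of StrategyQA.

3. Even without the newly proposed discriminator for verifying reasoning trajectories, the newly proposed MCTS generator still works well at improving the reasoning accuracy of SLMs. For example, on the GSM8K dataset, the accuracy of rStar (generator @maj) is 2.88% to 16.39% higher than RAP, 10.60% to 38.37% higher than ToT, and 1.69% to 7.34% higher than SC.

Results on challenging math datasets

The team also evaluated rStar on more challenging math datasets, choosing GSM-Hard and MATH. Following common practice in similar studies, they used MATH-500, a subset of representative problems from the MATH dataset, in order to speed up evaluation. As shown in Tables 2 and 3, rStar significantly improves the reasoning accuracy of SLMs on these challenging math datasets.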

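Ablation studies

Effectiveness of different numbers of rollouts

rStar uses a rollout policy to perform MCTS tree expansion. More rollouts generate more candidate solution trajectories but also raise the inference cost. Figure 5 compares the accuracy of SC, RAP, and rStar on GSM8K with different numbers of rollouts.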

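Two key observations emerge:

1. Even with only 2 rollouts, rStar substantially improves the reasoning accuracy of SLMs, which demonstrates its effectiveness;

2. More rollouts benefit both rStar and SC, whereas RAP tends to saturate or even decline after 4 rollouts. One reason is that RAP's single-type action space limits the effectiveness of MCTS exploration.

Effectiveness of the MCTS generator

The team compared the MCTS generator with three other generators. As shown in Table 4, the newly proposed MCTS generator outperforms the others across the board. The effectiveness of the reward function tailored to SLMs is also demonstrated, since switching to self-evaluation lowers the accuracy of the new generator.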

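Effectiveness of the discriminator

The team set up two evaluation experiments.

The first experiment compares the discriminator method with majority voting and self-verification. The results are shown in Table 5 (left), where the advantage of the discriminator method is very pronounced.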

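The second experiment studies the effect of different discriminator models. The results are shown in Table 5 (right): choosing a different discriminator model generally does not affect how well the reasoning mutual consistency method verifies answers. Notably, even with the powerful GPT-4 as the discriminator, performance improves only slightly (from 91.13% to 92.57%). This shows that the reasoning mutual consistency method can effectively use SLMs to verify answers.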
