Can two small models verify each other and directly rival a large model? Microsoft's rStar does not even use CoT or fine-tuning

By checking each other's work, even small models can solve big problems.

As is well known, LLMs are powerful, but their ability to carry out complex reasoning is still limited.

For example, on the GSM8K dataset, Mistral-7B reaches only 36.5% accuracy even with techniques such as Chain of Thought (CoT). Fine-tuning can indeed improve reasoning ability effectively, but most LLMs rely on fine-tuning data that was distilled from, or possibly even synthesized by, more powerful models such as GPT-4.

At the same time, researchers are actively developing a complementary but more difficult approach: improving reasoning ability without a better teacher LLM.

One promising paradigm for improving reasoning ability without a better model is to exploit the knowledge inside the LLM itself. For example, a method called RAP adopts a self-exploration approach that iteratively improves the LLM's reasoning performance through self-rewarded feedback. Unfortunately, research shows that this paradigm has two fundamental problems.

First, LLMs often struggle to explore the solution space effectively during reasoning. Because of poor reasoning steps, the self-exploration approach often remains stuck in a low-quality region of the solution space, even after many attempts.

Second, even when self-exploration does find high-quality reasoning steps, it is difficult for a small language model (SLM) to tell which reasoning steps are of higher quality and to determine whether the final answer is correct. This makes it hard to guide self-exploration effectively. Studies show that self-exploration guided by basic reward signals yields results no better than random guessing.

More troubling still, SLMs are more susceptible to both problems because they are less capable. For example, GPT-4 can improve its outputs through self-refinement, but this is difficult for SLMs and can even lower the quality of their outputs. This seriously hinders the popularization and application of neural language models.

To address these problems, a research team from Microsoft Research Asia and Harvard University proposed Self-play muTuAl Reasoning, or rStar for short. Put simply, the method is like asking two mediocre students to check each other's exam answers and thereby improve their scores to the point where they can compete with top students. The team claims that rStar improves the reasoning capabilities of SLMs without requiring fine-tuning or better models.

Paper title: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
Paper: https://arxiv.org/pdf/2408.06195
Code: https://github.com/zhentingqi/rStar (to be released)

Method

To solve the problems above, rStar splits the reasoning process into two parts, solution generation and mutual verification, as shown in Figure 2.

For the first problem, the team introduces a rich set of human-like reasoning actions that allow thorough exploration of the solution space across many different reasoning tasks.

For the second problem, they designed a reward function specifically for SLMs. It evaluates intermediate steps and thus avoids relying on the SLM's often unreliable self-evaluation.

In addition, the team uses a second SLM as a discriminator to augment the MCTS process: the correctness of each trajectory is verified jointly with the discriminator SLM.

Using MCTS rollouts to self-generate reasoning trajectories

A rich set of human-like reasoning actions. The core of MCTS generation is the action space, which defines the scope of tree exploration. Most MCTS-based methods use a single action type when building the tree. For example, the action in RAP is to ask the next sub-question, while the action in AlphaMath and MindStar is to generate the next reasoning step. Relying on a single action type, however, can easily lead to poor exploration of the solution space.

To solve this problem, the team looked at how humans reason. Different people solve problems in different ways: some break the problem into sub-problems, some solve it directly, and others rephrase it from a different perspective. People also adapt their approach to the current state and choose different actions as needed.

Inspired by the human reasoning process, the team built a richer set of 5 action types to maximize the potential of SLMs to solve complex reasoning problems correctly.

Action 1: Propose a one-step thought. For a given problem, this action prompts the LLM to generate the next step of thinking based on the existing reasoning steps.

Action 2: Propose the remaining steps. Like standard CoT, this action enables "fast thinking" to solve simple problems in just a few steps. Given the reasoning steps generated so far, it lets the LLM directly generate the remaining steps until the final answer is reached.

Action 3: Propose the next sub-question and its answer.

Action 4: Answer the sub-question again. Since Action 3 may not answer its sub-question correctly, this action answers it a second time.

Action 5: Rephrase the question/sub-question. This new action restates the problem in a simpler way. Specifically, it has the LLM clearly list all the conditions given in the problem statement.

The above five actions define a highly diverse action space {A1, A2, A3, A4, A5}.

At each step i, MCTS selects an action a_i from this space. The LLM is then prompted with a_i to generate the next reasoning step s_i from the current state, i.e. the previously generated trajectory x ⊕ s_1 ⊕ s_2 ⊕ ... ⊕ s_{i−1}. Note that some actions depend on others and must be performed in order; for example, Action 4 is only meaningful after Action 3 has proposed a sub-question. Figure 3 gives an example.
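
The action space and its ordering constraints can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the `Action` enum, the `allowed_actions` helper, and the exact ordering rules are assumptions. The rule that Action 4 follows Action 3 comes from the description above; the rule that Action 5 is offered only at the root question is an assumed simplification.

```python
from enum import Enum
from typing import List, Optional


class Action(Enum):
    A1 = "propose a one-step thought"
    A2 = "propose the remaining steps"
    A3 = "propose the next sub-question and its answer"
    A4 = "answer the sub-question again"
    A5 = "rephrase the question/sub-question"


def allowed_actions(previous: Optional[Action]) -> List[Action]:
    """Return the actions that may follow `previous` (None means the root question).

    Assumed ordering rules (illustrative only):
      - A4 re-answers a sub-question, so it is offered only right after A3.
      - A5 is offered only at the root, before any step has been taken.
    """
    allowed = [Action.A1, Action.A2, Action.A3]
    if previous is Action.A3:
        allowed.append(Action.A4)
    if previous is None:
        allowed.append(Action.A5)
    return allowed
```

With these assumed rules, `allowed_actions(None)` offers A1, A2, A3, and A5, while `allowed_actions(Action.A3)` offers A1 through A4.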

As shown in Table 1, each action plays an important role in improving the final reasoning accuracy.

Another key component of MCTS is the reward function, which evaluates the value of each action and guides the expansion of the tree. For SLMs, the team designed a simple yet effective reward function. Their approach, inspired by AlphaGo, scores each intermediate node by its contribution to the final correct answer. Actions that frequently lead to correct answers therefore receive higher rewards and are more likely to be chosen in later MCTS tree expansions.

Here, the reward value of the node s generated by executing action a is defined as Q(s, a). Initially, all unexplored nodes are assigned Q(s_i, a_i) = 0, which results in random tree expansion. When the first end node n_d is reached, a reward score Q(s_d, a_d) is computed based on whether it arrives at the correct answer.

This score is then backpropagated to each intermediate node along the trajectory t = x ⊕ s_1 ⊕ s_2 ⊕ ... ⊕ s_d. Specifically, for each s_i, the Q value is updated as Q(s_i, a_i) = Q(s_i, a_i) + Q(s_d, a_d). To compute Q(s_d, a_d) for an end node, the reward value used is the likelihood (confidence) of the self-consistency majority vote.
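
The reward update described above can be sketched as follows. This is a minimal sketch under stated assumptions: the `Node` class, the helper names, and the visit counter are hypothetical, and the terminal reward is read here as the share of sampled answers that agree with the majority answer, which is one plausible interpretation of "confidence of the majority vote".

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    step: str
    q: float = 0.0   # Q(s, a), starts at 0 for unexplored nodes
    visits: int = 0  # N(s, a); tracking visits here is an assumption


def majority_confidence(answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def backpropagate(trajectory: List[Node], terminal_reward: float) -> None:
    """Apply Q(s_i, a_i) = Q(s_i, a_i) + Q(s_d, a_d) to every node on the path."""
    for node in trajectory:
        node.q += terminal_reward
        node.visits += 1


# Usage: four sampled answers, three of which agree.
answer, reward = majority_confidence(["18", "18", "20", "18"])
path = [Node("s_1"), Node("s_2"), Node("s_d")]
backpropagate(path, reward)
```

Because the end node starts at Q = 0, adding the terminal reward to every node on the path also sets Q(s_d, a_d) itself on the first visit.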

Using MCTS rollouts to generate solutions

The following describes how MCTS generates candidate reasoning trajectories. Starting from the initial root node s_0, the search proceeds through selection, expansion, simulation, and backpropagation. The simulation uses the default rollout policy, and the team performs multiple rollouts to obtain a more accurate reward estimate. To balance exploration and exploitation, they use the well-known UCT (Upper Confidence bounds applied to Trees) to select each node. The mathematical form of this selection process is:

UCT(s, a) = Q(s, a) / N(s, a) + c · sqrt( ln N_parent(s) / N(s, a) )

where N(s, a) is the number of times node s was visited in previous iterations, and N_parent(s) is the number of visits to the parent node of s. Q(s, a) is the estimated reward value, which is updated during backpropagation. c is a constant that balances exploration and exploitation.
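
The selection rule can be sketched in a few lines of Python. This is a minimal sketch assuming the standard UCT form Q(s, a)/N(s, a) + c·sqrt(ln N_parent(s)/N(s, a)), which is consistent with the symbols defined above. The function names, the (Q, N) tuple layout, the default c = 1.4, and the rule that unvisited actions are tried first are assumptions for illustration, not details from the article.

```python
import math
from typing import Dict, Tuple


def uct(q: float, n: int, n_parent: int, c: float = 1.4) -> float:
    """UCT score: average reward plus an exploration bonus."""
    if n == 0:
        # Assumed convention: unvisited children are always explored first.
        return math.inf
    return q / n + c * math.sqrt(math.log(n_parent) / n)


def select_action(stats: Dict[str, Tuple[float, int]], n_parent: int, c: float = 1.4) -> str:
    """Pick the child action with the highest UCT score.

    `stats` maps an action name to its (Q, N) pair.
    """
    return max(stats, key=lambda a: uct(stats[a][0], stats[a][1], n_parent, c))


# Usage: A1 has the same average reward as A2 but has been visited more often,
# so the exploration bonus favours the less-visited A2.
choice = select_action({"A1": (2.0, 4), "A2": (0.5, 1)}, n_parent=5)
```

A larger c shifts the choice toward rarely visited actions, while c = 0 reduces the rule to picking the highest average reward.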

Once the search reaches an end node (either a terminal state or a node at the predefined maximum tree depth d), a trajectory from the root to that end node is obtained. All trajectories obtained across rollout iterations are collected as candidate solutions, which then need to be verified.

Selecting reasoning trajectories by inference coherence (mutual consistency)

Based on all collected trajectories, the team proposes using inference coherence, that is, agreement between two models on the same reasoning, to select the answer.

Achieving inference coherence with a discriminator SLM

As shown in Figure 2, in addition to the target SLM, the team introduces a discriminator SLM whose role is to provide external unsupervised feedback on each candidate trajectory.

Specifically, for t = x ⊕ s_1 ⊕ s_2 ⊕ ... ⊕ s_d, the reasoning steps from some randomly sampled step i onward are masked. The preceding partial trajectory x ⊕ s_1 ⊕ s_2 ⊕ ... ⊕ s_{i-1} is then given to the discriminator SLM as a prompt, and the discriminator completes the remaining steps. Because the first i-1 reasoning steps serve as hints, the task is easier and the discriminator SLM is more likely to give the correct answer.

As shown in Figure 4, the answer completed by the discriminator SLM is compared with that of the original trajectory t. If the two agree, t is considered a verified trajectory that is eligible for final selection.
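
The masking-and-completion check can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Discriminator` callable interface, the function name, and the use of exact string matching to compare answers are all assumptions.

```python
import random
from typing import Callable, List, Optional

# Hypothetical interface: the discriminator receives the question plus a prefix
# of reasoning steps and returns its own final answer as a string.
Discriminator = Callable[[str, List[str]], str]


def is_mutually_consistent(
    question: str,
    steps: List[str],
    final_answer: str,
    discriminator: Discriminator,
    rng: Optional[random.Random] = None,
) -> bool:
    """Mask steps s_i..s_d, let the discriminator complete, and compare answers."""
    rng = rng or random.Random()
    # Sample the first masked step i (1-based); requires at least one step.
    i = rng.randint(1, len(steps))
    prefix = steps[: i - 1]  # s_1 .. s_{i-1} stay visible as hints
    completed_answer = discriminator(question, prefix)
    # Assumed comparison: exact match after trimming whitespace.
    return completed_answer.strip() == final_answer.strip()


# Usage with a toy discriminator that always answers "18".
def toy_discriminator(question: str, prefix: List[str]) -> str:
    return "18"


ok = is_mutually_consistent(
    "toy question", ["s_1", "s_2", "s_3"], "18", toy_discriminator, random.Random(0)
)
```

In practice the discriminator would be a second SLM called through whatever inference API is available; the callable here only stands in for that call.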

The final trajectory is selected by the target SLM. After inference coherence has been applied to all candidate trajectories, control returns to the target SLM, which selects the final trajectory from among the verified ones. To compute the final score of each trajectory, the team multiplies its reward by the confidence score of its end node obtained from the rollouts. The trajectory with the highest final score is selected as the solution.
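
The final selection step amounts to a filtered argmax, sketched below. The `Candidate` record, its field names, and the choice to return `None` when no trajectory was verified are assumptions for illustration; the article does not say how that case is handled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    answer: str
    reward: float      # reward of the trajectory
    confidence: float  # confidence score of its end node from the rollouts
    verified: bool     # True if the discriminator's completion agreed


def select_final(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the verified candidate with the highest reward * confidence."""
    verified = [c for c in candidates if c.verified]
    if not verified:
        return None  # assumed fallback; not specified in the article
    return max(verified, key=lambda c: c.reward * c.confidence)


# Usage: the unverified candidate is ignored even though its score is highest.
best = select_final([
    Candidate("18", reward=0.9, confidence=0.5, verified=True),   # score 0.45
    Candidate("20", reward=0.6, confidence=0.9, verified=True),   # score 0.54
    Candidate("7", reward=1.0, confidence=1.0, verified=False),
])
```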

Experiments

Experimental setup

rStar is applicable to a variety of LLMs and reasoning tasks. The team evaluated 5 SLMs: Phi3-mini, LLaMA2-7B, Mistral-7B, LLaMA3-8B, LLaMA3-8B-Instruct.

Five reasoning tasks were tested: 4 mathematical tasks (GSM8K, GSM-Hard, MATH, SVAMP) and 1 commonsense task (StrategyQA).

Please visit the original paper for experimental details.

Key Results

The team first evaluated the effectiveness of rStar on general reasoning benchmarks. Table 2 compares the accuracy of rStar and other state-of-the-art methods across different SLMs and reasoning datasets. To demonstrate the effectiveness of the new generator, the team also reports the accuracy of rStar (generator @maj), which uses no discriminator and verifies answers by majority voting alone.

The team pointed out three key results:

1. SLMs powered by rStar are better problem solvers. For example, on the GSM8K dataset, the accuracy of LLaMA2-7B with few-shot CoT is only 12.51%. With rStar, its accuracy rises to 63.91%, which is close to the accuracy obtained with fine-tuning, as shown in Figure 1. Similarly, Mistral with rStar even outperforms the fine-tuned version of MetaMath by 4.18%. Such improvements show that SLMs already have strong reasoning capabilities but need guidance to generate and select correct answers.

2. rStar consistently raises the reasoning accuracy of the evaluated SLMs to state-of-the-art levels across different tasks. By comparison, the baseline methods fail to achieve consistently good performance on all four benchmarks. For example, although SC (self-consistency) does well on the three math tasks, it is not effective on StrategyQA's logical reasoning task.

3. Even without the newly proposed discriminator for verifying inference trajectories, the newly proposed MCTS generator still works well in improving the inference accuracy of SLM. For example, on the GSM8K data set, the accuracy of rStar (generator @maj) is 2.88%-16.39% higher than RAP, 10.60%-38.37% higher than ToT, and 1.69%-7.34% higher than SC.

Results on a difficult math data set

The team also evaluated rStar on a more difficult math data set. For this they chose GSM-Hard and MATH datasets. Following the convention of similar studies, they used MATH-500, a subset of representative problems from the MATH dataset. This is done to improve evaluation speed. As shown in Tables 2 and 3, rStar is able to significantly improve the inference accuracy of SLM on these difficult mathematical datasets.

Ablation study

Effectiveness with different numbers of rollouts

rStar uses a rollout policy to perform MCTS tree expansion. More rollouts generate more candidate solution trajectories but also increase the cost of inference. Figure 5 compares the accuracy of SC, RAP, and rStar with different numbers of rollouts on GSM8K.

Two key observations are made here:

1. Even with only 2 Rollouts, rStar can significantly improve the inference accuracy of SLM, which shows its effectiveness;

2. More rollouts benefit both rStar and SC, while RAP tends to saturate or even decline after 4 rollouts. One reason is that RAP's single-type action space limits the effectiveness of MCTS exploration.

Effectiveness of MCTS generator

The team compared the MCTS generator with three other generators. As shown in Table 4, the newly proposed MCTS generator outperforms the other generators across the board. The results also demonstrate the effectiveness of the reward function tailored to SLMs: using self-evaluation instead reduces the accuracy of the new generator.

Effectiveness of the discriminator

The team set up two evaluation experiments.

The first experiment compares the discriminator-based method with majority voting and self-verification. The results, shown in Table 5 (left), indicate that the discriminator-based method has a clear advantage.

The second experiment studies the impact of different discriminator models. The results are shown in Table 5 (right). The choice of discriminator model generally does not affect how well the inference coherence method verifies answers. Notably, even with the powerful GPT-4 as the discriminator, performance improves only slightly (from 91.13% to 92.57%). This shows that the inference coherence method can effectively use SLMs to verify answers.
