OpenAI's o1 model, unveiled in September 2024, showcased "advanced reasoning" capabilities through large-scale reinforcement learning. DeepSeek, an AI research lab, has successfully replicated this behavior and openly published their methodology. This article explores the core concepts and underlying mechanisms of this breakthrough.
OpenAI's o1 model revolutionized Large Language Model (LLM) training by introducing "thinking" tokens. These special tokens act as a scratchpad, allowing the model to systematically work through problems and user queries. A key finding was that performance improves with increased test-time compute: the more tokens the model generates while reasoning, the better the response. A figure from OpenAI's blog illustrates this with two plots:
The left plot shows the established neural scaling laws, where longer training (train-time compute) improves performance. The right plot reveals a novel scaling law: increased token generation during inference (test-time compute) enhances performance.
Thinking Tokens
o1's "thinking" tokens demarcate the model's chain of thought (CoT) reasoning. Their importance is twofold: they clearly delineate the reasoning process for UI development and provide a human-readable record of the model's thought process. While OpenAI kept the training details confidential, DeepSeek's research sheds light on this.
DeepSeek's Research
DeepSeek's January 2025 publication, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" [2], unveiled the o1 model's secrets. They introduced DeepSeek-R1-Zero (trained solely on reinforcement learning) and DeepSeek-R1 (a blend of supervised fine-tuning (SFT) and RL). R1-Zero is crucial as it generated training data for R1 and demonstrated emergent reasoning abilities not explicitly programmed. R1-Zero discovered CoT and test-time compute scaling through RL alone.
DeepSeek-R1-Zero (RL Only)
Reinforcement learning (RL) lets a model learn through trial and error: each response receives a reward score, without requiring an explicit, differentiable relationship between that reward and the model's parameters (as a supervised loss would). Three key aspects of R1-Zero's training stand out:
Prompt Template: A minimal system prompt instructs the model to wrap its reasoning in <think></think> tags and its final answer in <answer></answer> tags:

```
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think></think> and <answer></answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: {prompt}. Assistant:
```
The minimal prompting avoids biasing responses and allows for natural evolution during RL.
Reward Signal: A rule-based system evaluates accuracy and formatting, avoiding potential "reward hacking" issues often associated with neural reward models.
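As an illustration, a rule-based reward of this kind could look like the following minimal sketch; the tag-checking regex and the exact-match accuracy check are assumptions chosen for illustration, not DeepSeek's actual rules:

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows the <think>...</think><answer>...</answer> structure, else 0.0."""
    pattern = r"^<think>.*</think>\s*<answer>.*</answer>$"
    return 1.0 if re.match(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the text inside the <answer> tags matches the known ground truth (e.g., a math result)."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Sum of two deterministic rules; no neural reward model is involved.
    return accuracy_reward(response, ground_truth) + format_reward(response)
```

Because both components are simple deterministic checks, there is no learned reward model for the policy to exploit.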
GRPO (Group Relative Policy Optimization): This RL approach samples a group of responses for each prompt and uses their relative rewards to estimate advantages, updating the model parameters with clipping and a KL-divergence penalty against a reference policy for stable training.
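The objective reported in the DeepSeek-R1 paper [2] has roughly the following form (a reconstruction, so the notation may differ slightly from the paper):

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O\mid q)}
    \Bigg[ \frac{1}{G} \sum_{i=1}^{G} \bigg(
      \min\Big( \tfrac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)} A_i,\;
                \mathrm{clip}\big( \tfrac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},
                                   1-\varepsilon,\, 1+\varepsilon \big) A_i \Big)
      - \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big)
    \bigg) \Bigg],
\qquad
A_i = \frac{r_i - \mathrm{mean}(\{r_1,\dots,r_G\})}{\mathrm{std}(\{r_1,\dots,r_G\})}
```

Here G is the group size, r_i the rule-based reward for response o_i, π_ref a frozen reference policy, and ε and β the clipping and KL-penalty coefficients. The group-relative advantage A_i replaces the learned value function used in PPO.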
R1-Zero Results (Emergent Abilities)
Remarkably, R1-Zero implicitly learned to improve responses through test-time compute and exhibited human-like internal monologues, often including verification steps. An example is provided in the original article.
DeepSeek-R1 (SFT + RL)
DeepSeek-R1 addresses R1-Zero's readability issues through a four-step training process combining SFT and RL:
SFT with reasoning data: Initial SFT uses thousands of long CoT examples to establish a reasoning framework.
R1-Zero-style RL (+ language consistency reward): RL training similar to R1-Zero, but with an added language consistency reward to reduce language mixing in the CoT (a small sketch of such a reward follows this list).
SFT with mixed data: SFT with both reasoning and non-reasoning data to broaden the model's capabilities.
RL + RLHF: Final RL training includes reasoning training and RLHF for improved helpfulness and harmlessness.
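As a sketch of step 2's language consistency term, which the paper describes as the proportion of target-language words in the CoT: the crude ASCII-based check below for "English-looking" tokens is purely an illustrative assumption, not the paper's method of language identification.

```python
import re

def language_consistency_reward(cot: str) -> float:
    """Fraction of whitespace-separated tokens in the CoT that look like English words.

    A crude stand-in for real language identification, used only to illustrate the
    idea of rewarding a chain of thought written in a single target language.
    """
    tokens = cot.split()
    if not tokens:
        return 0.0
    english_like = sum(1 for t in tokens if re.fullmatch(r"[A-Za-z0-9.,;:!?'()\-]+", t))
    return english_like / len(tokens)
```

This term would then be combined (e.g., summed) with the rule-based accuracy and format rewards from the R1-Zero setup.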
Accessing R1-Zero and R1
DeepSeek made the model weights publicly available, allowing access through various inference providers and local deployments (DeepSeek, Together, Hyperbolic, Ollama, Hugging Face).
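For example, one of the smaller distilled checkpoints can be run locally with the Hugging Face transformers library; the model ID and generation settings below are assumptions for illustration, so check the model card for the exact name and recommended usage:

```python
from transformers import pipeline

# A small distilled R1 checkpoint (model ID assumed; check the Hugging Face hub for exact names).
generator = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

prompt = "How many r's are in the word 'strawberry'? Think step by step."
output = generator(prompt, max_new_tokens=512, do_sample=False)

# The generated text typically contains a <think>...</think> block followed by the final answer.
print(output[0]["generated_text"])
```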
Conclusions
o1 introduced test-time compute as a new dimension for improving LLMs. DeepSeek's replication and open publication demonstrate that reinforcement learning alone can elicit reasoning behavior that was never explicitly demonstrated in the training data, pointing toward models that are not bound by the limits of existing human-written examples. This opens exciting possibilities for future scientific and technological advances.