Unlocking Mathematical Reasoning in Language Models
Based on Yannic Kilcher's video summary of the DeepSeek Math paper.
Introduction
- DeepSeek’s Rise: DeepSeek is a prominent name in AI, making this paper highly relevant despite its date.
- GRPO Highlight: Introduces Group Relative Policy Optimization (GRPO), a key component of DeepSeek’s R1 model.
- Paper Focus: Achieving state-of-the-art accuracy on math problem-solving benchmarks.
- Video Source: Summarizing a detailed video explanation and discussion.
Impressive Result: DeepSeek Math 7B
- Small Model, Big Performance: DeepSeek Math 7B parameter model achieves remarkable results.
- Rivals Larger Models: Approaches or matches the performance of much larger models like GPT-4 and Gemini Ultra on math benchmarks.
- Math-Specific Tuning: Acknowledges that GPT-4/Gemini Ultra are general-purpose models, yet the 7B model still excels in the math domain.
- Significant Improvement: Represents a major leap over both general and other math-focused models.
Two-Pronged Approach to Success
DeepSeek’s success is built on two key pillars:
- High-Quality, Large-Scale Data Collection
- Novel method to create a massive, relevant dataset for math.
- GRPO (Group Relative Policy Optimization)
- Simplified & efficient Reinforcement Learning algorithm (PPO variant).
1. Data Collection: DeepSeek Math Corpus
- DeepSeek Math Corpus: A dataset of 120 billion math-related tokens.
- Scale & Quality: Nearly an order of magnitude larger than existing math datasets.
- Source: Surprisingly, extracted from Common Crawl, demonstrating that high-quality math data is readily available on the open web.
- Iterative Pipeline: Systematic process for both scale and relevance.
Iterative Data Collection Process
Iteration 1: Seed & Classification
- Seed Corpus: Start with a small, relevant dataset (e.g., OpenWebMath).
- Often limited in size and diversity.
- fastText Classifier Training (sketch below):
- Train a fastText classifier to distinguish “math-like” content.
- Positive examples: Seed Corpus pages (500k).
- Negative examples: Random Common Crawl pages (500k).
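To make this step concrete, here is a minimal sketch of how such a filter could be trained with the open-source fastText library. The file name, label scheme, and hyperparameters are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the math-page classifier using the `fasttext` library.
# File names, labels, and hyperparameters are illustrative, not from the paper.
import fasttext

# fastText supervised training expects one example per line, e.g.
# "__label__math <page text>" or "__label__other <page text>".
model = fasttext.train_supervised(
    input="seed_vs_random_cc.txt",  # 500k seed pages + 500k random CC pages
    lr=0.1,
    epoch=3,
    wordNgrams=2,   # bigrams help capture phrases like "prove that"
    dim=100,        # small embedding dimension keeps the model fast
)

# Score an unseen Common Crawl page; the probability of the "math" label
# is what is later used to rank pages.
labels, probs = model.predict("Let x be a real number such that x^2 = 2 ...")
print(labels[0], probs[0])
```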
Iterative Data Collection Process (Cont.)
Iteration 2: Sifting & Domain Expansion
- Common Crawl Sifting:
- Classify a large, cleaned Common Crawl snapshot (~40B web pages).
- Rank pages by classifier score and keep top 40B tokens.
- Domain Expansion:
- Group URLs by domain.
- If more than 10% of a domain's URLs are classified as “math-like”, reconsider the entire domain (see the sketch after this list).
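A minimal sketch of the domain-expansion heuristic, assuming per-URL classifier decisions are already available. Only the 10% threshold comes from the summary above; the function name and inputs are illustrative.

```python
# Sketch of the domain-expansion heuristic: if more than 10% of a domain's
# crawled URLs were classified as math-like, flag the whole domain for review.
from collections import defaultdict
from urllib.parse import urlparse

def domains_to_reconsider(urls, is_math_like, threshold=0.10):
    total = defaultdict(int)
    math = defaultdict(int)
    for url, flag in zip(urls, is_math_like):
        domain = urlparse(url).netloc
        total[domain] += 1
        math[domain] += int(flag)
    return {d for d in total if math[d] / total[d] > threshold}

# Toy example: mathoverflow.net gets flagged, example.com does not.
urls = ["https://mathoverflow.net/q/1", "https://mathoverflow.net/q/2",
        "https://example.com/news"]
flags = [True, True, False]
print(domains_to_reconsider(urls, flags))
```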
Iterative Data Collection Process (Cont.)
Iteration 3: Manual Annotation & Refinement
- Manual Annotation:
- Human (or potentially LLM-assisted) annotation of URLs from reconsidered domains.
- Verifies and expands math relevance, increasing diversity.
- New Seed Corpus: Annotated data becomes the seed for the next iteration.
- Repeat Iterations: The process is repeated, broadening the scope and improving the classifier.
Iterative Data Collection Process (Cont.)
Iteration 4 & Convergence
- Process Stops: The process stops after 4 iterations; ~98% of the data had already been collected by iteration 3, indicating convergence.
- Final Dataset: 35.5M web pages, 120B tokens.
Validation:
- Trained a small model (1.3B) on DeepSeek Math Corpus vs. other math datasets.
- DeepSeek Corpus significantly outperformed on benchmarks.
- Key Features: Relevant, Large-scale, Multilingual.
2. Base Model: DeepSeek Math Base 7B
- Initialization: Initialized from DeepSeek-Coder-Base-v1.5 7B.
- Deliberate choice: code pre-training is crucial for math reasoning.
- arXiv pre-training alone is less effective.
- Training Data Mix:
- 56% DeepSeek Math Corpus, plus AlgebraicStack, arXiv, GitHub code, and natural-language Common Crawl data.
- Training Duration: 500 Billion Tokens.
- Benchmark Results: Outperforms other models (including much larger ones) on Math benchmarks with Chain of Thought and Tool Use.
Instruction Fine-tuning
- Further Enhancement: Instruction fine-tuning on top of base model.
- Fine-tuning Data: Annotated datasets (e.g., GSM8K, MATH) with:
- Tool-integrated solutions.
- Chain of Thought & Program of Thought examples (toy example after this list).
- English & Chinese datasets.
- Training Scale: ~500 steps with a small batch size (~100k effective data points).
- Performance Gain: Significant improvement, approaching closed-source models (GPT-4, Gemini Ultra).
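To illustrate the difference between these solution formats, here is a hypothetical “program of thought” / tool-integrated example in the spirit of the fine-tuning data described above: the model writes a short Python program instead of (or alongside) a natural-language chain of thought, and the program's output is taken as the answer. The problem and code are made up for illustration.

```python
# Hypothetical tool-integrated ("program of thought") solution.
# Problem: "A store sells pencils in packs of 12 for $3. How much do 60 pencils cost?"
#
# Chain of thought (natural language): 60 pencils = 5 packs, 5 * $3 = $15.
#
# Program of thought (executable reasoning):
packs_needed = 60 // 12      # 60 pencils at 12 per pack
cost = packs_needed * 3      # $3 per pack
print(cost)                  # -> 15, taken as the final answer
```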
3. Reinforcement Learning with GRPO
- Final Step: Reinforcement Learning (RL) on instruction-tuned model using GRPO.
- GRPO: Group Relative Policy Optimization - a simplified PPO variant.
- Eliminates need for a separate value model.
- RL Setup:
- Actor (Policy): DeepSeek Math Model (LLM).
- Environment: Math Problem Generator.
- Observation: Math Question.
- Action: Model’s Solution.
- Reward: Binary (correct / incorrect final answer); see the sketch below.
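A minimal sketch of what such a binary reward could look like, assuming answers are marked with a \boxed{...} convention. The helper names and the exact matching rule are assumptions; the paper's actual answer checker may be more elaborate.

```python
# Sketch of a binary reward: extract the model's final answer and compare it
# to the reference. The \boxed{...} convention is an assumption for illustration.
import re

def extract_final_answer(solution: str) -> str:
    """Take the last \\boxed{...} expression, a common convention for math answers."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else ""

def reward(solution: str, reference_answer: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if extract_final_answer(solution) == reference_answer.strip() else 0.0

print(reward(r"... so the result is \boxed{42}.", "42"))  # 1.0
print(reward(r"... so the result is \boxed{41}.", "42"))  # 0.0
```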
Simplified RL Explanation (Reinforce)
- Challenge: The reward is not differentiable, so it cannot be backpropagated through directly.
- Reinforce Algorithm (Simplified):
- Assumption: Treat the model's sampled action (solution) as if it were the correct target and compute the usual likelihood gradient.
- Loss Modulation: Modulate gradient update by the reward.
- High Reward: Reinforce the action.
- Low Reward: Less reinforcement.
- Advantage Function: In practice, use the advantage (reward minus a baseline) to normalize the update.
GRPO leverages the same principle to optimize the policy (the DeepSeek Math model) for higher math problem-solving reward; its baseline comes from a group of sampled solutions rather than a learned value model (see the sketch below).
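Below is a minimal PyTorch sketch of GRPO's central idea: sample a group of solutions per question and use the group's reward statistics as the baseline, instead of a learned value model. Shapes and names are illustrative, and the full algorithm also adds PPO-style clipping and a KL penalty against a reference policy, both omitted here.

```python
# Sketch of a group-relative policy-gradient step (GRPO's core idea).
# Clipping and the KL penalty from the full algorithm are intentionally omitted.
import torch

def grpo_loss(logprobs, rewards, eps=1e-6):
    """
    logprobs: (G,) summed log-probabilities of G sampled solutions for ONE question
    rewards:  (G,) binary rewards for those solutions
    """
    # Group-relative advantage: normalize rewards within the group,
    # so the group mean acts as the baseline (no value model needed).
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    # REINFORCE-style objective: reinforce above-average solutions, discourage the rest.
    return -(advantages.detach() * logprobs).mean()

# Toy example: 4 sampled solutions for one question, 2 of them correct.
logprobs = torch.tensor([-12.0, -15.0, -11.0, -14.0], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = grpo_loss(logprobs, rewards)
loss.backward()
print(loss.item(), logprobs.grad)
```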
Final RL Performance & Conclusion
- SOTA Performance: RL with GRPO pushes DeepSeek Math beyond all open-source models and very close to closed-source giants (GPT-4, Gemini).
- Outperforms Even Larger Models: Beats open-source models 10x its size, even math-focused ones.
- Consistent Performance: Strong results in both English and Chinese.
Conclusion:
- DeepSeek Math achieves exceptional math reasoning through innovative data collection and efficient RL (GRPO).
- Demonstrates the power of targeted data and algorithm optimization even for smaller models.
Q & A / Further Discussion
- Open for questions and further discussion.
- Remember to join the Saturday paper discussions on Discord!