
DeepSeek Math: A Detailed Summary

Unlocking Mathematical Reasoning in Language Models

Based on Yannic Kilcher’s video summary of the DeepSeek Math paper: Watch the video


Introduction

  • DeepSeek’s Rise: DeepSeek is a prominent name in AI, making this paper highly relevant despite its date.
  • GRPO Highlight: Introduces Group Relative Policy Optimization (GRPO), a key component of DeepSeek’s R1 model.
  • Paper Focus: Achieving state-of-the-art accuracy on math problem-solving benchmarks.
  • Video Source: This post summarizes a detailed video explanation and discussion.

Impressive Result: DeepSeek Math 7B

  • Small Model, Big Performance: DeepSeek Math 7B parameter model achieves remarkable results.
  • Outperforms Larger Models: Beats much larger open-source models and approaches GPT-4 and Gemini Ultra on math benchmarks.
  • Math-Specific Tuning: Acknowledges GPT-4/Gemini are general models, but 7B still excels in math domain.
  • Significant Improvement: Represents a major leap over both general and other math-focused models.

Two-Pronged Approach to Success

DeepSeek’s success is built on two key pillars:

  1. High-Quality, Large-Scale Data Collection
    • Novel method to create a massive, relevant dataset for math.
  2. GRPO (Group Relative Policy Optimization)
    • Simplified & efficient Reinforcement Learning algorithm (PPO variant).

1. Data Collection: DeepSeek Math Corpus

  • DeepSeek Math Corpus: 120 Billion Math Tokens dataset.
  • Scale & Quality: Order of magnitude larger than existing math datasets.
  • Source: Surprisingly, extracted from Common Crawl, showing that high-quality math data is readily available on the open web.
  • Iterative Pipeline: Systematic process for both scale and relevance.

Iterative Data Collection Process

Iteration 1: Seed & Classification

  1. Seed Corpus: Start with a small, relevant dataset (e.g., OpenWebMath).
    • Often limited in size and diversity.
  2. fastText Classifier Training:
    • Train a fastText classifier to distinguish “math-like” content from general web pages (see the sketch below).
    • Positive examples: Seed Corpus (500k)
    • Negative examples: Random Common Crawl pages (500k)
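
As a rough illustration, training such a classifier with the open-source fasttext Python package could look like the sketch below. The file name, label scheme, and hyperparameters are assumptions for illustration, not the paper’s exact setup.

```python
import fasttext

# fastText expects one example per line, prefixed with its label, e.g.
#   __label__math  Proof that the sum of two even numbers is even ...
#   __label__other Top ten travel destinations for the summer ...
# Positives would come from the seed corpus (e.g. OpenWebMath),
# negatives from randomly sampled Common Crawl pages.

model = fasttext.train_supervised(
    input="train_math_vs_other.txt",  # illustrative file name
    lr=0.1,
    epoch=3,
    wordNgrams=2,  # word bigrams help with short mathematical phrases
)

# Score an unseen page: returns the top label and its probability.
labels, probs = model.predict("Let x be a prime number greater than 2 ...")
print(labels[0], probs[0])  # e.g. ('__label__math', 0.97)

model.save_model("math_classifier.bin")
```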

Iterative Data Collection Process (Cont.)

Iteration 2: Sifting & Domain Expansion

  1. Common Crawl Sifting:
    • Classify a large, deduplicated Common Crawl dataset (~40B web pages).
    • Rank pages by classifier score and keep the top-scoring pages, up to 40B tokens.
  2. Domain Expansion:
    • Group URLs by domain.
    • If more than 10% of a domain’s URLs are classified as “math-like”, reconsider the entire domain (see the sketch below).
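
A rough sketch of these two steps in plain Python. The 10% threshold is the rule described above; the helper names and the page/URL data layout are assumptions made for illustration (score_page stands in for the trained classifier).

```python
from collections import defaultdict
from urllib.parse import urlparse

MATH_SHARE_THRESHOLD = 0.10  # the ">10% math-like" rule

def sift(pages, score_page, token_budget):
    """Rank Common Crawl pages by classifier score and keep the best ones
    until the token budget (e.g. 40B tokens) is exhausted."""
    ranked = sorted(pages, key=lambda p: score_page(p["text"]), reverse=True)
    kept, tokens = [], 0
    for page in ranked:
        if tokens >= token_budget:
            break
        kept.append(page)
        tokens += page["num_tokens"]
    return kept

def math_heavy_domains(all_pages, kept_urls):
    """Return domains where more than 10% of their URLs were kept as
    math-like; these domains get reconsidered in the next iteration."""
    per_domain = defaultdict(lambda: [0, 0])  # domain -> [kept, total]
    for page in all_pages:
        domain = urlparse(page["url"]).netloc
        per_domain[domain][1] += 1
        if page["url"] in kept_urls:
            per_domain[domain][0] += 1
    return [d for d, (kept, total) in per_domain.items()
            if kept / total > MATH_SHARE_THRESHOLD]
```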

Iterative Data Collection Process (Cont.)

Iteration 3: Manual Annotation & Refinement

  1. Manual Annotation:
    • Human (or potentially LLM-assisted) annotation of URLs from reconsidered domains.
    • Verifies and expands math relevance, increasing diversity.
  2. New Seed Corpus: Annotated data becomes the seed for the next iteration.

Repeat Iterations: The process is repeated, broadening the scope of the corpus and improving the classifier with each pass.


Iterative Data Collection Process (Cont.)

Iteration 4 & Convergence

  • Process Stops: After 4 iterations, ~98% of the data had already been collected by iteration 3, indicating convergence.
  • Final Dataset: 35.5M web pages, 120B tokens.

Validation:

  • Trained a small model (1.3B) on DeepSeek Math Corpus vs. other math datasets.
  • DeepSeek Corpus significantly outperformed on benchmarks.
  • Key Features: Relevant, Large-scale, Multilingual.

2. Base Model: DeepSeek Math Base 7B

  • Initialization: Initialized from DeepSeek-Coder-Base-v1.5 7B.
    • Deliberate choice: code pre-training proves crucial for math reasoning.
    • Pre-training on arXiv papers alone is less effective.
  • Training Data Mix:
    • 56% DeepSeek Math Corpus, plus AlgebraicStack, arXiv, GitHub code, and natural-language Common Crawl data.
  • Training Duration: 500 Billion Tokens.
  • Benchmark Results: Outperforms other models (including much larger ones) on Math benchmarks with Chain of Thought and Tool Use.

Instruction Fine-tuning

  • Further Enhancement: Instruction fine-tuning on top of base model.
  • Fine-tuning Data: Annotated datasets (GSM8K, MATH, and similar problem sets) with:
    • Tool-integrated solutions.
    • Chain of Thought & Program of Thought examples.
    • English & Chinese datasets.
  • Training Scale: roughly 500 training steps with a small batch size (on the order of 100k examples seen in total).
  • Performance Gain: Significant improvement, approaching closed-source models (GPT-4, Gemini Ultra).
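
To make the data description concrete, a single tool-integrated fine-tuning example might look roughly like the record below. The field names, wording, and formatting are purely illustrative, not DeepSeek’s actual schema; the question is in the style of GSM8K.

```python
example = {
    "question": "Natalia sold clips to 48 friends in April and half as many in May. "
                "How many clips did she sell in total?",
    # Chain of Thought: natural-language reasoning steps.
    "chain_of_thought": "In May she sold 48 / 2 = 24 clips, so in total 48 + 24 = 72.",
    # Program of Thought / tool-integrated reasoning: steps expressed as runnable code.
    "program_of_thought": (
        "april = 48\n"
        "may = april // 2\n"
        "print(april + may)  # 72"
    ),
    "answer": "72",
    "language": "en",  # the fine-tuning data covers both English and Chinese
}
```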

3. Reinforcement Learning with GRPO

  • Final Step: Reinforcement Learning (RL) on instruction-tuned model using GRPO.
  • GRPO: Group Relative Policy Optimization - a simplified PPO variant.
    • Eliminates need for a separate value model.
  • RL Setup:
    • Actor (Policy): DeepSeek Math Model (LLM).
    • Environment: Math Problem Generator.
    • Observation: Math Question.
    • Action: Model’s Solution.
    • Reward: Binary (Correct/Incorrect).
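
Because the reward is simply correct/incorrect, the reward signal can come from an answer checker rather than a learned reward model. A toy version with a deliberately naive answer-extraction rule (real pipelines use stricter answer formats):

```python
import re

def extract_final_answer(solution: str) -> str | None:
    """Toy extraction: take the last number that appears in the solution text."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return numbers[-1] if numbers else None

def reward(model_solution: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the model's final answer matches the reference, else 0.0."""
    predicted = extract_final_answer(model_solution)
    return 1.0 if predicted == gold_answer.strip() else 0.0

print(reward("In May she sold 24 clips, so the total is 48 + 24 = 72", "72"))  # 1.0
```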

Simplified RL Explanation (REINFORCE)

  • Challenge: Reward function not differentiable.
  • REINFORCE Algorithm (Simplified):
    • Assumption: Pretend the model’s sampled action (solution) is the correct target, as in supervised learning.
    • Loss Modulation: Scale the gradient update by the reward.
      • High Reward: Strongly reinforce the action.
      • Low Reward: Little or no reinforcement.
    • Advantage Function: In practice, use the advantage (reward − baseline) instead of the raw reward, so updates are centered around a baseline.
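
A few lines of PyTorch make this idea concrete. This is an illustration of plain REINFORCE with a baseline, not DeepSeek’s actual training code:

```python
import torch

def reinforce_loss(logprobs: torch.Tensor, reward: float, baseline: float) -> torch.Tensor:
    """REINFORCE with a baseline: treat the sampled solution as the target and
    scale its log-likelihood by the advantage (reward - baseline)."""
    advantage = reward - baseline          # > 0 reinforces, < 0 discourages
    return -(advantage * logprobs.sum())   # minimizing this raises the likelihood when advantage > 0

# Toy usage: per-token log-probs of one sampled solution, reward 1.0, baseline 0.5.
loss = reinforce_loss(torch.tensor([-0.3, -0.8, -0.1]), reward=1.0, baseline=0.5)
```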

GRPO leverages similar principles to optimize the policy (DeepSeek Math Model) for higher math problem-solving reward.
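
GRPO’s main twist is where the baseline comes from: it samples a group of solutions per question and normalizes each reward by the group’s mean and standard deviation, so no separate value model has to be trained. A simplified sketch of that advantage computation (omitting the clipped objective and KL regularization the full algorithm uses):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: each sampled solution's reward is normalized
    by the mean and std of its own group, which plays the role of the baseline."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """logprobs: summed log-probabilities of G sampled solutions to one question;
    rewards: their binary rewards. Both have shape (G,)."""
    return -(grpo_advantages(rewards) * logprobs).mean()

# Toy usage: 4 sampled solutions to the same question, two judged correct.
loss = grpo_loss(torch.tensor([-5.0, -7.0, -6.0, -4.0]),
                 torch.tensor([1.0, 0.0, 1.0, 0.0]))
```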


Final RL Performance & Conclusion

  • SOTA Performance: RL with GRPO pushes DeepSeek Math beyond all open-source models and very close to closed-source giants (GPT-4, Gemini).
  • Outperforms Even Larger Models: Beats open-source models 10x its size, including math-focused ones.
  • Consistent Performance: Strong results in both English and Chinese.

Conclusion:

  • DeepSeek Math achieves exceptional math reasoning through innovative data collection and efficient RL (GRPO).
  • Demonstrates the power of targeted data and algorithm optimization even for smaller models.

Q & A / Further Discussion

  • Open for questions and further discussion.
  • Remember to join the Saturday paper discussions on Discord!