
DeepSeek V3, R1-Zero, & R1: A Detailed Overview

Introduction: DeepSeek’s Breakthrough Models

Based on a video summary of the DeepSeek Math Paper by Vibhu Sapra at Latent Space: Watch the video

  • Presentation overview of DeepSeek’s latest language models: V3, R1-Zero, and R1.
  • Focus on architecture, training methods, and performance.
  • Key Shift: Moving beyond simple scaling to efficient reasoning capabilities through novel RL techniques.
  • Initially dismissed V3, but it’s crucial as the base model for R1-Zero & R1.
  • Presentation Outline:
    • Two main models: R1-Zero & R1
    • Inference Time Scaling & Test Time Compute
    • Emergent Reflection & Aha Moments

High Level Overview

  • Two models: R1-Zero and R1
    • R1-Zero is a great reasoning-only model, trained with pure RL and no labeled CoT data, but it’s not a good general model
    • R1 is created using outputs from R1-Zero and a 4-stage training method
  • They distil outputs from R1 into Qwen & Llama models
    • Not trained natively with RL, just SFT on distilled outputs, yet these models perform really well
    • They test their distillation vs. regular post-training and find their approach does much, much better
    • They release all of these as a series of models too
  • No talk about data or where it comes from lol

  • Performance is crazy good

  • Models are fully OS w/ MIT license - no training data or code

  • DeepSeek API is 3-10x faster & cheaper than other Infra (but data -> CCP)

DeepSeek V3: The Foundation

Standard LLM, Exceptional Efficiency

  • Base Model: Large Language Model serving as foundation for reasoning models.
  • GPT-4 Level Performance: Comparable capabilities for general language tasks.
  • Mixture of Experts (MoE):
    • 671 Billion Total Parameters
    • 37 Billion Active Parameters (Inference)
    • GPT-4-level performance at ~37B active-parameter inference cost.
    • Efficient batched inference at scale (minimal routing sketch below).
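
To make the total-vs-active distinction concrete, here is a minimal, hypothetical top-k routing sketch (not DeepSeek's actual DeepSeekMoE implementation, which uses fine-grained and shared experts); it just illustrates why only a fraction of the weights run per token:

```python
import torch
import torch.nn as nn

class ToyTopKMoE(nn.Module):
    """Toy MoE layer: every expert counts toward total parameters,
    but each token is routed through only k experts (the 'active' params)."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                   # x: [n_tokens, d_model]
        gate = self.router(x).softmax(dim=-1)               # routing probabilities
        weights, idx = gate.topk(self.k, dim=-1)            # keep top-k experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e                    # tokens that picked expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = ToyTopKMoE()
print(layer(torch.randn(4, 64)).shape)                      # torch.Size([4, 64])
```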

Deepseek V3 Real Quick (like 1 slide kinda quick)

  • Open Source, GPT-4o quality, 37b active params, not the reasoning one
  • MoE model w/ 671B params and 37B active (chonky af to run)
  • Costs $5m to train lol
    • Blew up America real quick - $NVDA dropped ~$600B in market cap
  • Multi-head Latent Attention
  • 14.8T tokens -> SFT -> RL -> model is p good
  • Multi-Token Prediction (sample efficient; similar to Meta’s approach)
  • FP8 training
  • Some long context extension - 32k then 128k

DeepSeek V3: Key Features & Training

Open Source, Affordable, & Technically Advanced

  • Open Source & MIT License: Weights are publicly available.
  • Surprisingly Low Training Cost Claim: ~$5 Million (debated, but relatively cheap).
  • Standard Training Process + Context Extension:
    • Pre-training, SFT, RL.
    • 15 Trillion Tokens.
    • 32k & 128k Context Length Extension.
  • Technical Innovations:
    • “Multi-Head Latent Attention”
    • “Multi-Token Prediction” (sample efficiency; toy loss sketch below)
    • Auxiliary-Loss-Free Load Balancing (simplified from V2).
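
Multi-token prediction is worth a tiny illustration. The sketch below is only a toy version of the idea (each position predicts several future tokens and the losses are averaged); DeepSeek V3's actual MTP uses sequential prediction modules, so treat the independent linear heads here as an assumption made for clarity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMTPLoss(nn.Module):
    """Toy multi-token-prediction loss: one extra head per future offset.
    Each position predicts the token 1..depth steps ahead; losses are averaged."""
    def __init__(self, d_model=64, vocab_size=1000, depth=2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(depth))

    def forward(self, hidden, token_ids):                    # hidden: [B, T, d], token_ids: [B, T]
        loss = 0.0
        for offset, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-offset])               # positions that have a target `offset` ahead
            targets = token_ids[:, offset:]                  # the token `offset` steps ahead
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return loss / len(self.heads)

mtp = ToyMTPLoss()
print(mtp(torch.randn(2, 16, 64), torch.randint(0, 1000, (2, 16))))
```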

DeepSeek V3: Performance & Use Cases

Fast, Cheap, and General Purpose

  • Fast & Efficient: 37B active parameters enable speed and affordability.
  • GPT-4 Level General Tasks: Excellent for chatbots and general language applications.
  • Good for Speed over Reasoning: Prioritizes efficiency; reasoning models may be slower.
  • V3 still shines: Future work suggests V3 outperforms reasoning models in some areas.

DeepSeek R1-Zero: Pure Reasoning Power

RL-Driven Reasoning Model

  • Novelty: First reasoning model discussed, highlighting innovative approach.
  • Pure RL Training (GRPO): Trained directly on DeepSeek V3 without supervised reasoning data (SFT).
  • GRPO (Group Relative Policy Optimization):
    • No Critic Model: Reduced compute cost.
    • Group-Based Rewards: Rewards relative performance within sampled output groups.
    • Stability Penalty (KL Divergence): Prevents drastic policy changes.

Deepseek R1-Zero

  • Apply pure RL directly to a V3-base model without any SFT data

  • Uses GRPO for RL which was introduced in the DeepSeek Math paper

    • Reward is based on both Accuracy and Format (a minimal sketch follows this list)
    • Responses are verifiably accurate - math that checks out, LeetCode-style code that compiles and passes tests
    • Format rewards - output thinking between <think> ... </think> tags
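
A minimal sketch of what a rule-based reward along those lines could look like (the exact tag schema, answer extraction, and equal weighting here are assumptions, not DeepSeek's published code):

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if reasoning is wrapped in <think>...</think> followed by <answer>...</answer>."""
    pattern = r"^<think>.+?</think>\s*<answer>.+?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted final answer matches the verifiable reference.
    A real checker would normalize math expressions or run code against test cases."""
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    predicted = m.group(1).strip() if m else ""
    return 1.0 if predicted == reference.strip() else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Equal weighting is an assumption; the paper only says both signals are used.
    return accuracy_reward(completion, reference) + format_reward(completion)

print(total_reward("<think>2+2=4</think> <answer>4</answer>", "4"))  # 2.0
```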

R1-Zero: Reward Function & Training Data

Incentivizing Reasoning through RL

  • Reward Function Components:
    • Accuracy: Verifiable correctness (math, code compilation).
    • Format Reward: Encourages <think> and <answer> tags for structure.
    • Verifiable Correctness: Training data from verifiable domains.
  • Training Data: “Hard Questions” Datasets (Math, Code, Reasoning Tasks).
  • Emergent Reasoning: Reasoning abilities develop without explicit reasoning examples in training data.

R1-Zero: Emergent Capabilities

Witnessing Intelligence Emerge

  • Increased Accuracy with Training Steps: Performance consistently improves with more RL steps.
  • Increased Response Length (Reasoning Steps): Model learns to reason more deeply, increasing response length.
  • Reflections: Model revisits and re-evaluates previous reasoning steps, exploring alternatives.
  • Aha Moments: Model recognizes efficient approaches mid-reasoning, pivoting strategy.
  • Quote: RL unlocks problem-solving strategies autonomously, “aha moment for researchers.”

GRPO

  • No critic model
    • Uses a group of sampled generations to estimate the baseline, cutting compute cost
  • Group-based rewards - each output is scored relative to its sampled group, rewarding relative performance

  • Stability updates - limits policy changes via a clipped objective and KL-divergence penalty, so one unusually good sample can’t cause a drastic update (see the sketch below)
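
A minimal sketch of the group-relative advantage and the clipped objective, following the GRPO formulation from the DeepSeek Math paper (the clip and KL coefficients below are illustrative values, not the paper's exact hyperparameters):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """For a group of G completions sampled from the same prompt, the advantage
    of each completion is its reward normalized by the group mean and std,
    replacing a learned critic/value model."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new, logp_old, advantages, kl_to_ref, clip_eps=0.2, beta=0.04):
    """PPO-style clipped surrogate over the group plus a KL penalty toward a
    frozen reference policy, which keeps updates stable even if one sample
    scores far above the rest."""
    ratio = torch.exp(logp_new - logp_old)                  # importance ratio per completion
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean() + beta * kl_to_ref.mean()

# e.g. a group of 4 sampled completions for one prompt
rewards = torch.tensor([2.0, 0.0, 1.0, 2.0])
print(grpo_advantages(rewards))
```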

R1-Zero: Performance & Benchmarks

Strong Reasoning Performance

  • Strong Benchmarks: Performs well on reasoning tasks (Math, Code, etc.).
  • Competitive with o1 & o1-mini: Approaches or surpasses OpenAI models on reasoning benchmarks.
  • Majority Voting Boost: Significant performance gains with majority voting (sampling many answers and selecting the most consistent one; sketch below).
  • Trained on Hard Questions: Quality and difficulty of training data crucial for emergent abilities.
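
The majority-voting (cons@64-style) evaluation is simple enough to show directly; a hedged sketch, assuming the final answers have already been extracted from each sampled completion:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Sample many completions for one problem, extract each final answer,
    and keep the most common one as the model's consensus answer."""
    return Counter(answers).most_common(1)[0][0]

# e.g. 64 sampled answers to one AIME problem, most of which agree
print(majority_vote(["42", "41", "42", "42", "17", "42"]))   # -> "42"
```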

R1-Zero: Limitations

Imperfect Reasoning Model

  • Poor Readability: Reasoning traces can be verbose, unstructured, and not human-friendly.
  • Language Mixing: Inconsistent switching between English & Chinese (paper highlights this issue).
  • Not a General Chat Model: Optimized for reasoning, lacks general chat, safety, and conciseness.
  • Focus: Pure reasoning capability, not a user-friendly assistant.

DeepSeek R1: Reasoning Chat Model

Bridging Reasoning & Conversation

  • Building on R1-Zero: R1 leverages R1-Zero’s reasoning capabilities.
  • Goal: Create a usable reasoning chat model.
  • Four-Stage Training Process:
    1. Cold Start (SFT on a small set of curated long-CoT data)
    2. Reasoning-Oriented RL (R1-Zero style)
    3. Rejection Sampling + SFT (reasoning + general chat data)
    4. Final RL Stage (“Double RL” for general use)

R1: Performance & Distillation

Top Tier & Easily Transferable

  • Reasoning Chat Model: Retains the reasoning strength of R1-Zero, adds chat functionality.
  • o1-Level Performance: Positioned as on par with OpenAI’s o1 models overall.
  • Distillation to Llama 3 & Qwen: Reasoning abilities transferred to other open-source models efficiently.
  • Distillation > RL Fine-tuning: Distillation found to be more effective and computationally cheaper for transferring reasoning.
  • Implications: Widespread access to reasoning capabilities in various models.

Deepseek R1-Zero

  • R1-Zero performs really well at reasoning without any labeled SFT data - it’s just a base model trained with RL to produce correct, reasoned responses

  • Does really well with majority voting (sample multiple attempts and take the most consistent answer)

  • They trained this thing on very hard questions

  • Charts show inference-time compute is correlated w/ eval performance (benchmark table next)

| Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating) |
|---|---|---|---|---|---|---|
| OpenAI-o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| OpenAI-o1-0912 | 74.4 | 83.3 | 94.8 | 77.3 | 63.4 | 1843 |
| DeepSeek-R1-Zero | 71.0 | 86.7 | 95.9 | 73.3 | 50.0 | 1444 |

*Table 2: Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks.*

Deepseek R1-Zero - Emergent Behaviors

  • Naturally acquires the ability to solve complex tasks by extending test-time compute
    • Ranges from hundreds to thousands of reasoning tokens
  • Emergence of interesting behaviors as test-time compute increases

  • Reflections - the model revisits and reevaluates previous steps and explores alternatives; this arises spontaneously (not explicitly programmed)

  • Aha Moment - the model takes more time to think by reevaluating its original approach

“This moment is not only an “aha moment” for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The “aha moment” serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.”


Deepseek-R1

  • R1-Zero had poor readability and language mixing, so they fixed that with R1

  • Cold start -> RL for reasoning -> Rejection Sampling + SFT -> RL

  • (Stage 1) - Cold start training with strong SFT prevents model from going unstable early

    • Use a long-CoT few-shot prompt to get models to generate detailed answers w/ reflection and verification, collect R1-Zero outputs, and post-process w/ human annotators
    • Better readability -> output format of <tok> <reasoning> <tok> <summary>
    • On the scale of thousands of examples
  • (Stage 2) - Reasoning based RL (similar to R1-Zero)

    • Same RL process, plus a language consistency reward (sketch below)
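
The paper describes the language consistency reward as the proportion of target-language words in the CoT; a minimal sketch, where `is_target_language` is a hypothetical word-level detector:

```python
def language_consistency_reward(cot_words: list[str], is_target_language) -> float:
    """Fraction of CoT words in the target language (e.g. English),
    added to the task reward to discourage language mixing."""
    if not cot_words:
        return 0.0
    return sum(1 for w in cot_words if is_target_language(w)) / len(cot_words)

# toy usage: treat pure-ASCII words as "target language"
print(language_consistency_reward(["let", "x", "=", "2"], lambda w: w.isascii()))  # 1.0
```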

Deepseek-R1 (Continued)

  • (Stage 3) - Rejection Sampling

    • Generate completions, rank them w/ a reward model, and finetune the original model (see the sketch after this list)
    • This was standard in Llama 3 and many other pipelines
    • 800k completions total - 600k reasoning, 200k general chat problems
  • (Stage 4) - Final RL training for general use

    • Make model helpful and harmless while making reasoning good
    • For reasoning - they use R1-Zero style questions (like hard math, code, etc.)
    • For general chat - capture human preference in nuanced scenarios, looking at both process & output
  • The model is now very good at reasoning alongside normal chat use
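
A hedged sketch of the rejection-sampling step referenced in Stage 3: `generate` and `score` are hypothetical callables (a sampling endpoint and a reward model or rule-based checker); the real pipeline also filters for readability and mixes in non-reasoning data:

```python
def rejection_sample(prompt: str, generate, score, n: int = 16, keep: int = 1):
    """Sample n completions for a prompt, rank them by score, and keep the best
    ones as (prompt, completion) pairs for supervised fine-tuning."""
    completions = [generate(prompt) for _ in range(n)]
    ranked = sorted(completions, key=score, reverse=True)
    return [(prompt, c) for c in ranked[:keep]]

# sft_data = [pair for p in hard_prompts for pair in rejection_sample(p, generate, score)]
```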


Distillation

  • Distil outputs from R1 into Llama & Qwen models
    • 800k reasoning samples
    • Basic SFT - no RL
    • RL could do better, but they’re leaving that for the broader research community to try
  • Models perform really well and output reasoning traces (data-formatting sketch below)
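
Since the distillation here is plain SFT on teacher outputs, the data step is roughly just formatting R1's traces into training targets; a minimal sketch, with the tag layout an assumption carried over from the earlier format reward:

```python
def build_distillation_example(prompt: str, r1_trace: str, r1_answer: str) -> dict:
    """Turn one R1 completion (reasoning trace + final answer) into an SFT pair
    for a student model such as Qwen or Llama to imitate."""
    target = f"<think>{r1_trace}</think>\n<answer>{r1_answer}</answer>"
    return {"prompt": prompt, "completion": target}

print(build_distillation_example("What is 2 + 2?", "2 + 2 = 4", "4"))
```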

Open Source & Practical Aspects

Accessibility & Efficiency

  • Open Weights (MIT License): V3, R1-Zero, R1, and distilled models are open. (No training data/code).
  • DeepSeek API Advantages:
    • 3-10x Faster & Cheaper than other providers.
    • Optimized infrastructure due to DeepSeek’s model understanding.
    • Caution: Data sent to DeepSeek servers.
  • Paradigm Shift: Focus on inference efficiency & specialized RL over just scaling size.
  • Highly Capable, Affordable Models: Demonstrates achieving strong performance without massive compute costs.

Future Work & Conclusion

The Power of RL for Reasoning

  • Future Directions: Further improvements to reasoning models anticipated.
  • Key Takeaway: Reinforcement Learning (GRPO) is highly effective for developing strong reasoning in LLMs.
  • No Supervised Reasoning Data Needed: RL achieves reasoning without explicit examples.
  • Emergent Intelligence: RL unlocks potential for autonomous problem-solving and advanced AI.
  • “Aha Moment” for Researchers: RL’s power to create intelligent behavior is significant and inspiring.

Questions & Discussion

  • Total vs. Active Parameters (MoE): Efficiency at scale.
  • FP8 Training (V3): 8-bit floating-point mixed-precision training for efficiency.
  • $5M Training Cost: Plausible ballpark, debated but relatively cheap.
  • RL Scoring Function: Accuracy, Format, Verifiable Correctness.
  • Even Simple Reasoning Transfers: Initial learnings in simple reasoning translate to complex tasks.

This post is licensed under CC BY 4.0 by the author.