
DeepSeek V3, R1-Zero, & R1: A Detailed Overview

Introduction: DeepSeek’s Breakthrough Models

Based on a video summary of the DeepSeek Math Paper by Vibhu Sapra at Latent Space: Watch the video

  • Presentation overview of DeepSeek’s latest language models: V3, R1-Zero, and R1.
  • Focus on architecture, training methods, and performance.
  • Key Shift: Moving beyond simple scaling to efficient reasoning capabilities through novel RL techniques.
  • Initially dismissed V3, but it’s crucial as the base model for R1-Zero & R1.
  • Presentation Outline:
    • Two main models: R1-Zero & R1
    • Inference Time Scaling & Test Time Compute
    • Emergent Reflection & Aha Moments

High Level Overview

  • Two models: R1-Zero and R1
    • R1-Zero is a great reasoning-only model, trained with pure RL and no labeled CoT data, but it’s not a good general model
    • R1 is created using outputs from R1-Zero and a 4-stage training method
  • They distil outputs from R1 into Qwen & Llama models
    • Not trained natively with RL, just SFT on distilled outputs, yet these models perform really well
    • They test their distillation vs. regular post-training and find their approach does much, much better
    • They release all of these as a series of models too
  • No talk about data or where it comes from lol

  • Performance is crazy good

  • Models are fully OS w/ MIT license - no training data or code

  • DeepSeek API is 3-10x faster & cheaper than other Infra (but data -> CCP)

DeepSeek V3: The Foundation

Standard LLM, Exceptional Efficiency

  • Base Model: Large Language Model serving as foundation for reasoning models.
  • GPT-4 Level Performance: Comparable capabilities for general language tasks.
  • Mixture of Experts (MoE):
    • 671 Billion Total Parameters
    • 37 Billion Active Parameters (Inference)
    • GPT-4-level performance at ~37B active-parameter inference cost.
    • Efficient batched inference at scale (minimal routing sketch below).
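
To make the total-vs-active distinction concrete, here is a minimal, hypothetical top-k routing sketch (not DeepSeek's actual DeepSeekMoE implementation, which uses fine-grained and shared experts); it just illustrates why only a fraction of the weights run per token:

```python
import torch
import torch.nn as nn

class ToyTopKMoE(nn.Module):
    """Toy MoE layer: every expert counts toward total parameters,
    but each token is routed through only k experts (the 'active' params)."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                   # x: [n_tokens, d_model]
        gate = self.router(x).softmax(dim=-1)               # routing probabilities
        weights, idx = gate.topk(self.k, dim=-1)            # keep top-k experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e                    # tokens that picked expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = ToyTopKMoE()
print(layer(torch.randn(4, 64)).shape)                      # torch.Size([4, 64])
```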

Deepseek V3 Real Quick (like 1 slide kinda quick)

  • Open Source, GPT-4o quality, 37b active params, not the reasoning one
  • MoE model w/ 671B params and 37B active (chonky af to run)
  • Costs $5m to train lol
    • Blew up America real quick - $NVDA dropped ~$600B in market cap
  • Multi-head Latent Attention
  • 14.8T tokens -> SFT -> RL -> model is p good
  • Multi-Token Prediction (sample efficient; similar to Meta’s approach)
  • FP8 training
  • Some long context extension - 32k then 128k

DeepSeek V3: Key Features & Training

Open Source, Affordable, & Technically Advanced

  • Open Source & MIT License: Weights are publicly available.
  • Surprisingly Low Training Cost Claim: ~$5 Million (debated, but relatively cheap).
  • Standard Training Process + Context Extension:
    • Pre-training, SFT, RL.
    • 15 Trillion Tokens.
    • 32k & 128k Context Length Extension.
  • Technical Innovations:
    • “Multi-Head Latent Attention”
    • “Multi-Token Prediction” (sample efficiency; toy loss sketch below)
    • Auxiliary-Loss-Free Load Balancing (simplified from V2).
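
Multi-token prediction is worth a tiny illustration. The sketch below is only a toy version of the idea (each position predicts several future tokens and the losses are averaged); DeepSeek V3's actual MTP uses sequential prediction modules, so treat the independent linear heads here as an assumption made for clarity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMTPLoss(nn.Module):
    """Toy multi-token-prediction loss: one extra head per future offset.
    Each position predicts the token 1..depth steps ahead; losses are averaged."""
    def __init__(self, d_model=64, vocab_size=1000, depth=2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(depth))

    def forward(self, hidden, token_ids):                    # hidden: [B, T, d], token_ids: [B, T]
        loss = 0.0
        for offset, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-offset])               # positions that have a target `offset` ahead
            targets = token_ids[:, offset:]                  # the token `offset` steps ahead
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return loss / len(self.heads)

mtp = ToyMTPLoss()
print(mtp(torch.randn(2, 16, 64), torch.randint(0, 1000, (2, 16))))
```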

DeepSeek V3: Performance & Use Cases

Fast, Cheap, and General Purpose

  • Fast & Efficient: 37B active parameters enable speed and affordability.
  • GPT-4 Level General Tasks: Excellent for chatbots and general language applications.
  • Good for Speed over Reasoning: Prioritizes efficiency; reasoning models may be slower.
  • V3 still shines: Future work suggests V3 outperforms reasoning models in some areas.

DeepSeek R1-Zero: Pure Reasoning Power

RL-Driven Reasoning Model

  • Novelty: First reasoning model discussed, highlighting innovative approach.
  • Pure RL Training (GRPO): Trained directly on DeepSeek V3 without supervised reasoning data (SFT).
  • GRPO (Group Relative Policy Optimization):
    • No Critic Model: Reduced compute cost.
    • Group-Based Rewards: Rewards relative performance within sampled output groups.
    • Stability Penalty (KL Divergence): Prevents drastic policy changes.

Deepseek R1-Zero

  • Apply pure RL directly to a V3-base model without any SFT data

  • Uses GRPO for RL which was introduced in the DeepSeek Math paper

    • Reward is based on both Accuracy and Format (a minimal sketch follows this list)
    • Responses are verifiably accurate - math that checks out, LeetCode-style code that compiles and passes tests
    • Format rewards - output thinking between <think> ... </think> tags
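
A minimal sketch of what a rule-based reward along those lines could look like (the exact tag schema, answer extraction, and equal weighting here are assumptions, not DeepSeek's published code):

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if reasoning is wrapped in <think>...</think> followed by <answer>...</answer>."""
    pattern = r"^<think>.+?</think>\s*<answer>.+?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted final answer matches the verifiable reference.
    A real checker would normalize math expressions or run code against test cases."""
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    predicted = m.group(1).strip() if m else ""
    return 1.0 if predicted == reference.strip() else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Equal weighting is an assumption; the paper only says both signals are used.
    return accuracy_reward(completion, reference) + format_reward(completion)

print(total_reward("<think>2+2=4</think> <answer>4</answer>", "4"))  # 2.0
```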

R1-Zero: Reward Function & Training Data

Incentivizing Reasoning through RL

  • Reward Function Components:
    • Accuracy: Verifiable correctness (math, code compilation).
    • Format Reward: Encourages <think> and <answer> tags for structure.
    • Verifiable Correctness: Training data from verifiable domains.
  • Training Data: “Hard Questions” Datasets (Math, Code, Reasoning Tasks).
  • Emergent Reasoning: Reasoning abilities develop without explicit reasoning examples in training data.

R1-Zero: Emergent Capabilities

Witnessing Intelligence Emerge

  • Increased Accuracy with Training Steps: Performance consistently improves with more RL steps.
  • Increased Response Length (Reasoning Steps): Model learns to reason more deeply, increasing response length.
  • Reflections: Model revisits and re-evaluates previous reasoning steps, exploring alternatives.
  • Aha Moments: Model recognizes efficient approaches mid-reasoning, pivoting strategy.
  • Quote: RL unlocks problem-solving strategies autonomously, “aha moment for researchers.”

GRPO

  • No critic model
    • Uses a group of sampled generations to estimate the baseline, cutting compute cost
  • Group-based rewards - each output is scored relative to its sampled group, rewarding relative performance

  • Stability updates - limits policy changes via a clipped objective and KL-divergence penalty, so one unusually good sample can’t cause a drastic update (see the sketch below)
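
A minimal sketch of the group-relative advantage and the clipped objective, following the GRPO formulation from the DeepSeek Math paper (the clip and KL coefficients below are illustrative values, not the paper's exact hyperparameters):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """For a group of G completions sampled from the same prompt, the advantage
    of each completion is its reward normalized by the group mean and std,
    replacing a learned critic/value model."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new, logp_old, advantages, kl_to_ref, clip_eps=0.2, beta=0.04):
    """PPO-style clipped surrogate over the group plus a KL penalty toward a
    frozen reference policy, which keeps updates stable even if one sample
    scores far above the rest."""
    ratio = torch.exp(logp_new - logp_old)                  # importance ratio per completion
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean() + beta * kl_to_ref.mean()

# e.g. a group of 4 sampled completions for one prompt
rewards = torch.tensor([2.0, 0.0, 1.0, 2.0])
print(grpo_advantages(rewards))
```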

R1-Zero: Performance & Benchmarks

Strong Reasoning Performance

  • Strong Benchmarks: Performs well on reasoning tasks (Math, Code, etc.).
  • Competitive with o1 & o1-mini: Approaches or surpasses OpenAI models on reasoning benchmarks.
  • Majority Voting Boost: Significant performance gains with majority voting (sampling many answers and selecting the most consistent one; sketch below).
  • Trained on Hard Questions: Quality and difficulty of training data crucial for emergent abilities.
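
The majority-voting (cons@64-style) evaluation is simple enough to show directly; a hedged sketch, assuming the final answers have already been extracted from each sampled completion:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Sample many completions for one problem, extract each final answer,
    and keep the most common one as the model's consensus answer."""
    return Counter(answers).most_common(1)[0][0]

# e.g. 64 sampled answers to one AIME problem, most of which agree
print(majority_vote(["42", "41", "42", "42", "17", "42"]))   # -> "42"
```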

R1-Zero: Limitations

Imperfect Reasoning Model

  • Poor Readability: Reasoning traces can be verbose, unstructured, and not human-friendly.
  • Language Mixing: Inconsistent switching between English & Chinese (paper highlights this issue).
  • Not a General Chat Model: Optimized for reasoning, lacks general chat, safety, and conciseness.
  • Focus: Pure reasoning capability, not a user-friendly assistant.

DeepSeek R1: Reasoning Chat Model

Bridging Reasoning & Conversation

  • Building on R1-Zero: R1 leverages R1-Zero’s reasoning capabilities.
  • Goal: Create a usable reasoning chat model.
  • Four-Stage Training Process:
    1. Cold Start (SFT on a small set of curated long-CoT data)
    2. Reasoning-Oriented RL (R1-Zero style)
    3. Rejection Sampling + SFT (reasoning + general chat data)
    4. Final RL Stage (“Double RL” for general use)

R1: Performance & Distillation

Top Tier & Easily Transferable

  • Reasoning Chat Model: Retains the reasoning strength of R1-Zero, adds chat functionality.
  • o1-Level Performance: Positioned as on par with OpenAI’s o1 models overall.
  • Distillation to Llama 3 & Qwen: Reasoning abilities transferred to other open-source models efficiently.
  • Distillation > RL Fine-tuning: Distillation found to be more effective and computationally cheaper for transferring reasoning.
  • Implications: Widespread access to reasoning capabilities in various models.

Deepseek R1-Zero

  • R1-Zero performs really well at reasoning without any labeled SFT data - it’s just a base model trained with RL to produce correct, reasoned responses

  • Does really well with majority voting (sample multiple attempts and take the most consistent answer)

  • They trained this thing on very hard questions

  • Charts show inference-time compute is correlated w/ eval performance (benchmark table next)

| Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating) |
|---|---|---|---|---|---|---|
| OpenAI-o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| OpenAI-o1-0912 | 74.4 | 83.3 | 94.8 | 77.3 | 63.4 | 1843 |
| DeepSeek-R1-Zero | 71.0 | 86.7 | 95.9 | 73.3 | 50.0 | 1444 |

*Table 2: Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks.*

Deepseek R1-Zero - Emergent Behaviors

  • Naturally acquires the ability to solve complex tasks by extending test-time compute
    • Ranges from hundreds to thousands of reasoning tokens
  • Emergence of interesting behaviors as test-time compute increases

  • Reflections - the model revisits and reevaluates previous steps and explores alternatives; this arises spontaneously (not explicitly programmed)

  • Aha Moment - the model takes more time to think by reevaluating its original approach

“This moment is not only an “aha moment” for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The “aha moment” serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.”


Deepseek-R1

  • R1-Zero had poor readability and language mixing, so they fixed that with R1

  • Cold start -> RL for reasoning -> Rejection Sampling + SFT -> RL

  • (Stage 1) - Cold start training with strong SFT prevents model from going unstable early

    • Use a long-CoT few-shot prompt to get models to generate detailed answers w/ reflection and verification, collect R1-Zero outputs, and post-process w/ human annotators
    • Better readability -> output format of <tok> <reasoning> <tok> <summary>
    • On the scale of thousands of examples
  • (Stage 2) - Reasoning based RL (similar to R1-Zero)

    • Same RL process, plus a language consistency reward (sketch below)
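
The paper describes the language consistency reward as the proportion of target-language words in the CoT; a minimal sketch, where `is_target_language` is a hypothetical word-level detector:

```python
def language_consistency_reward(cot_words: list[str], is_target_language) -> float:
    """Fraction of CoT words in the target language (e.g. English),
    added to the task reward to discourage language mixing."""
    if not cot_words:
        return 0.0
    return sum(1 for w in cot_words if is_target_language(w)) / len(cot_words)

# toy usage: treat pure-ASCII words as "target language"
print(language_consistency_reward(["let", "x", "=", "2"], lambda w: w.isascii()))  # 1.0
```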

Deepseek-R1 (Continued)

  • (Stage 3) - Rejection Sampling

    • Generate completions, rank them w/ a reward model, and finetune the original model (see the sketch after this list)
    • This was standard in Llama 3 and many other pipelines
    • 800k completions total - 600k reasoning, 200k general chat problems
  • (Stage 4) - Final RL training for general use

    • Make model helpful and harmless while making reasoning good
    • For reasoning - they use R1-Zero style questions (like hard math, code, etc.)
    • For general chat - capture human preference in nuanced scenarios, looking at both process & output
  • The model is now very good at reasoning alongside normal chat use
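
A hedged sketch of the rejection-sampling step referenced in Stage 3: `generate` and `score` are hypothetical callables (a sampling endpoint and a reward model or rule-based checker); the real pipeline also filters for readability and mixes in non-reasoning data:

```python
def rejection_sample(prompt: str, generate, score, n: int = 16, keep: int = 1):
    """Sample n completions for a prompt, rank them by score, and keep the best
    ones as (prompt, completion) pairs for supervised fine-tuning."""
    completions = [generate(prompt) for _ in range(n)]
    ranked = sorted(completions, key=score, reverse=True)
    return [(prompt, c) for c in ranked[:keep]]

# sft_data = [pair for p in hard_prompts for pair in rejection_sample(p, generate, score)]
```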


Distillation

  • Distil outputs from R1 into Llama & Qwen models
    • 800k reasoning samples
    • Basic SFT - no RL
    • RL could do better, but they’re leaving that for the broader research community to try
  • Models perform really well and output reasoning traces (data-formatting sketch below)
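
Since the distillation here is plain SFT on teacher outputs, the data step is roughly just formatting R1's traces into training targets; a minimal sketch, with the tag layout an assumption carried over from the earlier format reward:

```python
def build_distillation_example(prompt: str, r1_trace: str, r1_answer: str) -> dict:
    """Turn one R1 completion (reasoning trace + final answer) into an SFT pair
    for a student model such as Qwen or Llama to imitate."""
    target = f"<think>{r1_trace}</think>\n<answer>{r1_answer}</answer>"
    return {"prompt": prompt, "completion": target}

print(build_distillation_example("What is 2 + 2?", "2 + 2 = 4", "4"))
```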

Open Source & Practical Aspects

Accessibility & Efficiency

  • Open Weights (MIT License): V3, R1-Zero, R1, and distilled models are open. (No training data/code).
  • DeepSeek API Advantages:
    • 3-10x Faster & Cheaper than other providers.
    • Optimized infrastructure due to DeepSeek’s model understanding.
    • Caution: Data sent to DeepSeek servers.
  • Paradigm Shift: Focus on inference efficiency & specialized RL over just scaling size.
  • Highly Capable, Affordable Models: Demonstrates achieving strong performance without massive compute costs.

Future Work & Conclusion

The Power of RL for Reasoning

  • Future Directions: Further improvements to reasoning models anticipated.
  • Key Takeaway: Reinforcement Learning (GRPO) is highly effective for developing strong reasoning in LLMs.
  • No Supervised Reasoning Data Needed: RL achieves reasoning without explicit examples.
  • Emergent Intelligence: RL unlocks potential for autonomous problem-solving and advanced AI.
  • “Aha Moment” for Researchers: RL’s power to create intelligent behavior is significant and inspiring.

Questions & Discussion

  • Total vs. Active Parameters (MoE): Efficiency at scale.
  • FP8 Training (V3): 8-bit floating-point mixed-precision training for efficiency.
  • $5M Training Cost: Plausible ballpark, debated but relatively cheap.
  • RL Scoring Function: Accuracy, Format, Verifiable Correctness.
  • Even Simple Reasoning Transfers: Initial learnings in simple reasoning translate to complex tasks.

This post is licensed under CC BY 4.0 by the author.