DeepSeek R1: Main Takeaways and Insights
RLHF = Reinforcement Learning = Alignment tuning?
How they relate—and why they’re not identical
What is the ‘Aha Moment’ phenomenon in R1-Zero’s training?
What are the four phases of the DeepSeek R1 training process?
What is Group Relative Policy Optimization (GRPO)?
Worked Example with GRPO
Example Problem
Step 1: Group Sampling
Step 2: Advantage Calculation
Step 3: Policy Update
How does GRPO improve upon PPO for language model training?