DeepSeek’s GRPO (Group Relative Policy Optimization)
Abstract
In this video, I break down DeepSeek’s Group Relative Policy Optimization (GRPO) from first principles, without assuming prior knowledge of Reinforcement Learning. By the end, you’ll understand the core RL building blocks that led to GRPO, including:
- Policy Gradient Methods
- The REINFORCE Algorithm
- Actor-Critic Models
- PPO (Proximal Policy Optimization)
- GRPO (Group Relative Policy Optimization)
Papers:
- GRPO paper (DeepSeekMath): https://arxiv.org/pdf/2402.03300
- DeepSeek-R1 paper: https://arxiv.org/pdf/2501.12948
- PPO paper: https://arxiv.org/pdf/1707.06347
- GAE paper: https://arxiv.org/pdf/1506.02438
- TRPO paper: https://arxiv.org/pdf/1502.05477