Knowledge Distillation - How LLMs train each other
Abstract
Knowledge distillation was prominently discussed at LlamaCon 2025.
You’ll learn:
- What knowledge distillation really is (and what it’s not)
- How it helps scale LLMs without bloating inference cost
- The origin story from ensembles and model compression (2006) to Hinton’s “dark knowledge” paper (2015)
- Why “soft labels” carry more information than one-hot targets (see the loss sketch after this list)
- How companies like Google, Meta, and DeepSeek apply distillation differently
- What terms like temperature, behavioral cloning, and co-distillation actually mean
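
To give a concrete feel for the soft-label and temperature ideas listed above, here is a minimal sketch of a classic distillation loss in the style of Hinton et al. (2015), written in PyTorch. The function name `distillation_loss` and the default values for `T` and `alpha` are illustrative assumptions, not code from the video.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend of a soft-label (teacher) and a hard-label (one-hot) objective."""
    # Soft labels: the teacher's distribution, softened with temperature T.
    # Higher T spreads probability mass over wrong-but-plausible classes,
    # exposing the "dark knowledge" that a one-hot target throws away.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions; the T**2 factor
    # rescales gradients so the soft term stays comparable across T values.
    soft_loss = F.kl_div(student_log_probs, soft_targets,
                         reduction="batchmean") * (T ** 2)
    # Ordinary cross-entropy against the one-hot ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, targets)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example with random tensors: a batch of 4 examples, 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, targets)
loss.backward()
```

The `T ** 2` factor keeps the gradient scale of the soft term roughly constant as the temperature changes, which is why the two terms can be blended with a single `alpha`.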
Whether you’re building, training, or just trying to understand modern AI systems, this video gives you a deep but accessible introduction to how LLMs teach each other.