Distillation trains a smaller "student" model to mimic a larger "teacher" model's behavior. Instead of learning only from hard labels, the student learns from the teacher's soft probability distributions, capturing the teacher's "dark knowledge" about relationships between classes.
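As a minimal sketch of this idea, the loss below blends a KL-divergence term between temperature-softened teacher and student distributions with a standard cross-entropy term on the hard labels. The function name, `temperature`, and `alpha` values are illustrative assumptions, not prescribed settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Hypothetical combined distillation loss.

    temperature: softens the distributions so small probabilities
                 (the "dark knowledge") carry signal.
    alpha:       weight between the soft (teacher) and hard (label) terms.
    """
    # Soften both distributions; higher T spreads probability mass
    # across classes instead of concentrating it on the argmax.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the teacher's and student's soft distributions.
    # The T^2 factor keeps its gradient scale comparable to the hard loss.
    soft_loss = F.kl_div(log_student, soft_targets,
                         reduction="batchmean") * temperature ** 2

    # Ordinary cross-entropy against the ground-truth hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```

In practice the teacher's logits are computed under `torch.no_grad()` so only the student receives gradients.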
Distillation enables deploying smaller, faster models that retain much of the larger model's capability.