How does knowledge distillation make small models stronger?
Knowledge distillation strengthens small models by transferring knowledge from a larger, more capable teacher model rather than training solely on raw data labels. This lets the compact student model learn a richer signal than hard labels alone provide.
The core principle involves training the small model to mimic the teacher's softened output probabilities ("soft targets"), often generated using a higher temperature in the softmax function. These soft targets reveal the teacher's nuanced interpretations, such as similarities between classes. The student learns by minimizing a loss function combining prediction error on the hard labels and a distillation loss measuring alignment with the teacher's soft targets. Key conditions require a well-performing teacher and careful selection of temperature and loss weighting.
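The combined loss described above can be sketched in a few lines. This is a minimal NumPy illustration, not a production recipe: the function name `distillation_loss`, the toy 3-class logits, and the default `T` and `alpha` values are all illustrative choices, and the `T**2` scaling follows the common convention of keeping soft-target gradients comparable in magnitude to the hard-label term.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T yields softer probabilities."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    """Weighted sum of a distillation term and a hard-label term.

    alpha weights the soft-target (distillation) loss; 1 - alpha weights
    the ordinary cross-entropy on the ground-truth label.
    """
    # Soft targets: teacher and student distributions at temperature T.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL divergence from student to teacher, scaled by T^2 so its
    # gradient magnitude stays comparable as T grows.
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)))
    soft_loss = (T ** 2) * kl
    # Standard cross-entropy on the hard label at temperature 1.
    hard_loss = -np.log(softmax(student_logits)[hard_label] + 1e-12)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy example: 3-class logits for one sample.
teacher = np.array([3.0, 1.0, 0.2])
student = np.array([2.5, 1.2, 0.1])
loss = distillation_loss(student, teacher, hard_label=0)
```

Note that when the student's logits exactly match the teacher's, the KL term vanishes, so with `alpha=1.0` the loss drops to zero; tuning `T` and `alpha` is the "careful selection" the paragraph above refers to.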
This technique enables deploying surprisingly capable small models on resource-constrained devices like phones or edge hardware where the large teacher model is impractical. By capturing the teacher's generalized behavior patterns, distilled models often achieve significantly higher accuracy than small models trained conventionally. The primary value lies in obtaining near-teacher performance with drastically reduced computational cost, memory footprint, and inference latency.