Deep neural networks (DNNs) have become the cornerstone of modern machine learning, but they often demand extensive computational resources. To address this challenge, model compression techniques have emerged as a vital area of research, aiming to reduce computational and storage overhead while preserving accuracy. Existing methods, such as knowledge distillation (transferring knowledge from a large "teacher" model to a compact "student" model) and quantization (reducing the numerical precision of weights), have shown promise individually. However, the interplay between these techniques, particularly their combined impact on accuracy, storage efficiency, and inference speed, remains underexplored. In this project, we propose a hybrid compression strategy that systematically integrates knowledge distillation and post-training quantization to optimize the trade-off between model efficiency and performance.
Reference: Alex Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," 2009.
The primary inquiry of this research is to determine whether a hybrid compression strategy, integrating knowledge distillation with quantization, outperforms either technique applied on its own.
Specific Question: Does quantizing a student model obtained through knowledge distillation yield a better accuracy-efficiency trade-off than quantizing the teacher model directly?
We first assess several knowledge distillation (KD) techniques and then apply a greedy path-following quantization algorithm to reduce the bit width of the student model's weights. The following sections describe the knowledge distillation and quantization procedures separately.
Vanilla Knowledge Distillation (VKD):
This strategy employs a hybrid loss that combines the cross-entropy on the ground-truth labels (L_C) with the Kullback-Leibler divergence from the teacher's soft targets (D_KL), softened by a temperature T. The loss is

L_VKD = (1 − α) · L_C(y, σ(z_s)) + α · T² · D_KL( σ(z_t / T) ‖ σ(z_s / T) ),

where z_s and z_t are the student and teacher logits, σ is the softmax function, and α balances the two terms.
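A minimal PyTorch sketch of this loss follows; the function and tensor names (`vkd_loss`, `student_logits`, `teacher_logits`) and the default values of T and α are our own illustrative choices, not taken from the original implementation.

```python
import torch
import torch.nn.functional as F

def vkd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Vanilla KD loss: weighted sum of hard-label cross-entropy and
    temperature-scaled KL divergence to the teacher's soft targets.
    (Illustrative sketch; T and alpha are assumed hyperparameters.)"""
    # Cross-entropy with the ground-truth labels (L_C)
    ce = F.cross_entropy(student_logits, labels)
    # KL divergence between softened teacher and student distributions (D_KL)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # T^2 rescaling keeps gradient magnitudes comparable
    return (1 - alpha) * ce + alpha * kl
```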
Mixup:
This strategy improves knowledge distillation performance by creating new training samples through linear interpolation of image and label pairs. Given two samples (x_i, y_i) and (x_j, y_j), it generates

x̃ = λ · x_i + (1 − λ) · x_j,  ỹ = λ · y_i + (1 − λ) · y_j,

where λ ∈ [0, 1] is drawn from a Beta(α, α) distribution.
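A sketch of this interpolation in PyTorch is shown below; we assume a single λ shared across the batch and random pairing within the batch, which are our simplifications rather than details from the original setup.

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=1.0):
    """Create mixed inputs and label pairs by linear interpolation.
    Returns the mixed images plus both label sets and the mixing weight,
    so the training loss can be computed as lam * L(y_a) + (1 - lam) * L(y_b)."""
    lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
    index = torch.randperm(x.size(0))          # random pairing within the batch
    mixed_x = lam * x + (1 - lam) * x[index]
    return mixed_x, y, y[index], lam
```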
Deep Mutual Learning (DML):
This strategy utilizes two ResNet18 student models with different initial weights: one pretrained on CIFAR-10/CIFAR-100 and another initialized with ImageNet-derived weights. This diversity promoted varied learning outcomes and sped up knowledge acquisition, notably reducing training time for the CIFAR-pretrained model.
During training, we optimized both models, Θ1 and Θ2, with the respective loss functions

L_Θ1 = L_C1 + D_KL(p_2 ‖ p_1),  L_Θ2 = L_C2 + D_KL(p_1 ‖ p_2),

where p_1 and p_2 are the softmax outputs of the two networks and L_C1, L_C2 are their cross-entropy losses on the ground-truth labels.
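A hedged PyTorch sketch of the two DML loss terms under these definitions is given below; detaching the peer's predictions (so each network treats the other's output as a fixed target within a step) is our assumption about the update scheme.

```python
import torch.nn.functional as F

def dml_losses(logits1, logits2, labels):
    """Deep Mutual Learning: each network is trained with its own
    cross-entropy plus a KL term pulling it toward its peer's predictions."""
    ce1 = F.cross_entropy(logits1, labels)
    ce2 = F.cross_entropy(logits2, labels)
    # D_KL(p_2 || p_1): supervises network 1 with network 2's (detached) outputs
    kl1 = F.kl_div(F.log_softmax(logits1, dim=1),
                   F.softmax(logits2, dim=1).detach(),
                   reduction="batchmean")
    # D_KL(p_1 || p_2): supervises network 2 with network 1's (detached) outputs
    kl2 = F.kl_div(F.log_softmax(logits2, dim=1),
                   F.softmax(logits1, dim=1).detach(),
                   reduction="batchmean")
    return ce1 + kl1, ce2 + kl2
```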
Decoupled Knowledge Distillation (DKD):
This strategy divides the distillation loss into two key components: target-class knowledge distillation (TCKD), which transfers the teacher's knowledge about the target class, and non-target-class knowledge distillation (NCKD), which transfers knowledge among the remaining classes. The two terms are weighted independently, so the overall loss takes the form L_DKD = α · TCKD + β · NCKD.
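The sketch below follows the published DKD formulation in PyTorch; the hyperparameter defaults (α, β, T) and the large negative constant used to mask the target logit are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dkd_loss(student_logits, teacher_logits, labels, alpha=1.0, beta=8.0, T=4.0):
    """Decoupled KD: split the distillation loss into a target-class term
    (TCKD) and a non-target-class term (NCKD), weighted by alpha and beta."""
    num_classes = student_logits.size(1)
    gt = F.one_hot(labels, num_classes).float()      # 1 at the target class
    p_s = F.softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)

    # TCKD: KL between binary (target vs. all non-target) distributions
    s_bin = torch.stack([(p_s * gt).sum(1), (p_s * (1 - gt)).sum(1)], dim=1)
    t_bin = torch.stack([(p_t * gt).sum(1), (p_t * (1 - gt)).sum(1)], dim=1)
    tckd = F.kl_div(torch.log(s_bin + 1e-8), t_bin, reduction="batchmean") * T ** 2

    # NCKD: KL over the non-target classes only (target logit masked out)
    log_p_s_nt = F.log_softmax(student_logits / T - 1000.0 * gt, dim=1)
    p_t_nt = F.softmax(teacher_logits / T - 1000.0 * gt, dim=1)
    nckd = F.kl_div(log_p_s_nt, p_t_nt, reduction="batchmean") * T ** 2

    return alpha * tckd + beta * nckd
```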
Quantization:
This strategy applies greedy layer-wise quantization guided by an analysis of each layer's impact on accuracy, compressing the model while maintaining performance.
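One way to realize the layer-wise greedy selection described above is sketched below. This is not the exact GPFQ weight-update rule; the quantizer, the `evaluate` callback, and the accuracy-drop threshold are our own illustrative assumptions.

```python
import torch

def uniform_quantize(w, bits):
    """Symmetric uniform quantization of a weight tensor to the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax if w.abs().max() > 0 else 1.0
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def greedy_layerwise_quantize(model, evaluate, bits=4, max_drop=0.5):
    """Greedily quantize one layer at a time, always committing the layer whose
    quantization hurts validation accuracy the least, and stopping once the
    cumulative drop exceeds `max_drop` accuracy points.
    `evaluate(model) -> accuracy` is assumed to be supplied by the caller."""
    baseline = evaluate(model)
    remaining = [m for m in model.modules()
                 if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    while remaining:
        trials = []
        for layer in remaining:
            original = layer.weight.data.clone()
            layer.weight.data = uniform_quantize(original, bits)
            trials.append((evaluate(model), layer, original))  # measure impact
            layer.weight.data = original                       # roll back for next trial
        acc, best_layer, original = max(trials, key=lambda t: t[0])
        if baseline - acc > max_drop:
            break                                              # further quantization hurts too much
        best_layer.weight.data = uniform_quantize(original, bits)  # commit best layer
        remaining.remove(best_layer)
    return model
```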
Our experimental findings report the accuracy of quantized student models trained with the various knowledge distillation techniques, alongside a quantized teacher model, across a range of bit widths (2, 3, 4, 5, 6, 7, 8, 12, 16, 20, 26, 32); the tables below list results for 2-8 bits and full precision (32 bits).
Accuracy (%) on CIFAR-10 across bit widths:

| Model | 2-bit | 3-bit | 4-bit | 5-bit | 6-bit | 7-bit | 8-bit | 32-bit |
|---|---|---|---|---|---|---|---|---|
| Teacher | 90.90 | 91.29 | 91.55 | 91.60 | 91.88 | 91.88 | 92.00 | 92.25 |
| VKD Student | 71.87 | 82.40 | 86.47 | 88.78 | 88.89 | 89.36 | 89.69 | 90.77 |
| Mixup Student | 88.12 | 92.63 | 94.20 | 95.09 | 95.21 | 95.18 | 95.45 | 95.77 |
| DML Student | 82.39 | 86.98 | 90.45 | 91.84 | 92.00 | 92.43 | 92.54 | 92.89 |
| DKD Student | 61.06 | 69.69 | 77.69 | 81.76 | 83.45 | 84.16 | 84.39 | 89.95 |
Accuracy (%) on CIFAR-100 across bit widths:

| Model | 2-bit | 3-bit | 4-bit | 5-bit | 6-bit | 7-bit | 8-bit | 32-bit |
|---|---|---|---|---|---|---|---|---|
| Teacher | 66.63 | 74.17 | 75.06 | 75.84 | 75.73 | 75.78 | 75.80 | 76.43 |
| VKD Student | 26.23 | 47.65 | 59.75 | 65.93 | 68.70 | 69.85 | 70.54 | 75.33 |
| Mixup Student | 33.07 | 46.50 | 57.10 | 60.33 | 61.83 | 61.77 | 62.16 | 62.17 |
| DML Student | 19.22 | 37.06 | 52.96 | 62.94 | 65.98 | 69.89 | 70.87 | 75.12 |
| DKD Student | 22.20 | 33.11 | 44.78 | 49.76 | 52.20 | 52.76 | 52.85 | 58.01 |
Overall, the quantized teacher model outperforms the quantized student models on CIFAR-100, particularly at low bit widths such as 2 bits. On the less complex CIFAR-10 dataset, however, this advantage diminishes: the Mixup student exceeds the quantized teacher at every bit width except 2 bits, and the DML student matches or exceeds it from 5 bits upward.