Balancing Accuracy and Efficiency: A Comparative Study of Knowledge Distillation and Post-Training Quantization Sequences

UC San Diego - Halıcıoğlu Data Science Institute (HDSI)

Introduction

Deep neural networks (DNNs) have become the cornerstone of modern machine learning, but they often demand computational and storage resources that are difficult to provide in deployment settings. To address this gap, model compression has emerged as a vital area of research, aiming to reduce computational and storage overhead while preserving accuracy. Existing methods such as knowledge distillation (transferring knowledge from a large "teacher" model to a compact "student" model) and quantization (reducing the numerical precision of weights) have shown promise individually. However, the interplay between these techniques, particularly their combined impact on accuracy, storage efficiency, and inference speed, remains underexplored. In this project, we propose a hybrid compression strategy that systematically integrates knowledge distillation and post-training quantization to optimize the trade-off between model efficiency and performance.

Dataset

  • Datasets: CIFAR-10 (10 classes) and CIFAR-100 (100 classes)
  • Image Size: 32×32 pixels (color)
  • Number of Images: 60,000 per dataset (50,000 training images + 10,000 test images)

Reference: Alex Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," 2009.
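
As a point of reference, both benchmarks can be loaded with torchvision. The sketch below is illustrative rather than our exact preprocessing; the normalization constants are the commonly used CIFAR-10 channel statistics (an assumption, not a value taken from this work):

    import torchvision
    import torchvision.transforms as T
    from torch.utils.data import DataLoader

    # Commonly used CIFAR-10 channel statistics (assumed; swap in CIFAR-100 statistics as needed).
    transform = T.Compose([
        T.ToTensor(),
        T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])

    # 50,000 training images and 10,000 test images, each a 32x32 RGB image.
    train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=transform)
    test_set = torchvision.datasets.CIFAR10("./data", train=False, download=True, transform=transform)
    # For CIFAR-100, substitute torchvision.datasets.CIFAR100.

    train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
    test_loader = DataLoader(test_set, batch_size=256, shuffle=False, num_workers=2)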

Research Question

The primary question of this research is whether a hybrid compression strategy that integrates knowledge distillation with post-training quantization outperforms either technique applied alone.

Specific Question: Does quantizing a student model obtained through knowledge distillation yield a better accuracy-efficiency trade-off than quantizing the teacher model directly?

Pipeline

A ResNet-50 teacher model is distilled into a ResNet-18 student using the KD strategies described below. Post-training quantization is then applied to the distilled ResNet-18 student and, for comparison, directly to the ResNet-50 teacher, yielding a quantized ResNet-18 student model and a quantized ResNet-50 teacher model.

Teacher model (ResNet-50): 50 layers, 25,557,032 parameters, 32-bit weights, 97.8 MB file size; higher accuracy, but more demanding in computational resources.
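
The teacher statistics above can be checked with a short script. This is a minimal sketch using the standard torchvision model definitions with the default 1000-class head (a CIFAR-specific head changes the counts slightly):

    import torchvision.models as models

    # ResNet-50 teacher and ResNet-18 student, default 1000-class classification head.
    teacher = models.resnet50()
    student = models.resnet18()

    for name, model in [("ResNet-50 teacher", teacher), ("ResNet-18 student", student)]:
        n_params = sum(p.numel() for p in model.parameters())
        mib = n_params * 4 / 2**20        # 4 bytes per parameter at 32-bit precision
        print(f"{name}: {n_params:,} parameters, ~{mib:.1f} MiB at 32 bits")

    # ResNet-50 reports 25,557,032 parameters, i.e. roughly 97.5 MiB, consistent with
    # the ~97.8 MB checkpoint size quoted above; ResNet-18 has ~11.7M parameters
    # (~44.6 MiB), roughly a 2.2x reduction before any quantization is applied.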

Method

We first assess several KD techniques and then apply a greedy path-following quantization algorithm to reduce the bit width of the distilled student model's weights. The following sections describe the knowledge distillation and quantization steps separately.

KD Strategies

  • Vanilla Knowledge Distillation (VKD):

    This strategy employs a hybrid loss that combines the cross-entropy on ground-truth labels (LC) with the Kullback-Leibler divergence from the teacher's temperature-softened outputs (DKL), where T is the softmax temperature. (A code sketch of this and the other distillation losses appears after this list.) The loss is:

    • L = (1 - α) · LC(y, ŷ) + α · T² · DKL(p ∥ q)

  • "Mixup" Method for Data Generation (Mixup):

    This strategy improves knowledge distillation performance by creating new training samples through the linear interpolation of image and label pairs. Given two samples (xi, yi) and (xj, yj), it generates:

    • x̃ = λxi + (1 - λ)xj
    • ỹ = λyi + (1 - λ)yj
  • Deep Mutual Learning (DML):

    This strategy trains two ResNet-18 student models with different initial weights: one pretrained on CIFAR-10/CIFAR-100 and the other initialized with ImageNet-derived weights. This setup promotes diverse learning outcomes and speeds up knowledge acquisition, particularly reducing training time for the CIFAR-pretrained model.

    During training, we optimized both models, Θ1 and Θ2, with the respective loss functions:

    • LΘ1 = LC1 + DKL(p2 ∥ p1)
    • LΘ2 = LC2 + DKL(p1 ∥ p2)
  • Decoupled Knowledge Distillation (DKD):

    This strategy divides the distillation loss into two key components:

    • Target Class Knowledge Distillation (TCKD): Focuses on aligning the student's predicted probability for the correct class with that of the teacher, ensuring accurate classification learning.
    • Non-Target Class Knowledge Distillation (NCKD): Aims to match the student's predicted probabilities for incorrect classes to the teacher's, enhancing overall probability distribution accuracy.

    The two components are weighted by hyperparameters α and β:
    • L = α · LTCKD(pcorrect, qcorrect) + β · LNCKD(pincorrect, qincorrect)
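
To make the four strategies concrete, the sketch below expresses each loss in PyTorch. It is a minimal illustration rather than our training code: student_logits and teacher_logits are assumed to be raw pre-softmax outputs, labels are integer class indices, and the values of alpha, beta, the temperature T, and the mixup Beta parameter are placeholders.

    import torch
    import torch.nn.functional as F

    def vkd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=4.0):
        """Vanilla KD: (1 - alpha) * cross-entropy + alpha * T^2 * KL(teacher || student)."""
        ce = F.cross_entropy(student_logits, labels)
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                      F.softmax(teacher_logits / T, dim=1),
                      reduction="batchmean")
        return (1 - alpha) * ce + alpha * (T ** 2) * kd

    def mixup(x, y_onehot, beta=1.0):
        """Mixup: convex combination of a batch with a shuffled copy of itself."""
        lam = torch.distributions.Beta(beta, beta).sample().item()
        idx = torch.randperm(x.size(0))
        x_mix = lam * x + (1 - lam) * x[idx]
        y_mix = lam * y_onehot + (1 - lam) * y_onehot[idx]   # y_onehot: float one-hot labels
        return x_mix, y_mix

    def dml_losses(logits1, logits2, labels):
        """Deep Mutual Learning: each student adds a KL term toward its (detached) peer."""
        log_p1 = F.log_softmax(logits1, dim=1)
        log_p2 = F.log_softmax(logits2, dim=1)
        l1 = F.cross_entropy(logits1, labels) + F.kl_div(log_p1, log_p2.detach().exp(), reduction="batchmean")
        l2 = F.cross_entropy(logits2, labels) + F.kl_div(log_p2, log_p1.detach().exp(), reduction="batchmean")
        return l1, l2

    def dkd_loss(student_logits, teacher_logits, labels, alpha=1.0, beta=8.0, T=4.0):
        """Decoupled KD: separate target-class (TCKD) and non-target-class (NCKD) terms."""
        gt = F.one_hot(labels, student_logits.size(1)).bool()
        ps = F.softmax(student_logits / T, dim=1)
        pt = F.softmax(teacher_logits / T, dim=1)
        # TCKD: KL between the binary (target vs. rest) probability distributions.
        s_bin = torch.stack([ps[gt], 1 - ps[gt]], dim=1)
        t_bin = torch.stack([pt[gt], 1 - pt[gt]], dim=1)
        tckd = F.kl_div(s_bin.log(), t_bin, reduction="batchmean")
        # NCKD: KL between distributions over the non-target classes only
        # (the target logit is masked out before the softmax).
        s_nt = F.log_softmax(student_logits / T - 1000.0 * gt, dim=1)
        t_nt = F.softmax(teacher_logits / T - 1000.0 * gt, dim=1)
        nckd = F.kl_div(s_nt, t_nt, reduction="batchmean")
        return (alpha * tckd + beta * nckd) * (T ** 2)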

Post-Training Quantization

  • GPFQ (Greedy Path-Following Quantization):

    This strategy, proposed by Zhang et al. (reference 7), quantizes each layer's weights sequentially: every weight is chosen greedily from a small quantization alphabet so that the layer's output on calibration data stays as close as possible to that of the unquantized layer, compressing the model while maintaining performance (a sketch follows).
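
Below is a minimal sketch of the greedy path-following rule for a single neuron of a fully connected layer, assuming a fixed uniform alphabet (the full method proceeds layer by layer, feeds each layer the quantized network's activations, and also handles convolutional layers):

    import torch

    def gpfq_neuron(w, X, alphabet):
        """Quantize one neuron's weights w (shape [N]) using calibration activations
        X (shape [num_samples, N]): choose each quantized weight greedily so that the
        accumulated output error over the calibration samples stays small."""
        q = torch.zeros_like(w)
        u = torch.zeros(X.size(0))                 # running output error
        for t in range(w.numel()):
            Xt = X[:, t]
            # Real-valued minimizer of || u + w_t * X_t - p * X_t ||_2 ...
            p_star = torch.dot(Xt, u + w[t] * Xt) / (torch.dot(Xt, Xt) + 1e-12)
            # ... rounded to the nearest element of the quantization alphabet.
            q[t] = alphabet[torch.argmin(torch.abs(alphabet - p_star))]
            u = u + w[t] * Xt - q[t] * Xt          # update the accumulated error
        return q

    # Example: a symmetric b-bit uniform alphabet scaled to the neuron's weight range.
    b = 4
    w = torch.randn(512)                           # one neuron's weights
    X = torch.randn(1024, 512)                     # calibration activations for this layer
    levels = 2 ** (b - 1) - 1
    step = w.abs().max() / levels
    alphabet = torch.arange(-levels, levels + 1, dtype=torch.float32) * step
    w_q = gpfq_neuron(w, X, alphabet)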

Experiment Results

Our experiments measure the test accuracy of the quantized student models, trained with the different knowledge distillation techniques, and of the quantized teacher model across a range of bit sizes (2, 3, 4, 5, 6, 7, 8, 12, 16, 20, 26, 32). The tables below report the 2–8 bit results and the full-precision 32-bit baselines (a sketch of the evaluation loop follows the tables).

CIFAR-10 (top-1 test accuracy, %)

Model           2-bit   3-bit   4-bit   5-bit   6-bit   7-bit   8-bit   32-bit
Teacher         90.90   91.29   91.55   91.60   91.88   91.88   92.00   92.25
VKD Student     71.87   82.40   86.47   88.78   88.89   89.36   89.69   90.77
Mixup Student   88.12   92.63   94.20   95.09   95.21   95.18   95.45   95.77
DML Student     82.39   86.98   90.45   91.84   92.00   92.43   92.54   92.89
DKD Student     61.06   69.69   77.69   81.76   83.45   84.16   84.39   89.95

CIFAR-100 (top-1 test accuracy, %)

Model           2-bit   3-bit   4-bit   5-bit   6-bit   7-bit   8-bit   32-bit
Teacher         66.63   74.17   75.06   75.84   75.73   75.78   75.80   76.43
VKD Student     26.23   47.65   59.75   65.93   68.70   69.85   70.54   75.33
Mixup Student   33.07   46.50   57.10   60.33   61.83   61.77   62.16   62.17
DML Student     19.22   37.06   52.96   62.94   65.98   69.89   70.87   75.12
DKD Student     22.20   33.11   44.78   49.76   52.20   52.76   52.85   58.01
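
For completeness, here is a sketch of the evaluation loop behind tables of this kind. The names student, train_loader, and test_loader refer to objects set up earlier, and quantize_model is a hypothetical placeholder for layer-wise GPFQ at a given bit width; none of these are part of a published API.

    import copy
    import torch

    @torch.no_grad()
    def accuracy(model, loader, device="cuda"):
        """Top-1 accuracy (%) of a model on a data loader."""
        model = model.to(device).eval()
        correct = total = 0
        for x, y in loader:
            pred = model(x.to(device)).argmax(dim=1)
            correct += (pred == y.to(device)).sum().item()
            total += y.size(0)
        return 100.0 * correct / total

    # quantize_model(model, bits, calibration_loader) is a placeholder for applying
    # GPFQ layer by layer at the given bit width (not shown here).
    results = {}
    for bits in [2, 3, 4, 5, 6, 7, 8, 32]:
        model_b = copy.deepcopy(student)
        if bits < 32:
            model_b = quantize_model(model_b, bits, calibration_loader=train_loader)
        results[bits] = accuracy(model_b, test_loader)
    print(results)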

Conclusion

CIFAR-100:
  • At 2–6 bits, every student loses accuracy more sharply than the teacher.
  • Distillation lets the students approach the teacher's full-precision accuracy on this harder dataset, but it leaves them less robust to aggressive quantization.
CIFAR-10:
  • At 2–4 bits, the students again lose accuracy more sharply than the teacher.
  • At higher bit widths both the teacher and the students are robust, and distillation helps each student stay close to its own pre-quantization accuracy.

Overall, the quantized teacher model outperforms the quantized student models, particularly at very small bit widths such as 2 bits. This advantage shrinks on the less complex dataset, where the Mixup student surpasses the teacher at 3 bits and above.

References

  1. Cristian Bucilǎ, Rich Caruana, and Alexandru Niculescu-Mizil, "Model compression," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2006, pp. 535–541. https://doi.org/10.1145/1150402.1150464
  2. Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean, "Distilling the Knowledge in a Neural Network," CoRR, vol. abs/1503.02531, 2015. http://arxiv.org/abs/1503.02531
  3. Adriana Romero et al., "FitNets: Hints for Thin Deep Nets," 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 2015. http://arxiv.org/abs/1412.6550
  4. Wonpyo Park et al., "Relational Knowledge Distillation," 2019. https://arxiv.org/abs/1904.05068
  5. Ahmed T. Elthakeb et al., "Divide and Conquer: Leveraging Intermediate Feature Representations for Quantized Training of Neural Networks," 2020. https://arxiv.org/abs/1906.06033
  6. Lingyu Gu et al., "“Lossless” Compression of Deep Neural Networks: A High-dimensional Neural Tangent Kernel Approach," 2024. https://arxiv.org/abs/2403.00258
  7. Jinjie Zhang et al., "Post-training Quantization for Neural Networks with Provable Guarantees," 2023. https://arxiv.org/abs/2201.11113
  8. Ying Zhang et al., "Deep Mutual Learning," 2017. https://arxiv.org/abs/1706.00384
  9. Lucas Beyer et al., "Knowledge distillation: A good teacher is patient and consistent," 2022. https://arxiv.org/abs/2106.05237
  10. Borui Zhao et al., "Decoupled Knowledge Distillation," 2022. https://arxiv.org/abs/2203.08679