Balancing Accuracy and Efficiency: A Comparative Study of Knowledge Distillation and Post-Training Quantization Sequences

UC San Diego - Halıcıoğlu Data Science Institute (HDSI)

Introduction

Deep neural networks (DNNs) have become the cornerstone of modern machine learning, but they often demand computational and storage resources that are difficult to provide in deployment settings. To address this gap, model compression has emerged as a vital area of research, aiming to reduce computational and storage overhead while preserving accuracy. Existing methods such as knowledge distillation (transferring knowledge from a large "teacher" model to a compact "student" model) and quantization (reducing the numerical precision of weights) have shown promise individually. However, the interplay between these techniques, particularly their combined impact on accuracy, storage efficiency, and inference speed, remains underexplored. In this project, we propose a hybrid compression strategy that systematically integrates knowledge distillation and post-training quantization to optimize the trade-off between model efficiency and performance.

Dataset

  • Datasets: CIFAR-10 (10 classes) and CIFAR-100 (100 classes)
  • Image Size: 32×32 pixels (color)
  • Number of Images: 60,000 per dataset (50,000 training images + 10,000 test images)

Reference: Alex Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," 2009.
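
As a point of reference, both benchmarks can be loaded with torchvision. The sketch below is illustrative rather than our exact preprocessing; the normalization constants are the commonly used CIFAR-10 channel statistics (an assumption, not a value taken from this work):

    import torchvision
    import torchvision.transforms as T
    from torch.utils.data import DataLoader

    # Commonly used CIFAR-10 channel statistics (assumed; swap in CIFAR-100 statistics as needed).
    transform = T.Compose([
        T.ToTensor(),
        T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])

    # 50,000 training images and 10,000 test images, each a 32x32 RGB image.
    train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=transform)
    test_set = torchvision.datasets.CIFAR10("./data", train=False, download=True, transform=transform)
    # For CIFAR-100, substitute torchvision.datasets.CIFAR100.

    train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
    test_loader = DataLoader(test_set, batch_size=256, shuffle=False, num_workers=2)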

Research Question

The primary question of this research is whether a hybrid compression strategy that integrates knowledge distillation with post-training quantization outperforms either technique applied alone.

Specific Question: Does quantizing a student model obtained through knowledge distillation yield a better accuracy-efficiency trade-off than quantizing the teacher model directly?

Pipeline

A ResNet-50 teacher model is distilled into a ResNet-18 student using the KD strategies described below. Post-training quantization is then applied to the distilled ResNet-18 student and, for comparison, directly to the ResNet-50 teacher, yielding a quantized ResNet-18 student model and a quantized ResNet-50 teacher model.

Teacher model (ResNet-50): 50 layers, 25,557,032 parameters, 32-bit weights, 97.8 MB file size; higher accuracy, but more demanding in computational resources.
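
The teacher statistics above can be checked with a short script. This is a minimal sketch using the standard torchvision model definitions with the default 1000-class head (a CIFAR-specific head changes the counts slightly):

    import torchvision.models as models

    # ResNet-50 teacher and ResNet-18 student, default 1000-class classification head.
    teacher = models.resnet50()
    student = models.resnet18()

    for name, model in [("ResNet-50 teacher", teacher), ("ResNet-18 student", student)]:
        n_params = sum(p.numel() for p in model.parameters())
        mib = n_params * 4 / 2**20        # 4 bytes per parameter at 32-bit precision
        print(f"{name}: {n_params:,} parameters, ~{mib:.1f} MiB at 32 bits")

    # ResNet-50 reports 25,557,032 parameters, i.e. roughly 97.5 MiB, consistent with
    # the ~97.8 MB checkpoint size quoted above; ResNet-18 has ~11.7M parameters
    # (~44.6 MiB), roughly a 2.2x reduction before any quantization is applied.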

Method

We first assess several KD techniques and then apply a greedy path-following quantization algorithm to reduce the bit width of the distilled student model's weights. The following sections describe the knowledge distillation and quantization steps separately.

KD Strategies

  • Vanilla Knowledge Distillation (VKD):

    This strategy employs a hybrid loss that combines the cross-entropy on ground-truth labels (LC) with the Kullback-Leibler divergence from the teacher's temperature-softened outputs (DKL), where T is the softmax temperature. (A code sketch of this and the other distillation losses appears after this list.) The loss is:

    • L = (1 - α) · LC(y, ŷ) + α · T² · DKL(p ∥ q)

  • "Mixup" Method for Data Generation (Mixup):

    This strategy improves knowledge distillation performance by creating new training samples through the linear interpolation of image and label pairs. Given two samples (xi, yi) and (xj, yj), it generates:

    • x̃ = λxi + (1 - λ)xj
    • ỹ = λyi + (1 - λ)yj
  • Deep Mutual Learning (DML):

    This strategy trains two ResNet-18 student models with different initial weights: one pretrained on CIFAR-10/CIFAR-100 and the other initialized with ImageNet-derived weights. This setup promotes diverse learning outcomes and speeds up knowledge acquisition, particularly reducing training time for the CIFAR-pretrained model.

    During training, we optimized both models, Θ1 and Θ2, with the respective loss functions:

    • LΘ1 = LC1 + DKL(p2 ∥ p1)
    • LΘ2 = LC2 + DKL(p1 ∥ p2)
  • Decoupled Knowledge Distillation (DKD):

    This strategy divides the distillation loss into two key components:

    • Target Class Knowledge Distillation (TCKD): Focuses on aligning the student's predicted probability for the correct class with that of the teacher, ensuring accurate classification learning.
    • Non-Target Class Knowledge Distillation (NCKD): Aims to match the student's predicted probabilities for incorrect classes to the teacher's, enhancing overall probability distribution accuracy.

    The two components are weighted by hyperparameters α and β:
    • L = α · LTCKD(pcorrect, qcorrect) + β · LNCKD(pincorrect, qincorrect)
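
To make the four strategies concrete, the sketch below expresses each loss in PyTorch. It is a minimal illustration rather than our training code: student_logits and teacher_logits are assumed to be raw pre-softmax outputs, labels are integer class indices, and the values of alpha, beta, the temperature T, and the mixup Beta parameter are placeholders.

    import torch
    import torch.nn.functional as F

    def vkd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=4.0):
        """Vanilla KD: (1 - alpha) * cross-entropy + alpha * T^2 * KL(teacher || student)."""
        ce = F.cross_entropy(student_logits, labels)
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                      F.softmax(teacher_logits / T, dim=1),
                      reduction="batchmean")
        return (1 - alpha) * ce + alpha * (T ** 2) * kd

    def mixup(x, y_onehot, beta=1.0):
        """Mixup: convex combination of a batch with a shuffled copy of itself."""
        lam = torch.distributions.Beta(beta, beta).sample().item()
        idx = torch.randperm(x.size(0))
        x_mix = lam * x + (1 - lam) * x[idx]
        y_mix = lam * y_onehot + (1 - lam) * y_onehot[idx]   # y_onehot: float one-hot labels
        return x_mix, y_mix

    def dml_losses(logits1, logits2, labels):
        """Deep Mutual Learning: each student adds a KL term toward its (detached) peer."""
        log_p1 = F.log_softmax(logits1, dim=1)
        log_p2 = F.log_softmax(logits2, dim=1)
        l1 = F.cross_entropy(logits1, labels) + F.kl_div(log_p1, log_p2.detach().exp(), reduction="batchmean")
        l2 = F.cross_entropy(logits2, labels) + F.kl_div(log_p2, log_p1.detach().exp(), reduction="batchmean")
        return l1, l2

    def dkd_loss(student_logits, teacher_logits, labels, alpha=1.0, beta=8.0, T=4.0):
        """Decoupled KD: separate target-class (TCKD) and non-target-class (NCKD) terms."""
        gt = F.one_hot(labels, student_logits.size(1)).bool()
        ps = F.softmax(student_logits / T, dim=1)
        pt = F.softmax(teacher_logits / T, dim=1)
        # TCKD: KL between the binary (target vs. rest) probability distributions.
        s_bin = torch.stack([ps[gt], 1 - ps[gt]], dim=1)
        t_bin = torch.stack([pt[gt], 1 - pt[gt]], dim=1)
        tckd = F.kl_div(s_bin.log(), t_bin, reduction="batchmean")
        # NCKD: KL between distributions over the non-target classes only
        # (the target logit is masked out before the softmax).
        s_nt = F.log_softmax(student_logits / T - 1000.0 * gt, dim=1)
        t_nt = F.softmax(teacher_logits / T - 1000.0 * gt, dim=1)
        nckd = F.kl_div(s_nt, t_nt, reduction="batchmean")
        return (alpha * tckd + beta * nckd) * (T ** 2)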

Post-Training Quantization

  • GPFQ (Greedy Path-Following Quantization):

    This strategy, proposed by Zhang et al. (reference 7), quantizes each layer's weights sequentially: every weight is chosen greedily from a small quantization alphabet so that the layer's output on calibration data stays as close as possible to that of the unquantized layer, compressing the model while maintaining performance (a sketch follows).
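
Below is a minimal sketch of the greedy path-following rule for a single neuron of a fully connected layer, assuming a fixed uniform alphabet (the full method proceeds layer by layer, feeds each layer the quantized network's activations, and also handles convolutional layers):

    import torch

    def gpfq_neuron(w, X, alphabet):
        """Quantize one neuron's weights w (shape [N]) using calibration activations
        X (shape [num_samples, N]): choose each quantized weight greedily so that the
        accumulated output error over the calibration samples stays small."""
        q = torch.zeros_like(w)
        u = torch.zeros(X.size(0))                 # running output error
        for t in range(w.numel()):
            Xt = X[:, t]
            # Real-valued minimizer of || u + w_t * X_t - p * X_t ||_2 ...
            p_star = torch.dot(Xt, u + w[t] * Xt) / (torch.dot(Xt, Xt) + 1e-12)
            # ... rounded to the nearest element of the quantization alphabet.
            q[t] = alphabet[torch.argmin(torch.abs(alphabet - p_star))]
            u = u + w[t] * Xt - q[t] * Xt          # update the accumulated error
        return q

    # Example: a symmetric b-bit uniform alphabet scaled to the neuron's weight range.
    b = 4
    w = torch.randn(512)                           # one neuron's weights
    X = torch.randn(1024, 512)                     # calibration activations for this layer
    levels = 2 ** (b - 1) - 1
    step = w.abs().max() / levels
    alphabet = torch.arange(-levels, levels + 1, dtype=torch.float32) * step
    w_q = gpfq_neuron(w, X, alphabet)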

Experiment Results

Our experiments measure the test accuracy of the quantized student models, trained with the different knowledge distillation techniques, and of the quantized teacher model across a range of bit sizes (2, 3, 4, 5, 6, 7, 8, 12, 16, 20, 26, 32). The tables below report the 2–8 bit results and the full-precision 32-bit baselines (a sketch of the evaluation loop follows the tables).

CIFAR-10 (top-1 test accuracy, %)

Model           2-bit   3-bit   4-bit   5-bit   6-bit   7-bit   8-bit   32-bit
Teacher         90.90   91.29   91.55   91.60   91.88   91.88   92.00   92.25
VKD Student     71.87   82.40   86.47   88.78   88.89   89.36   89.69   90.77
Mixup Student   88.12   92.63   94.20   95.09   95.21   95.18   95.45   95.77
DML Student     82.39   86.98   90.45   91.84   92.00   92.43   92.54   92.89
DKD Student     61.06   69.69   77.69   81.76   83.45   84.16   84.39   89.95

CIFAR-100 (top-1 test accuracy, %)

Model           2-bit   3-bit   4-bit   5-bit   6-bit   7-bit   8-bit   32-bit
Teacher         66.63   74.17   75.06   75.84   75.73   75.78   75.80   76.43
VKD Student     26.23   47.65   59.75   65.93   68.70   69.85   70.54   75.33
Mixup Student   33.07   46.50   57.10   60.33   61.83   61.77   62.16   62.17
DML Student     19.22   37.06   52.96   62.94   65.98   69.89   70.87   75.12
DKD Student     22.20   33.11   44.78   49.76   52.20   52.76   52.85   58.01
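
For completeness, here is a sketch of the evaluation loop behind tables of this kind. The names student, train_loader, and test_loader refer to objects set up earlier, and quantize_model is a hypothetical placeholder for layer-wise GPFQ at a given bit width; none of these are part of a published API.

    import copy
    import torch

    @torch.no_grad()
    def accuracy(model, loader, device="cuda"):
        """Top-1 accuracy (%) of a model on a data loader."""
        model = model.to(device).eval()
        correct = total = 0
        for x, y in loader:
            pred = model(x.to(device)).argmax(dim=1)
            correct += (pred == y.to(device)).sum().item()
            total += y.size(0)
        return 100.0 * correct / total

    # quantize_model(model, bits, calibration_loader) is a placeholder for applying
    # GPFQ layer by layer at the given bit width (not shown here).
    results = {}
    for bits in [2, 3, 4, 5, 6, 7, 8, 32]:
        model_b = copy.deepcopy(student)
        if bits < 32:
            model_b = quantize_model(model_b, bits, calibration_loader=train_loader)
        results[bits] = accuracy(model_b, test_loader)
    print(results)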

Conclusion

CIFAR-100:
  • At 2–6 bits, every student loses accuracy more sharply than the teacher.
  • Distillation lets the students approach the teacher's full-precision accuracy on this harder dataset, but it leaves them less robust to aggressive quantization.
CIFAR-10:
  • At 2–4 bits, the students again lose accuracy more sharply than the teacher.
  • At higher bit widths both the teacher and the students are robust, and distillation helps each student stay close to its own pre-quantization accuracy.

Overall, the quantized teacher model outperforms the quantized student models, particularly at very small bit widths such as 2 bits. This advantage shrinks on the less complex dataset, where the Mixup student surpasses the teacher at 3 bits and above.

References

  1. Cristian Bucilǎ, Rich Caruana, and Alexandru Niculescu-Mizil, "Model compression," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2006, pp. 535–541. https://doi.org/10.1145/1150402.1150464
  2. Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean, "Distilling the Knowledge in a Neural Network," CoRR, vol. abs/1503.02531, 2015. http://arxiv.org/abs/1503.02531
  3. Adriana Romero et al., "FitNets: Hints for Thin Deep Nets," 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 2015. http://arxiv.org/abs/1412.6550
  4. Wonpyo Park et al., "Relational Knowledge Distillation," 2019. https://arxiv.org/abs/1904.05068
  5. Ahmed T. Elthakeb et al., "Divide and Conquer: Leveraging Intermediate Feature Representations for Quantized Training of Neural Networks," 2020. https://arxiv.org/abs/1906.06033
  6. Lingyu Gu et al., "“Lossless” Compression of Deep Neural Networks: A High-dimensional Neural Tangent Kernel Approach," 2024. https://arxiv.org/abs/2403.00258
  7. Jinjie Zhang et al., "Post-training Quantization for Neural Networks with Provable Guarantees," 2023. https://arxiv.org/abs/2201.11113
  8. Ying Zhang et al., "Deep Mutual Learning," 2017. https://arxiv.org/abs/1706.00384
  9. Lucas Beyer et al., "Knowledge distillation: A good teacher is patient and consistent," 2022. https://arxiv.org/abs/2106.05237
  10. Borui Zhao et al., "Decoupled Knowledge Distillation," 2022. https://arxiv.org/abs/2203.08679