Optimizing Neural Networks: A Deep Dive into Knowledge Distillation
Introduction:
Deep neural networks often demand substantial computing resources and storage, which edge devices typically lack. There are various methods available to make neural networks more efficient for deployment on such devices, including:
- Quantization: Reducing 32-bit weights to 8 bits or lower.
- Parameter Pruning: Eliminating redundant connections that don't significantly affect output.
- Knowledge Distillation: This technique involves transferring the knowledge from a larger teacher network to a smaller student network, representing a highly effective optimization strategy currently under extensive research.
For further details, refer to my article on Quantization at https://readmedium.com/network-optimization-with-quantization-8-bit-vs-1-bit-af2fd716fcae.
In this article, we will delve into the concept of Knowledge Distillation and explore how it can be utilized to enhance neural network performance for edge devices.
What is Knowledge Distillation:
The concept of Knowledge Distillation was introduced by Geoffrey Hinton and his team in their paper available at https://arxiv.org/abs/1503.02531. The core premise of this approach is as follows:
- Large networks, being deep and wide, excel at complex computer vision tasks, but training them on large datasets requires powerful hardware with GPUs.
- However, once the problem is understood and solved using a large network, the inference phase does not necessitate such a large model.
- The goal, therefore, is to transfer the knowledge from the extensive teacher network to a more compact student network through training.
Paper Analysis:
This foundational paper discusses two crucial aspects:
Modified Softmax Function with Temperature:
When the logits are passed through the standard softmax function, almost all of the probability mass is assigned to the most likely class (for example, "cat"). Hinton proposed adding a temperature T to the softmax:
q_i = exp(z_i / T) / Σ_j exp(z_j / T)
where z_i are the logits and T = 1 recovers the standard softmax.
The rationale behind introducing the temperature (T) into the softmax function can be illustrated through a simple example.
import numpy as np

logits = np.array([1., 2., 3.])
logits_exp = np.exp(logits)
print("logits_exp: {}".format(logits_exp))

# Softmax with temperature: exp(logits / T) normalized to sum to 1
T = [1., 5., 7., 10.]
for t in T:
    logits_exp_normalized = np.exp(logits / t) / np.sum(np.exp(logits / t))
    print("temperature: {} : logits_exp_normalized: {}".format(t, logits_exp_normalized))
The output shows how varying the temperature changes the distribution of probabilities across the outputs. With logits [1, 2, 3]:
- For T=1 (standard softmax), most of the probability mass goes to the most likely class: roughly [0.09, 0.24, 0.67].
- Higher temperatures yield a more uniform distribution; at T=10 the probabilities are roughly [0.30, 0.33, 0.37].
This shows that a temperature greater than one softens the output distribution, retaining relative information about all target classes that the standard softmax would largely suppress.
Hinton noted that "matching logits is a special case of distillation" and used this technique for his initial experiments.
Soft-Labels and KL Divergence Loss:
By passing the teacher's logits through the temperature-scaled softmax, we obtain soft labels, which carry information about every class rather than only the predicted one. The student is trained to match these soft labels using the KL (Kullback-Leibler) divergence as the distillation loss, usually combined with the standard cross-entropy loss on the ground-truth labels; a sketch follows below.
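As a rough illustration (my own sketch, not code from the paper), the combined loss can be written as follows, assuming PyTorch; the temperature T and the weighting factor alpha are hyperparameters chosen here only for illustration:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft targets: KL divergence between the temperature-softened
    # student and teacher distributions, scaled by T*T as suggested in Hinton's paper
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy with the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss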
Drawbacks:
While the student network operates significantly faster than the teacher network, there exists a gap in accuracy between the two. This discrepancy arises because the process primarily considers the output layer of both networks, neglecting the knowledge contained in intermediate layers that convey feature information.
To address this, another paper by Romero et al. can be referenced at https://arxiv.org/pdf/1412.6550, which proposes the following:
- By focusing solely on the output layers, the intermediary layers of the teacher network are disregarded. If the intermediate layer information could be conveyed to the student network, it could further close the accuracy gap.
- The proposed method involves selecting a middle layer from the teacher, termed the Hint Layer, and transferring this knowledge to the student's Guided Layer via a small convolution layer.
This research indicated that the transfer of information from both the output and intermediary layers from teacher to student significantly enhances the performance of the student network in terms of both accuracy and speed.
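To make the hint/guided-layer idea concrete, here is a minimal sketch, assuming PyTorch; the class name HintRegressor is mine, and I assume the teacher and student feature maps already share the same spatial size:

import torch.nn as nn
import torch.nn.functional as F

class HintRegressor(nn.Module):
    # A 1x1 convolution maps the student's guided-layer features to the
    # teacher's hint-layer channel count so the two can be compared directly.
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # L2 distance between the teacher's hint and the regressed student features;
        # the teacher is detached because only the student (and regressor) are trained.
        return F.mse_loss(self.regressor(student_feat), teacher_feat.detach())

In FitNets this hint loss is minimized in a first stage; the regular distillation loss is then applied to the whole student in a second stage.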
These two foundational papers (https://arxiv.org/abs/1503.02531 and https://arxiv.org/pdf/1412.6550) provide essential insights into Knowledge Distillation and set the stage for further investigations in this domain. Understanding these works is vital for anyone exploring the field of Knowledge Distillation.
Further knowledge distillation methods that emerged after these publications have demonstrated impressive performance gains.
Transfer Learning vs Transfer Knowledge (Knowledge Distillation):
- Transfer Learning: Consider a Model-A trained on a dataset containing cats and dogs (two classes). This model can be adapted to classify additional classes, such as bears and horses, by:
  - Keeping the same backbone while replacing the head with new layers for the four classes.
  - Training only the new layers at first so the model accommodates the expanded class set (the backbone can optionally be fine-tuned afterwards); see the sketch after this list.
- Transfer Knowledge (Knowledge Distillation): This process is distinct from transfer learning. Key differences include:
  - If the teacher network is trained only on two classes (cat and dog), the student network is likewise limited to those two classes and cannot be extended to four classes as in transfer learning.
  - The teacher and student architectures are fundamentally different, so their weights cannot be shared directly.
  - Dedicated training strategies are therefore required to transfer knowledge from one network to the other.
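For the transfer-learning side of this comparison, a minimal sketch (assuming PyTorch/torchvision, with an ImageNet-pretrained ResNet-18 standing in for the hypothetical Model-A) could look like this:

import torch.nn as nn
from torchvision import models

# Load a pretrained backbone (a stand-in for the hypothetical cat/dog Model-A)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the backbone so only the new head is trained at first
for param in model.parameters():
    param.requires_grad = False

# Replace the head with a new layer for the expanded four-class problem
model.fc = nn.Linear(model.fc.in_features, 4)

# (Optionally unfreeze and fine-tune the backbone once the new head has converged.)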
Knowledge Distillation In Detail:
The diagram above provides a comprehensive illustration of Knowledge Distillation.
Let's examine the various types of knowledge involved:
Types of Knowledge:
In response-based knowledge distillation, only the output layer of the teacher is used. On its own this is the least effective form of knowledge, but combined with other forms it can still improve performance.
Feature-based distillation uses intermediate feature maps and is significantly more effective. Relation-based distillation captures relationships, either between feature maps from different layers or between samples in a batch, and transfers that relational structure to the student; a sketch of one such loss follows below.
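As a rough illustration of relation-based knowledge, here is a sketch of the similarity-preserving loss [SP], assuming PyTorch feature maps of shape (B, C, H, W); the function name is mine:

import torch
import torch.nn.functional as F

def similarity_preserving_loss(teacher_feat, student_feat):
    # Build a BxB similarity matrix for the batch from each network's features;
    # the channel counts of teacher and student may differ.
    b = teacher_feat.size(0)
    g_t = torch.mm(teacher_feat.view(b, -1), teacher_feat.view(b, -1).t())
    g_s = torch.mm(student_feat.view(b, -1), student_feat.view(b, -1).t())
    # Row-wise L2 normalization, then the mean squared Frobenius distance
    g_t = F.normalize(g_t, p=2, dim=1)
    g_s = F.normalize(g_s, p=2, dim=1)
    return (g_t - g_s).pow(2).sum() / (b * b)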
How Teacher and Student Networks Can Be Trained (Distillation Scheme):
The survey literature distinguishes three main schemes: offline distillation, where a pre-trained teacher is frozen and only the student is trained; online distillation, where the teacher and student are trained together; and self-distillation, where a network acts as its own teacher.
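As a rough sketch of the most common scheme, offline distillation, a single training step could look like this (assuming PyTorch, arbitrary teacher/student classifiers, and the distillation_loss function sketched earlier):

import torch

def train_step(teacher, student, optimizer, images, labels, T=4.0, alpha=0.9):
    teacher.eval()                    # offline scheme: the teacher is pre-trained and frozen
    with torch.no_grad():
        teacher_logits = teacher(images)
    student_logits = student(images)  # only the student receives gradient updates
    loss = distillation_loss(student_logits, teacher_logits, labels, T, alpha)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()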
How to Transfer Knowledge from Teacher to Student Network (Distillation Algorithms):
For more information on distillation algorithms, refer to https://arxiv.org/abs/2006.05525.
Implementation:
I have implemented the aforementioned scenario, and the code is available at https://github.com/satya15july/knowledge_distillation.
Evaluation:
Here is the evaluation data:
As illustrated in the data, the student model operates 12 times faster than the teacher model, and following knowledge distillation, the accuracy of the student model improved by 4%. Nonetheless, it is essential to note that the student model cannot achieve the same accuracy as the teacher model (59%).
The data indicates that the student model outperforms the larger teacher model when employing knowledge distillation methods like AT, AB, and SP. However, the student model's performance in semantic segmentation does not match that of the teacher model. It is crucial to recognize that these methods were primarily developed for classification tasks, not semantic segmentation, which explains the performance gap.
This brings us to task-specific knowledge distillation.
Task-Specific Knowledge Distillation:
All methods of knowledge distillation illustrated in Fig. 7 were applied to classification problems.
The question arises: can these methodologies be utilized for other tasks, such as object detection or semantic segmentation? The answer is affirmative; however, an accuracy gap between the teacher and student will likely persist.
To bridge this accuracy gap, it is necessary to understand the task-specific knowledge present in the teacher network and transfer it effectively to the student network.
A detailed discussion of how to do this for semantic segmentation exceeds the scope of this article; I will address it in a future article to keep this one brief.
Conclusion:
This article aimed to provide an overview of Knowledge Distillation, beginning with the foundational works by Hinton and Romero, and then exploring how distillation techniques can be refined for task-specific challenges like object detection and semantic segmentation.
I hope this article aids in your understanding of Knowledge Distillation. Please consider subscribing to my Medium channel for more insights. Thank you for reading.
References:
- [KD]: Knowledge Distillation (https://arxiv.org/abs/1503.02531)
- [KD Survey]: Knowledge Distillation: A Survey (https://arxiv.org/pdf/2006.05525)
- [FitNet]: FitNets: Hints for Thin Deep Nets (https://arxiv.org/pdf/1412.6550.pdf)
- [CC]: Correlation Congruence (https://arxiv.org/abs/1904.01802)
- [SP]: Similarity Preserving (https://arxiv.org/pdf/1907.09682.pdf)
- [AB]: Activation Boundary (https://arxiv.org/pdf/1811.03233.pdf)
- [FT]: Factor Transfer (https://arxiv.org/pdf/1802.04977.pdf)
- [AT]: Attention Transfer (https://arxiv.org/pdf/1612.03928.pdf)
- [TDA]: Distillation through Attention (DeiT) (https://arxiv.org/pdf/2012.12877.pdf)
Reach me at:
- LinkedIn: www.linkedin.com/in/satya1507
- Email: [email protected]