Rethinking Knowledge Distillation via Cross-Entropy

Paper Title

Rethinking Knowledge Distillation via Cross-Entropy

Authors

Zhendong Yang, Zhe Li, Yuan Gong, Tianke Zhang, Shanshan Lao, Chun Yuan, Yu Li

Abstract

Knowledge Distillation (KD) has been developed extensively and has improved performance on various tasks. The classical KD method adds the KD loss to the original cross-entropy (CE) loss. We decompose the KD loss to explore its relation to the CE loss. Surprisingly, we find it can be regarded as a combination of the CE loss and an extra loss that has the same form as the CE loss. However, we notice that the extra loss forces the student's relative probability to learn the teacher's absolute probability. Moreover, the two probabilities sum to different values, which makes the loss hard to optimize. To address this issue, we revise the formulation and propose a distributed loss. In addition, we use the teacher's target output as the soft target, proposing the soft loss. Combining the soft loss and the distributed loss, we propose a new KD loss (NKD). Furthermore, we smooth the student's target output and treat it as the soft target for training without a teacher, proposing a teacher-free new KD loss (tf-NKD). Our method achieves state-of-the-art performance on CIFAR-100 and ImageNet. For example, with ResNet-34 as the teacher, we boost the ImageNet Top-1 accuracy of ResNet-18 from 69.90% to 71.96%. In training without a teacher, MobileNet, ResNet-18, and SwinTransformer-Tiny achieve 70.04%, 70.76%, and 81.48%, which are 0.83%, 0.86%, and 0.30% higher than their baselines, respectively. The code is available at https://github.com/yzd-v/cls_KD.
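
The mismatch the abstract describes can be checked numerically. The sketch below is one reading of that argument, not the authors' code. It assumes the KD term is written in its cross-entropy form (teacher probabilities as soft labels), omits the temperature, and uses hypothetical function names and made-up logits. It splits the KD term into a target-class part and a non-target part, then compares the teacher's absolute non-target probabilities with the student's non-target probabilities renormalized to exclude the target class. It does not implement NKD or tf-NKD, whose exact formulas are not given here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(teacher_probs, student_probs):
    """KD term in cross-entropy form: -sum_i T_i * log(S_i)."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def split_by_target(teacher_probs, student_probs, target):
    """Split the KD term into target-class and non-target-class parts."""
    target_term = -teacher_probs[target] * math.log(student_probs[target])
    non_target_term = -sum(
        t * math.log(s)
        for i, (t, s) in enumerate(zip(teacher_probs, student_probs))
        if i != target
    )
    return target_term, non_target_term


def non_target_mass(teacher_probs, student_probs, target):
    """Total mass of the two distributions compared on non-target classes.

    Teacher: absolute probabilities T_i, which sum to 1 - T_target.
    Student: relative probabilities S_i / (1 - S_target), which sum to 1.
    """
    teacher_mass = sum(t for i, t in enumerate(teacher_probs) if i != target)
    rest = 1.0 - student_probs[target]
    student_mass = sum(
        s / rest for i, s in enumerate(student_probs) if i != target
    )
    return teacher_mass, student_mass


# Hypothetical logits for a 3-class problem; class 0 is the ground truth.
teacher = softmax([4.0, 1.0, 0.5])
student = softmax([2.0, 1.5, 0.2])
target = 0

kd_term = soft_cross_entropy(teacher, student)
target_term, non_target_term = split_by_target(teacher, student, target)
teacher_mass, student_mass = non_target_mass(teacher, student, target)

print(f"KD term: {kd_term:.4f}")
print(f"target part + non-target part: {target_term + non_target_term:.4f}")
print(f"teacher non-target mass: {teacher_mass:.4f}")
print(f"student relative non-target mass: {student_mass:.4f}")
```

The two parts add up to the full KD term, while the non-target masses differ: the student's relative probabilities sum to 1, but the teacher's absolute probabilities sum to only one minus the teacher's target probability. This is the sense in which the two probabilities "sum to different values".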
