Paper Title
Understanding and Improving Knowledge Distillation for Quantization-Aware Training of Large Transformer Encoders
Paper Authors
Paper Abstract
Knowledge distillation (KD) has been a ubiquitous method for model compression, strengthening the capability of a lightweight model with knowledge transferred from a teacher. In particular, KD has been employed in quantization-aware training (QAT) of Transformer encoders like BERT to improve the accuracy of the student model with reduced-precision weight parameters. However, little is understood about which of the various KD approaches best fits the QAT of Transformers. In this work, we provide an in-depth analysis of the mechanism of KD on attention recovery of quantized large Transformers. In particular, we reveal that the previously adopted MSE loss on the attention score is insufficient for recovering the self-attention information. Therefore, we propose two KD methods: attention-map and attention-output losses. Furthermore, we explore the unification of both losses to address the task-dependent preference between attention-map and attention-output losses. The experimental results on various Transformer encoder models demonstrate that the proposed KD methods achieve state-of-the-art accuracy for QAT with sub-2-bit weight quantization.
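The abstract does not spell out the exact loss formulations, but a minimal PyTorch sketch of the three distillation targets it contrasts might look like the following: MSE on the pre-softmax attention scores (the previously adopted loss), a loss on the post-softmax attention map, and a loss on the attention output. The specific choices here (KL divergence for the map loss, MSE for the output loss) and the tensor shapes are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def attention_score_kd_loss(student_scores, teacher_scores):
    # MSE on raw (pre-softmax) attention scores, the loss the paper
    # argues is insufficient for recovering self-attention information.
    # Assumed shape: [batch, heads, seq_len, seq_len].
    return F.mse_loss(student_scores, teacher_scores)

def attention_map_kd_loss(student_scores, teacher_scores):
    # Loss on the post-softmax attention map; KL divergence is one
    # plausible instantiation of an "attention-map loss" (assumption).
    student_log_map = F.log_softmax(student_scores, dim=-1)
    teacher_map = F.softmax(teacher_scores, dim=-1)
    return F.kl_div(student_log_map, teacher_map, reduction="batchmean")

def attention_output_kd_loss(student_output, teacher_output):
    # Loss on the output of the self-attention layer (after attending
    # to the values); MSE is an assumed instantiation of the
    # "attention-output loss". Assumed shape: [batch, seq_len, hidden].
    return F.mse_loss(student_output, teacher_output)
```

In a QAT setup, such terms would typically be summed (possibly with task-dependent weights, echoing the unified loss the abstract mentions) alongside the task loss, with the teacher being the full-precision model and the student its quantized counterpart.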