Paper Title
SKDBERT: Compressing BERT via Stochastic Knowledge Distillation
Paper Authors
Paper Abstract
In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model dubbed SKDBERT. In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple teacher models with multi-level capacities, to transfer knowledge into the student model in a one-to-one manner. The sampling distribution plays an important role in SKD. We heuristically present three types of sampling distributions to assign appropriate probabilities to the multi-level teacher models. SKD has two advantages: 1) it preserves the diversity of the multi-level teacher models by stochastically sampling a single teacher model in each iteration, and 2) it improves the efficacy of knowledge distillation via multi-level teacher models when a large capacity gap exists between the teacher model and the student model. Experimental results on the GLUE benchmark show that SKDBERT reduces the size of a BERT$_{\rm BASE}$ model by 40% while retaining 99.5% of its language understanding performance and being 100% faster.
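To make the per-iteration teacher sampling concrete, below is a minimal sketch of one SKD-style training step in PyTorch. It assumes a standard soft-label distillation loss and illustrative names (`skd_step`, `probs`, `temperature`, `alpha`); the paper's actual loss formulation and the three sampling distributions are defined in the full text, not here.

```python
import random
import torch
import torch.nn.functional as F

def skd_step(student, teachers, probs, batch, optimizer,
             temperature=4.0, alpha=0.5):
    """One SKD-style iteration: sample a single teacher from the ensemble
    according to `probs`, then distill one-to-one into the student.
    All hyperparameter names here are illustrative assumptions."""
    inputs, labels = batch

    # Sample one teacher per iteration from the given sampling distribution.
    teacher = random.choices(teachers, weights=probs, k=1)[0]

    with torch.no_grad():
        t_logits = teacher(inputs)   # frozen teacher forward pass
    s_logits = student(inputs)       # student forward pass

    # Soft-label distillation loss: KL divergence on temperature-scaled logits.
    kd_loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Supervised task loss on ground-truth labels.
    ce_loss = F.cross_entropy(s_logits, labels)

    loss = alpha * kd_loss + (1.0 - alpha) * ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only one teacher is queried per iteration, the per-step cost matches single-teacher distillation while the ensemble's multi-level capacities are still covered over the course of training.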