Paper Title
Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation
Paper Authors
Paper Abstract
Knowledge distillation (KD) is a common approach to improving model performance in automatic speech recognition (ASR), where a student model is trained to imitate the output behaviour of a teacher model. However, traditional KD methods suffer from a teacher label storage issue, especially when the training corpora are large. Although on-the-fly teacher label generation tackles this issue, it slows training significantly because the teacher model has to be evaluated for every batch. In this paper, we reformulate the generation of teacher labels as a codec problem. We propose a novel Multi-codebook Vector Quantization (MVQ) approach that compresses teacher embeddings into codebook indexes (CI). Based on this, a KD training framework (MVQ-KD) is proposed in which a student model predicts the CI generated from the embeddings of a self-supervised pre-trained teacher model. Experiments on the LibriSpeech clean-100 hour subset show that the MVQ-KD framework achieves performance comparable to traditional KD methods (l1, l2) while requiring 256 times less storage. When the full LibriSpeech dataset is used, the MVQ-KD framework yields 13.8% and 8.2% relative word error rate reductions (WERRs) for a non-streaming transducer on test-clean and test-other, and 4.0% and 4.9% for a streaming transducer. The implementation of this work has been released as part of the open-source project icefall.
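The following is a minimal, hedged sketch of the idea described in the abstract, not the paper's exact MVQ algorithm or the icefall implementation: a residual-style multi-codebook quantizer that maps teacher embeddings to small integer codebook indexes (so they can be pre-computed once and stored cheaply), and a student-side prediction head trained with cross-entropy on those indexes. All class names, dimensions, and the quantization scheme itself are illustrative assumptions.

```python
# Illustrative sketch only: residual multi-codebook quantization of teacher
# embeddings into codebook indexes (CI), plus a student head that predicts
# those indexes. Shapes and sizes are assumptions, not the paper's settings.
import torch
import torch.nn as nn


class MultiCodebookQuantizer(nn.Module):
    """Encodes D-dim embeddings into `num_codebooks` integer indexes per frame."""

    def __init__(self, dim: int, num_codebooks: int = 8, codebook_size: int = 256):
        super().__init__()
        self.codebooks = nn.Parameter(
            torch.randn(num_codebooks, codebook_size, dim) * 0.1
        )

    @torch.no_grad()
    def encode(self, x: torch.Tensor) -> torch.Tensor:
        """x: (T, D) teacher embeddings -> (T, num_codebooks) int64 indexes."""
        residual = x
        indexes = []
        for cb in self.codebooks:                      # cb: (codebook_size, D)
            dists = torch.cdist(residual, cb)          # (T, codebook_size)
            idx = dists.argmin(dim=-1)                 # nearest entry per frame
            indexes.append(idx)
            residual = residual - cb[idx]              # quantize the residual next
        return torch.stack(indexes, dim=-1)


class CodebookIndexPredictor(nn.Module):
    """Student-side head: predicts each codebook index from the student encoder output."""

    def __init__(self, student_dim: int, num_codebooks: int = 8, codebook_size: int = 256):
        super().__init__()
        self.num_codebooks = num_codebooks
        self.codebook_size = codebook_size
        self.proj = nn.Linear(student_dim, num_codebooks * codebook_size)

    def forward(self, student_out: torch.Tensor, target_ci: torch.Tensor) -> torch.Tensor:
        """student_out: (T, student_dim); target_ci: (T, num_codebooks) -> scalar loss."""
        logits = self.proj(student_out).view(-1, self.num_codebooks, self.codebook_size)
        return nn.functional.cross_entropy(
            logits.reshape(-1, self.codebook_size), target_ci.reshape(-1)
        )


if __name__ == "__main__":
    T, teacher_dim, student_dim = 100, 768, 512         # assumed dimensions
    quantizer = MultiCodebookQuantizer(teacher_dim)
    ci = quantizer.encode(torch.randn(T, teacher_dim))   # computed once, stored instead of raw embeddings
    head = CodebookIndexPredictor(student_dim)
    loss = head(torch.randn(T, student_dim), ci)         # auxiliary KD loss added to the ASR loss
    print(ci.shape, loss.item())
```

Because only the small integer index tensor needs to be kept per utterance, the teacher model is never run during student training, which is the storage and speed trade-off the abstract contrasts with raw-embedding KD and on-the-fly label generation.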