Paper Title
Improved Noisy Student Training for Automatic Speech Recognition
Paper Authors
Paper Abstract
Recently, a semi-supervised learning method known as "noisy student training" has been shown to significantly improve the image classification performance of deep networks. Noisy student training is an iterative self-training method that leverages augmentation to improve network performance. In this work, we adapt and improve noisy student training for automatic speech recognition, employing (adaptive) SpecAugment as the augmentation method. We find effective methods to filter, balance, and augment the data generated between self-training iterations. By doing so, we are able to obtain word error rates (WERs) of 4.2%/8.6% on the clean/noisy LibriSpeech test sets using only the clean 100h subset of LibriSpeech as the supervised set and the rest (860h) as the unlabeled set. Furthermore, we are able to achieve WERs of 1.7%/3.4% on the clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h (4.74%/12.20%) and LibriSpeech (1.9%/4.1%).
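To make the iterative self-training loop concrete, below is a minimal Python sketch of the teacher-student cycle the abstract describes. It is a toy illustration under stated assumptions, not the paper's implementation: the Utterance container and the train_asr, spec_augment, and confidence-filtering functions are hypothetical stand-ins for a real ASR trainer, (adaptive) SpecAugment, and the paper's filtering and balancing schemes.

```python
# Toy sketch of the noisy student training (NST) loop. Every function body
# here is a hypothetical stand-in, not the paper's actual model or code.

import random
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    audio: List[float]   # stand-in for a waveform / feature matrix
    text: str = ""       # empty until (pseudo-)labeled
    score: float = 0.0   # teacher confidence for the generated transcript


def train_asr(data: List[Utterance]):
    """Hypothetical trainer: returns a 'model' mapping audio to a
    (transcript, log-confidence) pair."""
    table = {tuple(u.audio): u.text for u in data}

    def model(audio: List[float]):
        text = table.get(tuple(audio), "pseudo transcript")
        return text, random.uniform(-1.0, 0.0)

    return model


def spec_augment(u: Utterance) -> Utterance:
    """Toy stand-in for (adaptive) SpecAugment: zero out one frame,
    mimicking a single time mask."""
    audio = list(u.audio)
    if audio:
        audio[random.randrange(len(audio))] = 0.0
    return Utterance(audio, u.text, u.score)


def noisy_student_training(labeled, unlabeled, generations=4, threshold=-0.5):
    teacher = train_asr(labeled)  # generation 0: supervised data only
    for _ in range(generations):
        # 1. Teacher transcribes the unlabeled pool.
        pseudo = []
        for u in unlabeled:
            text, score = teacher(u.audio)
            pseudo.append(Utterance(u.audio, text, score))
        # 2. Filter low-confidence transcripts; the paper also balances
        #    the pseudo-labeled set against the supervised distribution.
        pseudo = [u for u in pseudo if u.score >= threshold]
        # 3. Train the next student on augmented real + pseudo-labeled
        #    data; the student then becomes the next teacher.
        mixed = [spec_augment(u) for u in labeled + pseudo]
        teacher = train_asr(mixed)
    return teacher


if __name__ == "__main__":
    labeled = [Utterance([1.0, 2.0], "hello world")]
    unlabeled = [Utterance([3.0, 4.0]), Utterance([5.0, 6.0])]
    model = noisy_student_training(labeled, unlabeled, generations=2)
    print(model([3.0, 4.0]))
```

The point the abstract emphasizes, filtering and balancing the generated data between self-training iterations, corresponds to step 2 of the loop above; step 3 injects SpecAugment noise so each student learns from a harder version of its teacher's labels.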