Paper Title
Learning a Dual-Mode Speech Recognition Model via Self-Pruning
Paper Authors
Paper Abstract
There is growing interest in unifying streaming and full-context automatic speech recognition (ASR) networks into a single end-to-end ASR model, to simplify model training and deployment for both use cases. In real-world ASR applications, however, streaming ASR models typically operate under tighter storage and computational constraints, e.g., on embedded devices, than server-side full-context models. Motivated by recent progress in Omni-sparsity supernet training, where multiple subnetworks are jointly optimized in a single model, this work aims to jointly learn a compact sparse on-device streaming ASR model and a large dense server-side non-streaming model in a single supernet. We further show that performing supernet training on both wav2vec 2.0 self-supervised learning and supervised ASR fine-tuning not only substantially improves the large non-streaming model, as shown in prior work, but also improves the compact sparse streaming model.
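The core mechanism the abstract alludes to, magnitude-based self-pruning, can be sketched in a few lines: the sparse streaming subnetwork is obtained by masking out the smallest-magnitude weights of the shared dense supernet. This is a minimal illustrative sketch, not the authors' implementation; the function name `prune_mask` and the toy weight vector are assumptions for illustration only.

```python
# Hypothetical sketch of magnitude-based self-pruning: build a 0/1 mask
# that drops the `sparsity` fraction of smallest-|w| weights, so the
# streaming subnetwork is a sparse view of the shared dense supernet.
def prune_mask(weights, sparsity):
    """Return a 0/1 mask zeroing the `sparsity` fraction of smallest-|w| entries."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])          # indices of the k smallest magnitudes
    return [0 if i in dropped else 1 for i in range(len(weights))]

w = [0.9, -0.1, 0.4, -0.8, 0.05, 0.6]
mask = prune_mask(w, 0.5)             # drop the 3 smallest-magnitude weights
sparse_w = [wi * mi for wi, mi in zip(w, mask)]
# sparse_w keeps 0.9, -0.8, 0.6 and zeros the rest
```

During supernet training, both the dense weights (full-context server mode) and the masked weights (sparse streaming mode) would contribute to the loss, so a single set of shared parameters serves both deployment targets.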