A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data

Paper Title

A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data

Authors

Raviraj Joshi, Anupam Singh

Abstract

Automatic speech recognition (ASR) is now dominated by deep learning-based end-to-end models. These approaches require large amounts of labeled data in the form of audio-text pairs. Moreover, these models are more susceptible to domain shift than traditional models. It is common practice to train a generic ASR model and then adapt it to a target domain using a comparatively small data set. We consider a more extreme case of domain adaptation in which only a text corpus is available. In this work, we propose a simple baseline technique for domain adaptation in end-to-end speech recognition models. We convert the text-only corpus to audio data using a single-speaker text-to-speech (TTS) engine. The resulting parallel data in the target domain is then used to fine-tune the final dense layer of generic ASR models. We show that single-speaker synthetic TTS data, coupled with fine-tuning of only the final dense layer, provides reasonable improvements in word error rate. We use text data from the address and e-commerce search domains to show the effectiveness of our low-cost baseline approach on CTC and attention-based models.
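
The fine-tuning step described in the abstract (update only the final dense layer, keep the rest of the generic model fixed) can be sketched roughly as follows in PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyCTCModel`, its layer sizes, and `freeze_all_but_final_layer` are hypothetical names, the encoder is a toy feed-forward stack rather than a real ASR encoder, and random tensors stand in for features extracted from the single-speaker TTS audio and for the tokenized target-domain text.

```python
import torch
from torch import nn


class ToyCTCModel(nn.Module):
    """Hypothetical stand-in for a pretrained generic CTC ASR model."""

    def __init__(self, n_feats=40, hidden=64, vocab=30):
        super().__init__()
        # Assumption: a toy feed-forward encoder replaces the real ASR encoder.
        self.encoder = nn.Sequential(
            nn.Linear(n_feats, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Final dense layer that projects encoder states to the vocabulary.
        self.output = nn.Linear(hidden, vocab)

    def forward(self, feats):
        # feats: (batch, time, n_feats) -> log-probs: (batch, time, vocab)
        return self.output(self.encoder(feats)).log_softmax(dim=-1)


def freeze_all_but_final_layer(model):
    """Disable gradients everywhere except the final dense layer."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.output.parameters():
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]


torch.manual_seed(0)
model = ToyCTCModel()
trainable = freeze_all_but_final_layer(model)
optimizer = torch.optim.Adam(trainable, lr=1e-3)
ctc_loss = nn.CTCLoss(blank=0)

# Assumption: random data stands in for (TTS audio features, text tokens) pairs.
feats = torch.randn(4, 50, 40)               # 4 utterances, 50 frames each
targets = torch.randint(1, 30, (4, 10))      # 10 non-blank tokens per utterance
input_lengths = torch.full((4,), 50, dtype=torch.long)
target_lengths = torch.full((4,), 10, dtype=torch.long)

# Snapshots used to check which weights a training step actually changes.
encoder_before = [p.clone() for p in model.encoder.parameters()]
output_before = model.output.weight.clone()

# One fine-tuning step; nn.CTCLoss expects log-probs shaped (time, batch, vocab).
log_probs = model(feats).transpose(0, 1)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After the step, the encoder weights are identical to their snapshots and only the final layer's weight and bias have moved, which is what keeps this kind of adaptation cheap. With a real pretrained model, the attribute holding the final dense layer will have a different name and would need to be looked up in that model's definition.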
