Paper Title

Bootstrap an end-to-end ASR system by multilingual training, transfer learning, text-to-text mapping and synthetic audio

Paper Authors

Manuel Giollo, Deniz Gunceler, Yulan Liu, Daniel Willett

Paper Abstract

Bootstrapping speech recognition on limited data resources has been an area of active research for long. The recent transition to all-neural models and end-to-end (E2E) training brought along particular challenges as these models are known to be data hungry, but also came with opportunities around language-agnostic representations derived from multilingual data as well as shared word-piece output representations across languages that share script and roots. We investigate here the effectiveness of different strategies to bootstrap an RNN-Transducer (RNN-T) based automatic speech recognition (ASR) system in the low resource regime, while exploiting the abundant resources available in other languages as well as the synthetic audio from a text-to-speech (TTS) engine. Our experiments demonstrate that transfer learning from a multilingual model, using a post-ASR text-to-text mapping and synthetic audio deliver additive improvements, allowing us to bootstrap a model for a new language with a fraction of the data that would otherwise be needed. The best system achieved a 46% relative word error rate (WER) reduction compared to the monolingual baseline, among which 25% relative WER improvement is attributed to the post-ASR text-to-text mappings and the TTS synthetic data.
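
As a concrete illustration of the transfer-learning step described in the abstract, the PyTorch sketch below initializes the encoder of a new-language RNN-T from a multilingual checkpoint while leaving the vocabulary-dependent prediction network and joiner to be trained on the limited in-language data (real plus TTS-synthesized audio). This is a minimal sketch under assumed layer sizes and file names (`RNNT`, `multilingual_rnnt.pt`), not the configuration used in the paper.

```python
# Minimal sketch of the transfer-learning step: initialize the encoder of a
# new-language RNN-T from a multilingual checkpoint and leave the
# vocabulary-dependent parts randomly initialized. Layer sizes, file names
# and the toy "donor" model are illustrative assumptions only.
import torch
import torch.nn as nn

class RNNT(nn.Module):
    def __init__(self, n_feats=80, hidden=512, vocab=1000):
        super().__init__()
        # Acoustic encoder: largely language-agnostic, so it transfers well.
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=5, batch_first=True)
        # Prediction network and joiner depend on the word-piece vocabulary,
        # so they are re-initialized for the new language.
        self.embed = nn.Embedding(vocab, hidden)
        self.predictor = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.joiner = nn.Linear(2 * hidden, vocab)

# Pretend multilingual "donor" model trained on resource-rich languages.
donor = RNNT(vocab=4000)
torch.save(donor.state_dict(), "multilingual_rnnt.pt")

# New-language model with its own word-piece vocabulary.
model = RNNT(vocab=2500)

# Transfer only the encoder parameters; strict=False leaves the
# embedding/predictor/joiner weights (different shapes) untouched.
ckpt = torch.load("multilingual_rnnt.pt", map_location="cpu")
encoder_only = {k: v for k, v in ckpt.items() if k.startswith("encoder.")}
missing, unexpected = model.load_state_dict(encoder_only, strict=False)
print(f"transferred {len(encoder_only)} encoder tensors; "
      f"{len(missing)} tensors stay randomly initialized for fine-tuning")
```

Fine-tuning would then proceed on the mix of real and TTS-synthesized audio; the post-ASR text-to-text mapping mentioned in the abstract operates afterwards, on the recognizer's output hypotheses rather than on the model weights.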
