Intermediate Fine-Tuning Using Imperfect Synthetic Speech for Improving Electrolaryngeal Speech Recognition

Paper Title

Intermediate Fine-Tuning Using Imperfect Synthetic Speech for Improving Electrolaryngeal Speech Recognition

Authors

Lester Phillip Violeta, Ding Ma, Wen-Chin Huang, Tomoki Toda

Abstract

Automatic speech recognition (ASR) systems for electrolaryngeal speakers remain relatively unexplored because the available datasets are small. When ASR training data is scarce, a large-scale pretraining and fine-tuning framework is often sufficient to achieve high recognition rates; in electrolaryngeal speech, however, the domain shift between the pretraining and fine-tuning data is too large to overcome, which limits the achievable improvement in recognition rate. To resolve this, we propose an intermediate fine-tuning step that uses imperfect synthetic speech to close the domain shift gap between the pretraining and target data. Despite the imperfections of the synthetic data, we show the effectiveness of this approach on electrolaryngeal speech datasets, with improvements of 6.1% over the baseline that did not use imperfect synthetic speech. Results show how the intermediate fine-tuning stage focuses on learning the high-level inherent features of the imperfect synthetic data rather than low-level features such as intelligibility.
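The ordering of the three training stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Stage` class, the `fine_tune` and `run_pipeline` functions, the scalar "model", and all numeric values are hypothetical stand-ins chosen only to show that each stage starts from the weights left by the previous one.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Stage:
    """One training stage (hypothetical structure, for illustration only)."""
    name: str
    data: Sequence[float]  # toy stand-in for a speech dataset
    epochs: int
    lr: float


def fine_tune(weight: float, stage: Stage) -> float:
    """Fit a single scalar to the stage data by gradient descent on mean squared error."""
    for _ in range(stage.epochs):
        grad = sum(2.0 * (weight - x) for x in stage.data) / len(stage.data)
        weight -= stage.lr * grad
    return weight


def run_pipeline(weight: float, stages: List[Stage]) -> Tuple[float, List[Tuple[str, float]]]:
    """Run the stages in order; each one continues from the previous stage's weight."""
    history = []
    for stage in stages:
        weight = fine_tune(weight, stage)
        history.append((stage.name, weight))
    return weight, history


# Toy data only: the numbers carry no meaning beyond "synthetic sits between
# the pretraining domain and the target domain" (an assumption of this sketch).
stages = [
    Stage("pretraining", [0.0, 0.1, -0.1], epochs=50, lr=0.1),
    Stage("intermediate fine-tuning (imperfect synthetic speech)", [0.8, 0.6, 0.7], epochs=50, lr=0.1),
    Stage("fine-tuning (target electrolaryngeal data)", [1.0, 1.1], epochs=50, lr=0.1),
]

final_weight, history = run_pipeline(5.0, stages)
for name, value in history:
    print(f"{name}: weight = {value:.4f}")
```

In a real system each `fine_tune` call would be replaced by training an ASR model on the corresponding corpus; the sketch only captures the sequencing, not the recognition gains reported in the abstract.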
