Augmenting Images for ASR and TTS through Single-loop and Dual-loop Multimodal Chain Framework

Paper Title

Augmenting Images for ASR and TTS through Single-loop and Dual-loop Multimodal Chain Framework

Authors

Johanes Effendi, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

Abstract

Previous research has proposed a machine speech chain to enable automatic speech recognition (ASR) and text-to-speech synthesis (TTS) to assist each other in semi-supervised learning and to avoid the need for a large amount of paired speech and text data. However, that framework still requires a large amount of unpaired (speech or text) data. A prototype multimodal machine chain was then explored to further reduce the need for a large amount of unpaired data, which could improve ASR or TTS even when no more speech or text data were available. Unfortunately, this framework relied on the image retrieval (IR) model, and thus it was limited to handling only those images that were already known during training. Furthermore, the performance of this framework was only investigated with single-speaker artificial speech data. In this study, we revamp the multimodal machine chain framework with image generation (IG) and investigate the possibility of augmenting image data for ASR and TTS using single-loop and dual-loop architectures on multispeaker natural speech data. Experimental results revealed that both single-loop and dual-loop multimodal chain frameworks enabled ASR and TTS to improve their performance using an image-only dataset.
