Paper Title

JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment

Authors

Dan Lim, Won Jang, Gyeonghwan O, Heayoung Park, Bongwan Kim, Jaesam Yoon

Abstract

We propose Jointly trained Duration Informed Transformer (JDI-T), a feed-forward Transformer with a duration predictor jointly trained without explicit alignments in order to generate an acoustic feature sequence from an input text. In this work, inspired by the recent success of the duration informed networks such as FastSpeech and DurIAN, we further simplify its sequential, two-stage training pipeline to a single-stage training. Specifically, we extract the phoneme duration from the autoregressive Transformer on the fly during the joint training instead of pretraining the autoregressive model and using it as a phoneme duration extractor. To our best knowledge, it is the first implementation to jointly train the feed-forward Transformer without relying on a pre-trained phoneme duration extractor in a single training pipeline. We evaluate the effectiveness of the proposed model on the publicly available Korean Single speaker Speech (KSS) dataset compared to the baseline text-to-speech (TTS) models trained by ESPnet-TTS.
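
The central idea is that phoneme durations are obtained from the autoregressive Transformer on the fly during joint training, rather than from a separately pre-trained duration extractor. The sketch below illustrates one common way durations can be derived from encoder-decoder attention: each mel frame is assigned to the phoneme it attends to most strongly, and the per-phoneme frame counts serve as duration targets for the duration predictor. This is a minimal, hypothetical illustration of the general technique, not the paper's exact extraction rule; the tensor shapes and helper names are assumptions.

```python
import torch


def extract_durations(attn, text_lengths, mel_lengths):
    """Derive integer phoneme durations from encoder-decoder attention.

    attn: (batch, max_mel_len, max_text_len) attention weights from the
          autoregressive Transformer (assumed shape for this sketch).
    text_lengths, mel_lengths: lists of Python ints with the valid
          phoneme / mel-frame lengths per utterance.
    Returns: (batch, max_text_len) long tensor of frame counts per phoneme.
    """
    batch, _, max_text_len = attn.shape
    durations = torch.zeros(batch, max_text_len, dtype=torch.long)
    for b in range(batch):
        # For each valid mel frame, pick the phoneme it attends to most.
        frames = attn[b, : mel_lengths[b], : text_lengths[b]].argmax(dim=-1)
        # Count how many frames were assigned to each phoneme index.
        durations[b, : text_lengths[b]] = torch.bincount(
            frames, minlength=text_lengths[b]
        )
    return durations


# Example usage with random attention weights (illustration only).
attn = torch.rand(2, 100, 20).softmax(dim=-1)
durs = extract_durations(attn, text_lengths=[20, 15], mel_lengths=[100, 80])
print(durs.shape)  # torch.Size([2, 20])
```

Because the attention (and hence the extracted durations) changes as the autoregressive model trains, the duration predictor and the feed-forward Transformer can be optimized in the same single-stage pipeline instead of waiting for a pre-trained extractor.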
