Paper Title


TalkNet: Fully-Convolutional Non-Autoregressive Speech Synthesis Model

Authors

Stanislav Beliaev, Yurii Rebryk, Boris Ginsburg

Abstract


We propose TalkNet, a convolutional non-autoregressive neural model for speech synthesis. The model consists of two feed-forward convolutional networks. The first network predicts grapheme durations. An input text is expanded by repeating each symbol according to the predicted duration. The second network generates a mel-spectrogram from the expanded text. To train a grapheme duration predictor, we add the grapheme duration to the training dataset using a pre-trained Connectionist Temporal Classification (CTC)-based speech recognition model. The explicit duration prediction eliminates word skipping and repeating. Experiments on the LJSpeech dataset show that the speech quality nearly matches auto-regressive models. The model is very compact -- it has 10.8M parameters, almost 3x less than the present state-of-the-art text-to-speech models. The non-autoregressive architecture allows for fast training and inference.
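The core non-autoregressive trick described above — repeating each input symbol according to its predicted duration so the second network can map expanded symbols to mel-spectrogram frames one-to-one — can be sketched as follows. This is a minimal illustration, not TalkNet's actual code; the function name and the example durations are hypothetical.

```python
# Hypothetical sketch of TalkNet's expansion step: each grapheme is
# repeated according to its predicted duration (in spectrogram frames),
# so the mel-spectrogram generator sees one symbol per output frame.

def expand_by_duration(symbols, durations):
    """Repeat symbols[i] exactly durations[i] times."""
    expanded = []
    for sym, dur in zip(symbols, durations):
        expanded.extend([sym] * dur)
    return expanded

# Example: "cat" with predicted durations [3, 1, 2]
print(expand_by_duration(list("cat"), [3, 1, 2]))
# → ['c', 'c', 'c', 'a', 't', 't']
```

Because the expanded sequence length is known before synthesis begins, all output frames can be generated in parallel, which is what makes training and inference fast compared with auto-regressive models.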
