Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder

Paper Title

Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder

Authors

Yusuke Yasuda, Tomoki Toda

Abstract

Text-to-speech synthesis (TTS) is the task of converting text into speech. Two of the factors driving TTS are advances in probabilistic models and in latent representation learning. We propose a TTS method based on latent variable conversion using a diffusion probabilistic model and a variational autoencoder (VAE). Our TTS method uses three components: a waveform model based on a VAE, a diffusion model that predicts the distribution of the waveform model's latent variables from text, and an alignment model that learns alignments between the text and speech latent sequences. Our method integrates diffusion with the VAE by modeling both mean and variance parameters with diffusion, where the target distribution is determined by approximation from the VAE. This latent variable conversion framework potentially enables us to flexibly incorporate various latent feature extractors. Our experiments show that our method is robust to linguistic labels with poor orthography and alignment errors.
