Paper Title
UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis
Paper Authors
Paper Abstract
Text-to-speech (TTS) and singing voice synthesis (SVS) aim to generate high-quality speaking and singing voices from textual input and music scores, respectively. Unifying TTS and SVS into a single system is crucial for applications that require both. Existing methods usually suffer from limitations: they rely either on both singing and speaking data from the same person or on a cascade of task-specific models. To address these problems, this paper proposes UniSyn, a simple and elegant framework for TTS and SVS. It is an end-to-end unified model that can make a voice both speak and sing given only singing or speaking data from that person. Specifically, UniSyn introduces a multi-conditional variational autoencoder (MC-VAE), which constructs two independent latent sub-spaces with speaker- and style-related (i.e., speaking or singing) conditions for flexible control. Moreover, a supervised guided-VAE and timbre perturbation with a Wasserstein distance constraint are leveraged to further disentangle speaker timbre and style. Experiments conducted on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voices without the corresponding training data. The proposed approach outperforms state-of-the-art end-to-end voice generation work, which proves the effectiveness and advantages of UniSyn.
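To make two of the abstract's ideas concrete, below is a minimal PyTorch sketch of (a) two condition-specific Gaussian latent sub-spaces, as an MC-VAE might construct them, and (b) the closed-form 2-Wasserstein distance between diagonal Gaussians, which could serve as the distance constraint mentioned above. All module names, layer sizes, and the exact conditioning scheme are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the MC-VAE idea from the abstract: one latent
# sub-space conditioned on speaker identity and one on style
# (speaking vs. singing). Sizes and structure are assumptions.
import torch
import torch.nn as nn


class ConditionalPosterior(nn.Module):
    """Encodes content features plus a condition embedding into the
    mean and log-variance of a diagonal Gaussian posterior."""

    def __init__(self, feat_dim: int, cond_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim, 256),
            nn.Tanh(),
            nn.Linear(256, 2 * latent_dim),  # -> [mu, log_var]
        )

    def forward(self, feats: torch.Tensor, cond: torch.Tensor):
        mu, log_var = self.net(torch.cat([feats, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterize
        return z, mu, log_var


def gaussian_w2_sq(mu1, log_var1, mu2, log_var2):
    """Squared 2-Wasserstein distance between two diagonal Gaussians:
    W2^2 = ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2."""
    sigma1 = torch.exp(0.5 * log_var1)
    sigma2 = torch.exp(0.5 * log_var2)
    return ((mu1 - mu2) ** 2).sum(-1) + ((sigma1 - sigma2) ** 2).sum(-1)


class MCVAEPosteriors(nn.Module):
    """Two independent latent sub-spaces: one tied to speaker identity,
    one tied to style (speaking or singing)."""

    def __init__(self, feat_dim=80, spk_dim=64, style_dim=16, latent_dim=32):
        super().__init__()
        self.speaker_post = ConditionalPosterior(feat_dim, spk_dim, latent_dim)
        self.style_post = ConditionalPosterior(feat_dim, style_dim, latent_dim)

    def forward(self, feats, spk_emb, style_emb):
        z_spk, mu_s, lv_s = self.speaker_post(feats, spk_emb)
        z_sty, mu_t, lv_t = self.style_post(feats, style_emb)
        # The concatenated latents condition the decoder; because the
        # sub-spaces are separate, swapping the style condition at
        # inference can, in principle, make a speaking-only voice sing.
        return torch.cat([z_spk, z_sty], dim=-1), (mu_s, lv_s), (mu_t, lv_t)
```

As a usage note under these assumptions, `gaussian_w2_sq` applied to the two posteriors (or to posteriors under perturbed timbre, per the abstract's timbre-perturbation idea) would be added to the training loss as a regularizer that pushes the speaker and style sub-spaces toward disentanglement.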