Paper Title

Towards Improved Zero-shot Voice Conversion with Conditional DSVAE

Authors

Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu

Abstract

Disentangling content and speaking-style information is essential for zero-shot non-parallel voice conversion (VC). Our previous study investigated a novel framework with a disentangled sequential variational autoencoder (DSVAE) as the backbone for information decomposition, and demonstrated that simultaneously disentangling a content embedding and a speaker embedding from one utterance is feasible for zero-shot VC. In this study, we continue that direction by raising a concern about the prior distribution of the content branch in the DSVAE baseline. We find that a randomly initialized prior distribution forces the content embedding to shed phonetic-structure information during learning, which is not a desired property. Here, we seek a better content embedding that preserves more phonetic information. We propose the conditional DSVAE, a new model that introduces a content bias as a condition on the prior modeling and reshapes the content embedding sampled from the posterior distribution. In experiments on the VCTK dataset, we demonstrate that content embeddings derived from the conditional DSVAE overcome the randomness and achieve much better phoneme classification accuracy, stabilized vocalization, and better zero-shot VC performance compared with the competitive DSVAE baseline.
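The core intuition can be illustrated with a toy KL computation: in a VAE, the KL term pulls the posterior toward the prior, so a content-independent prior penalizes whatever phonetic structure the posterior encodes, while a prior conditioned on a content bias does not. The sketch below is purely illustrative (the dimension, the stand-in `content_bias`, and all values are assumptions, not the paper's actual architecture or numbers):

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over the embedding dimension."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

rng = np.random.default_rng(0)
d = 8  # content-embedding dimension (illustrative)

# Posterior statistics q(z_c | x) for one frame.
mu_q, logvar_q = rng.normal(size=d), np.zeros(d)

# DSVAE baseline: a prior that is independent of the content,
# modeled here as a standard Gaussian.
mu_p_base, logvar_p_base = np.zeros(d), np.zeros(d)

# Conditional DSVAE (sketch): the prior mean is shifted toward a
# content-bias vector; here a hypothetical stand-in correlated with mu_q.
content_bias = 0.9 * mu_q
mu_p_cond, logvar_p_cond = content_bias, np.zeros(d)

kl_base = kl_diag_gaussians(mu_q, logvar_q, mu_p_base, logvar_p_base)
kl_cond = kl_diag_gaussians(mu_q, logvar_q, mu_p_cond, logvar_p_cond)

# The conditional prior penalizes the phonetic content far less,
# so the model is not pushed to discard it.
assert kl_cond < kl_base
```

Under a content-unaware prior the KL penalty grows with any phonetic structure in `mu_q`, which is one way to see why the baseline's content embedding "reduces phonetic-structure information" during training.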
