Paper Title
Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE
Paper Authors
Paper Abstract
The variational auto-encoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker identity and linguistic content latent embeddings, and then generating an utterance for a target speaker from that of a source speaker. This is possible by concatenating the identity embedding of the target speaker with the content embedding of the source speaker uttering a desired sentence. In this work, we propose to improve VAE models with self-attention and with structural regularization via the relaxed group-wise splitting method (RGSM). Specifically, we identify a suitable location in the VAE's decoder to add a self-attention layer, which incorporates non-local information when generating a converted utterance and helps hide the source speaker's identity. We apply RGSM to regularize the network weights and remarkably enhance generalization performance. In experiments on the zero-shot many-to-many voice conversion task on the VCTK dataset, with the self-attention layer and RGSM, our model achieves a 28.3% gain in speaker classification accuracy on unseen speakers while slightly improving the quality of the converted voices in terms of MOSNet scores. These encouraging findings point to future research on integrating a greater variety of attention structures into the VAE framework, while controlling model size and overfitting, to advance zero-shot many-to-many voice conversion.
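
To make the conversion mechanism concrete, below is a minimal PyTorch sketch of the concatenation-based decoding described in the abstract, with one self-attention layer inside the decoder. All module names, dimensions, and the exact placement of the attention layer are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of zero-shot voice conversion with a VAE-style
# encoder/decoder. All sizes, names, and the attention placement are
# illustrative assumptions; the paper's architecture may differ.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Maps a mel-spectrogram to a speaker-independent content embedding.
    (A real VAE would also produce a mean/variance and sample from it;
    that step is omitted here for brevity.)"""
    def __init__(self, n_mels=80, content_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, content_dim, kernel_size=5, padding=2),
        )
    def forward(self, mel):                  # mel: (B, n_mels, T)
        return self.net(mel)                 # (B, content_dim, T)

class Decoder(nn.Module):
    """Decodes [content ; speaker identity] back to a mel-spectrogram,
    with one self-attention layer to mix non-local (long-range) frames."""
    def __init__(self, content_dim=64, spk_dim=256, n_mels=80, n_heads=4):
        super().__init__()
        d = content_dim + spk_dim
        self.pre = nn.Conv1d(d, d, kernel_size=5, padding=2)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.post = nn.Conv1d(d, n_mels, kernel_size=5, padding=2)
    def forward(self, content, spk_emb):     # content: (B, C, T), spk_emb: (B, S)
        # Broadcast the speaker embedding over time, then concatenate.
        spk = spk_emb.unsqueeze(-1).expand(-1, -1, content.size(-1))
        x = self.pre(torch.cat([content, spk], dim=1))   # (B, C+S, T)
        h = x.transpose(1, 2)                            # (B, T, C+S)
        h, _ = self.attn(h, h, h)                        # self-attention
        x = x + h.transpose(1, 2)                        # residual add
        return self.post(x)                              # (B, n_mels, T)

# Zero-shot conversion: source content + (possibly unseen) target identity.
enc, dec = ContentEncoder(), Decoder()
src_mel = torch.randn(1, 80, 128)            # source utterance
tgt_spk = torch.randn(1, 256)                # target speaker embedding
converted = dec(enc(src_mel), tgt_spk)       # (1, 80, 128)
```

The residual connection around the attention layer keeps the decoder's convolutional path intact, so self-attention only adds non-local context rather than replacing the local structure.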
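
The abstract does not spell out RGSM's update rule. The sketch below assumes the split-variable formulation commonly associated with relaxed group-wise splitting: minimize f(w) + λ·GroupL1(u) + (β/2)‖w − u‖², alternating SGD on the live weights w with a group-L1 proximal step on the auxiliary weights u. The grouping (one group per output channel) and all hyperparameters are assumptions, not the paper's settings.

```python
# Sketch of a relaxed group-wise splitting style update (assumed form).
import torch

def group_l1_prox(w, thresh):
    """Proximal step for group-L1: soft-threshold the norm of each
    output-channel group of a weight tensor w (shape: out_ch x ...)."""
    flat = w.flatten(1)                          # one group per output channel
    norms = flat.norm(dim=1, keepdim=True)       # (out_ch, 1)
    scale = torch.clamp(1.0 - thresh / (norms + 1e-12), min=0.0)
    return (flat * scale).view_as(w)

def rgsm_step(model, loss, u_vars, lam=1e-4, beta=1e-3, lr=1e-3):
    """One alternating step: SGD on loss + (beta/2)||w - u||^2,
    then the proximal update u = prox_{(lam/beta) GroupL1}(w)."""
    coupling = sum(0.5 * beta * (p - u).pow(2).sum()
                   for p, u in zip(model.parameters(), u_vars))
    (loss + coupling).backward()
    with torch.no_grad():
        for p, u in zip(model.parameters(), u_vars):
            p -= lr * p.grad
            p.grad = None
            if p.dim() > 1:                      # prox only on weight matrices
                u.copy_(group_l1_prox(p, lam / beta))
            else:                                # biases: no group structure
                u.copy_(p)

# Usage per training iteration, with u_vars initialized once as
#   u_vars = [p.detach().clone() for p in model.parameters()]
# then:
#   loss = criterion(model(x), y)
#   rgsm_step(model, loss, u_vars)
```

Under this formulation, the quadratic coupling pulls the trained weights toward a group-sparse companion, which is one plausible mechanism for the regularization and generalization gains the abstract reports.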