Paper Title

DiffWave: A Versatile Diffusion Model for Audio Synthesis

Authors

Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro

Abstract

In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.
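The abstract's core mechanism, converting white noise into a structured waveform through a Markov chain with a constant number of denoising steps, follows the standard reverse-diffusion (DDPM-style) sampling loop. The sketch below illustrates that loop; it is not the authors' implementation, and the noise schedule, step count `T`, and the stub noise-prediction network `eps_theta` are all illustrative assumptions:

```python
import numpy as np

def reverse_diffusion_sketch(eps_theta, length, T=50, seed=0):
    """Sketch of the constant-step reverse diffusion described in the
    abstract: start from white noise x_T and denoise for T steps.
    `eps_theta(x, t)` stands in for a trained noise-prediction network."""
    rng = np.random.default_rng(seed)
    beta = np.linspace(1e-4, 0.05, T)       # linear noise schedule (assumed)
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)

    x = rng.standard_normal(length)         # x_T ~ N(0, I): white noise
    for t in range(T - 1, -1, -1):
        z = rng.standard_normal(length) if t > 0 else 0.0
        # DDPM posterior-mean update using the predicted noise eps_theta.
        coef = (1.0 - alpha[t]) / np.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps_theta(x, t)) / np.sqrt(alpha[t])
        x = x + np.sqrt(beta[t]) * z        # inject sampling noise except at t = 0
    return x

# Usage with a dummy "network" that always predicts zero noise:
waveform = reverse_diffusion_sketch(lambda x, t: np.zeros_like(x), length=16000)
```

Because the number of steps `T` is fixed and each step updates the whole waveform at once, synthesis is non-autoregressive, which is what lets DiffWave run orders of magnitude faster than a sample-by-sample WaveNet vocoder.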
