Improving Adversarial Waveform Generation based Singing Voice Conversion with Harmonic Signals

Paper Title

Improving Adversarial Waveform Generation based Singing Voice Conversion with Harmonic Signals

Authors

Haohan Guo, Zhiping Zhou, Fanbo Meng, Kai Liu

Abstract

Adversarial waveform generation has become a popular backend for singing voice conversion (SVC) because it produces high-quality singing audio. However, the instability of GANs also causes problems such as pitch jitter and voiced/unvoiced (U/V) errors. These artifacts impair the smoothness and continuity of harmonics and therefore seriously degrade conversion quality. This paper proposes feeding harmonic signals to the SVC model in advance to enhance audio generation. We extract a sine excitation from the pitch and filter it with a linear time-varying (LTV) filter estimated by a neural network. Both harmonic signals, the sine excitation and the filtered excitation, are used as inputs for generating the singing waveform. In our experiments, two mainstream models, MelGAN and ParallelWaveGAN, are investigated to validate the effectiveness of the proposed approach. We conduct a MOS test on clean and noisy test sets. The results show that both signals significantly improve SVC in fidelity and timbre similarity. In addition, a case analysis further confirms that the method enhances the smoothness and continuity of harmonics in the generated audio, and that the filtered excitation better matches the target audio.
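
The abstract only states that a sine excitation is extracted from the pitch; it does not give the exact formulation. As a rough illustration of the general idea, the following Python sketch builds a sample-level sine excitation from a frame-level F0 contour by accumulating phase and silencing unvoiced frames. The function name `sine_excitation`, its parameters (`sample_rate`, `hop_size`, `num_harmonics`, `amplitude`), and the convention that F0 = 0 marks an unvoiced frame are all assumptions made for this example, not details taken from the paper. The LTV filtering step is not shown.

```python
import math


def sine_excitation(f0_frames, sample_rate=16000, hop_size=80,
                    num_harmonics=1, amplitude=0.1):
    """Build a sample-level sine excitation from frame-level F0 values (Hz).

    Hypothetical helper for illustration only. Assumes f0 <= 0 marks an
    unvoiced frame, which is rendered as silence.
    """
    excitation = []
    phase = 0.0  # running phase of the fundamental, in radians
    for f0 in f0_frames:
        # Hold each frame's F0 constant over hop_size samples.
        for _ in range(hop_size):
            if f0 <= 0.0:
                excitation.append(0.0)  # unvoiced: no harmonic content
                continue
            # Accumulate phase so the sine stays continuous across frames.
            phase = math.fmod(phase + 2.0 * math.pi * f0 / sample_rate,
                              2.0 * math.pi)
            sample = 0.0
            for k in range(1, num_harmonics + 1):
                # Skip harmonics at or above the Nyquist frequency.
                if k * f0 < sample_rate / 2.0:
                    sample += math.sin(k * phase)
            excitation.append(amplitude * sample)
    return excitation


# One unvoiced frame followed by two voiced frames at 100 Hz.
signal = sine_excitation([0.0, 100.0, 100.0])
```

Accumulating phase, rather than evaluating the sine from absolute time, keeps the waveform continuous when F0 changes between frames, which matters when the goal is smooth, continuous harmonics.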
