Paper Title
Avocodo: Generative Adversarial Network for Artifact-free Vocoder
Paper Authors
Paper Abstract
Neural vocoders based on the generative adversarial network (GAN) have been widely used because they combine fast inference and lightweight networks with high-quality speech waveform generation. Since the perceptually important speech components are concentrated primarily in the low-frequency bands, most GAN-based vocoders perform multi-scale analysis that evaluates downsampled speech waveforms. This multi-scale analysis helps the generator improve speech intelligibility. However, in preliminary experiments, we discovered that multi-scale analysis that focuses on the low-frequency bands causes unintended artifacts, e.g., aliasing and imaging artifacts, which degrade the quality of the synthesized speech waveform. Therefore, in this paper, we investigate the relationship between these artifacts and GAN-based vocoders and propose a GAN-based vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts. We introduce two kinds of discriminators to evaluate speech waveforms from various perspectives: a collaborative multi-band discriminator and a sub-band discriminator. We also utilize a pseudo quadrature mirror filter bank to obtain downsampled multi-band speech waveforms while avoiding aliasing. According to experimental results, Avocodo outperforms baseline GAN-based vocoders, both objectively and subjectively, while reproducing speech with fewer artifacts.
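The aliasing that the abstract attributes to downsampling can be illustrated with a small, self-contained sketch. This is not the paper's pseudo quadrature mirror filter bank or its discriminators; it only shows the general effect under assumed values. The 16 kHz sample rate, the 6 kHz test tone, the 3.6 kHz cutoff, the 101-tap Hamming-windowed sinc filter, and all function names are hypothetical choices made for illustration.

```python
import math


def tone(freq_hz, sample_rate, n):
    """Unit-amplitude sine wave of n samples."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def lowpass_fir(cutoff_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc low-pass filter, normalized to unit DC gain."""
    fc = cutoff_hz / sample_rate  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        t = i - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [tap / total for tap in taps]


def fir_filter(x, taps):
    """'Valid' convolution: only outputs where the filter fully overlaps x."""
    k = len(taps)
    return [
        sum(taps[j] * x[i + k - 1 - j] for j in range(k))
        for i in range(len(x) - k + 1)
    ]


def amplitude_at(x, freq_hz, sample_rate):
    """Amplitude of the sinusoidal component of x at freq_hz (single-bin DFT)."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(v * math.cos(w * i) for i, v in enumerate(x))
    im = sum(v * math.sin(w * i) for i, v in enumerate(x))
    return 2 * math.hypot(re, im) / len(x)


def alias_amplitudes():
    """Return the alias amplitude for naive and for filtered 2x downsampling."""
    sample_rate = 16000          # assumed input rate (Hz)
    x = tone(6000, sample_rate, 1700)  # 6 kHz lies above the new Nyquist (4 kHz)

    # Naive downsampling: keep every second sample with no filtering.
    naive = x[::2]

    # Anti-aliased downsampling: low-pass below the new Nyquist, then decimate.
    filtered = fir_filter(x, lowpass_fir(3600, sample_rate))[::2]

    # After 2x downsampling the 6 kHz tone folds to 8000 - 6000 = 2000 Hz.
    new_rate = sample_rate // 2
    return (
        amplitude_at(naive, 2000, new_rate),
        amplitude_at(filtered, 2000, new_rate),
    )


if __name__ == "__main__":
    naive_alias, filtered_alias = alias_amplitudes()
    print(f"alias amplitude, naive decimation:    {naive_alias:.4f}")
    print(f"alias amplitude, filtered decimation: {filtered_alias:.4f}")
```

In this toy setup, the naive path should show the folded 2 kHz component at close to the full amplitude of the original tone, while the filtered path should leave it near zero. A single low-pass filter discards the high band entirely; a filter bank such as the one named in the abstract instead splits the signal into several sub-bands, which this sketch does not attempt to reproduce.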