Paper Title
Channel-wise Subband Input for Better Voice and Accompaniment Separation on High Resolution Music
Paper Authors
Paper Abstract
This paper presents a new input format, channel-wise subband input (CWS), for convolutional neural network (CNN)-based music source separation (MSS) models in the frequency domain. We aim to address the major issues in CNN-based high-resolution MSS models: high computational cost and weight sharing between distinctly different frequency bands. Specifically, we decompose the input mixture spectra into several bands and concatenate them channel-wise as the model input. The proposed approach enables effective weight sharing within each subband and introduces more flexibility between channels. For comparison, we perform voice and accompaniment separation (VAS) on models with different scales, architectures, and CWS settings. Experiments show that the CWS input is beneficial in many respects. We evaluate our method on the MUSDB18-HQ test set, focusing on the SDR, SIR, and SAR metrics. Across all our experiments, CWS enables models to obtain a 6.9% performance gain on the average metrics. Even with fewer parameters, less training data, and shorter training time, our MDenseNet with 8-band CWS input still surpasses the original MMDenseNet by a large margin. Moreover, CWS also reduces computational cost and training time to a large extent.
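The core idea of CWS, as described above, is to split the frequency axis of the mixture spectrogram into several subbands and stack them along the channel axis. The following is a minimal NumPy sketch of that reshaping, not the paper's implementation; the function name, equal-width band split, and array shapes are illustrative assumptions.

```python
import numpy as np

def channel_wise_subband(spec: np.ndarray, n_bands: int) -> np.ndarray:
    """Illustrative CWS reshaping: split the frequency axis into equal
    subbands and stack them along the channel axis.

    spec: magnitude spectrogram of shape (channels, freq_bins, frames),
    where freq_bins is divisible by n_bands.
    Returns shape (channels * n_bands, freq_bins // n_bands, frames).
    """
    c, f, t = spec.shape
    assert f % n_bands == 0, "freq_bins must be divisible by n_bands"
    # Split the frequency axis into contiguous bands, then fold the
    # band axis into the channel axis.
    bands = spec.reshape(c, n_bands, f // n_bands, t)
    return bands.reshape(c * n_bands, f // n_bands, t)

# Example: stereo spectrogram, 2048 frequency bins, 100 time frames,
# decomposed into 8 subbands (shapes chosen for illustration only).
spec = np.random.rand(2, 2048, 100)
cws = channel_wise_subband(spec, n_bands=8)
print(cws.shape)  # (16, 256, 100)
```

With the frequency dimension reduced by the number of bands, each convolution operates on a much smaller feature map, which is how CWS cuts computational cost while letting different channels specialize to different bands.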