Multi-Scale Feature Fusion Transformer Network for End-to-End Single Channel Speech Separation

Paper title

Multi-Scale Feature Fusion Transformer Network for End-to-End Single Channel Speech Separation

Authors

Xu, Yinhao; Zhou, Jian; Tao, Liang; Kwan, Hon Keung

Abstract

Recent studies on time-domain audio separation networks (TasNets) have made great strides in speech separation. One of the most representative TasNets is a network with a dual-path segmentation approach. However, the original model, called DPRNN, uses a fixed feature dimension and an unchanged segment size throughout all layers of the network. In this paper, we propose a multi-scale feature fusion transformer network (MSFFT-Net) based on the conventional dual-path structure for single-channel speech separation. The conventional dual-path structure has only one processing path, which adopts several iterative blocks with alternating intra-chunk and inter-chunk operations to capture local and global context information. In contrast, the proposed MSFFT-Net has multiple parallel processing paths, and feature information can be exchanged between them. Experiments show that our proposed networks based on the multi-scale feature fusion structure achieve better results than the original dual-path model on the benchmark dataset WSJ0-2mix: the SI-SNRi score of MSFFT-3P is 20.7 dB (1.47% improvement) and that of MSFFT-2P is 21.0 dB (3.45% improvement), which achieves SOTA on WSJ0-2mix without any data augmentation method.
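The abstract reports results as SI-SNRi (scale-invariant signal-to-noise ratio improvement). As a rough illustration of what that number measures, the following is a minimal sketch of the commonly used definition, not code from the paper: the function names `si_snr` and `si_snr_improvement`, the `eps` constant, and the zero-mean normalization step are all assumptions made for this example.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Illustrative helper (hypothetical name), following the usual definition:
    project the estimate onto the target, then compare the energy of the
    projection with the energy of the residual.
    """
    n = len(target)
    # Remove the mean of each signal (assumed normalization step).
    e_mean = sum(estimate) / n
    t_mean = sum(target) / n
    e = [x - e_mean for x in estimate]
    t = [x - t_mean for x in target]

    # Projection of the estimate onto the target direction.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]

    # Whatever is left after the projection is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]

    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

Under these assumptions, `si_snr_improvement(estimate, mixture, target)` gives the gain in dB of one separated source over the raw mixture. Published scores such as those in the abstract are typically averages over all sources and test utterances.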
