Paper Title
MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification
Paper Authors
Paper Abstract
In this paper, we present the Multi-scale Feature Aggregation Conformer (MFA-Conformer), an easy-to-implement, simple yet effective backbone for automatic speaker verification based on the Convolution-augmented Transformer (Conformer). The architecture of the MFA-Conformer is inspired by recent state-of-the-art models in speech recognition and speaker verification. Firstly, we introduce a convolutional subsampling layer to decrease the computational cost of the model. Secondly, we adopt Conformer blocks, which combine Transformers and convolutional neural networks (CNNs), to capture global and local features effectively. Finally, the output feature maps from all Conformer blocks are concatenated to aggregate multi-scale representations before final pooling. We evaluate the MFA-Conformer on widely used benchmarks. The best system obtains 0.64%, 1.29% and 1.63% EER on the VoxCeleb1-O, SITW.Dev and SITW.Eval sets, respectively. The MFA-Conformer significantly outperforms the popular ECAPA-TDNN system in both recognition performance and inference speed. Last but not least, ablation studies clearly demonstrate that the combination of global and local feature learning leads to robust and accurate speaker embedding extraction. We have also released the code for future comparison.
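The abstract describes three components: a convolutional subsampling layer, a stack of Conformer blocks, and concatenation of every block's output before pooling. The following is a minimal PyTorch sketch of that pipeline, not the authors' released implementation: the Conformer block is simplified (one self-attention, one depthwise-convolution and one feed-forward sub-module instead of the full macaron structure), mean/std pooling stands in for the paper's pooling layer, and all module names and dimensions are illustrative assumptions.

# Minimal sketch of the MFA-Conformer backbone described in the abstract.
# Simplified Conformer blocks and mean/std pooling are assumptions, not the
# authors' exact implementation.
import torch
import torch.nn as nn


class SimplifiedConformerBlock(nn.Module):
    """Self-attention (global context) + depthwise conv (local context) + FFN."""
    def __init__(self, d_model=256, n_heads=4, kernel_size=15):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):                               # x: (B, T, d_model)
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # global features
        h = self.conv_norm(x).transpose(1, 2)                # (B, d_model, T)
        x = x + self.conv(h).transpose(1, 2)                 # local features
        x = x + self.ffn(self.ffn_norm(x))
        return x


class MFAConformer(nn.Module):
    def __init__(self, n_mels=80, d_model=256, n_blocks=6, emb_dim=192):
        super().__init__()
        # Convolutional subsampling: halves the frame rate to reduce the
        # cost of the following self-attention layers.
        self.subsample = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.blocks = nn.ModuleList(
            [SimplifiedConformerBlock(d_model) for _ in range(n_blocks)])
        # Multi-scale feature aggregation: the outputs of *all* Conformer
        # blocks are concatenated along the feature axis before pooling.
        cat_dim = d_model * n_blocks
        self.embedding = nn.Linear(2 * cat_dim, emb_dim)  # mean + std pooled

    def forward(self, feats):                            # feats: (B, T, n_mels)
        x = self.subsample(feats.transpose(1, 2)).transpose(1, 2)
        outputs = []
        for block in self.blocks:
            x = block(x)
            outputs.append(x)
        x = torch.cat(outputs, dim=-1)                   # (B, T', d_model * n_blocks)
        pooled = torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)
        return self.embedding(pooled)                    # fixed-size speaker embedding


if __name__ == "__main__":
    model = MFAConformer()
    fbank = torch.randn(2, 300, 80)                      # 2 utterances, 300 frames
    print(model(fbank).shape)                            # torch.Size([2, 192])

The key design point illustrated here is the aggregation step: because each block's output is kept and concatenated, shallow (more local) and deep (more global) representations all contribute to the pooled speaker embedding, rather than only the last block's output.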