Paper title
Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms
Paper authors
Abstract
Recent advances in deep learning have facilitated the design of speaker verification systems that take raw waveforms directly as input. For example, RawNet extracts speaker embeddings from raw waveforms, which simplifies the processing pipeline and achieves competitive performance. In this study, we improve RawNet by scaling feature maps using various methods. The proposed mechanism uses a scale vector, produced with a sigmoid non-linearity, whose dimensionality equals the number of filters in a given feature map. Using the scale vector, we propose to scale the feature map multiplicatively, additively, or both. In addition, we investigate replacing the first convolutional layer with the sinc-convolution layer of SincNet. Experiments on the VoxCeleb1 evaluation dataset demonstrate the effectiveness of the proposed methods: the best-performing system halves the equal error rate relative to the original RawNet. Expanded evaluation results obtained using the VoxCeleb1-E and VoxCeleb1-H protocols marginally outperform existing state-of-the-art systems.
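The scaling mechanism described in the abstract can be illustrated with a minimal sketch in plain Python. The abstract does not say how the scale vector is derived, so the sketch assumes it comes from averaging each filter's output over time, applying a dense layer, and then a sigmoid. The function names (`scale_vector`, `scale_feature_map`), the `weights` and `bias` parameters, and the ordering used for the combined mode, `(x + s) * s`, are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def scale_vector(feature_map, weights, bias):
    """Derive one scale value per filter (assumed recipe, see lead-in).

    feature_map: list of C filters, each a list of T activations.
    weights:     C x C dense-layer weights (hypothetical parameter).
    bias:        list of C dense-layer biases (hypothetical parameter).
    """
    # Average each filter's activations over the time axis.
    pooled = [sum(channel) / len(channel) for channel in feature_map]
    # Dense layer followed by a sigmoid: one value in (0, 1) per filter.
    return [
        sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
        for row, b in zip(weights, bias)
    ]


def scale_feature_map(feature_map, s, mode="mul"):
    """Scale each filter of the feature map by its entry in s."""
    if mode == "mul":    # multiplicative scaling
        return [[x * s_c for x in ch] for ch, s_c in zip(feature_map, s)]
    if mode == "add":    # additive scaling
        return [[x + s_c for x in ch] for ch, s_c in zip(feature_map, s)]
    if mode == "both":   # add, then multiply (one possible ordering)
        return [[(x + s_c) * s_c for x in ch] for ch, s_c in zip(feature_map, s)]
    raise ValueError("mode must be 'mul', 'add', or 'both'")


# Toy feature map with 3 filters and 2 time steps.
feature_map = [[1.0, 3.0], [2.0, 2.0], [0.0, -4.0]]
weights = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
bias = [0.0, 0.0, 0.0]

s = scale_vector(feature_map, weights, bias)
scaled = scale_feature_map(feature_map, s, mode="both")
```

Because the scale vector has one entry per filter, every time step of a given filter is scaled by the same value, so the operation reweights filters rather than individual samples.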