Paper Title
AutoSpeech: Neural Architecture Search for Speaker Recognition
Paper Authors
Paper Abstract
Speaker recognition systems based on Convolutional Neural Networks (CNNs) are often built with off-the-shelf backbones such as VGG-Net or ResNet. However, these backbones were originally proposed for image classification, and therefore may not be a natural fit for speaker recognition. Due to the prohibitive complexity of manually exploring the design space, we propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech. Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times. The final speaker recognition model can be obtained by training the derived CNN model through the standard scheme. To evaluate the proposed approach, we conduct experiments on both speaker identification and speaker verification tasks using the VoxCeleb1 dataset. Results demonstrate that the CNN architectures derived from the proposed approach significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
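The abstract describes a two-stage pipeline: search for the best operation combination inside a small neural cell, then derive the full CNN by stacking that cell. The toy sketch below illustrates only that control flow, not the paper's actual method: the candidate "operations" are plain numeric functions, and `proxy_score` is a hypothetical stand-in for validation accuracy (a real NAS system would search over convolution/pooling operations and evaluate trained weights).

```python
# Toy illustration of the two-stage search-then-stack idea (all names here
# are hypothetical; real NAS searches over neural operations, not arithmetic).

CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double":   lambda x: 2 * x,
    "square":   lambda x: x * x,
}

def proxy_score(op_name, x=3):
    """Stand-in for the validation score of a one-op cell (assumption)."""
    target = 9  # pretend the task is to map 3 -> 9
    return -abs(CANDIDATE_OPS[op_name](x) - target)

def search_cell():
    """Stage 1: identify the best operation for the cell."""
    return max(CANDIDATE_OPS, key=proxy_score)

def derive_model(cell_op, depth=3):
    """Stage 2: derive the final model by stacking the searched cell."""
    op = CANDIDATE_OPS[cell_op]
    def model(x):
        for _ in range(depth):
            x = op(x)
        return x
    return model

best_op = search_cell()               # "square" maps 3 -> 9 exactly
model = derive_model(best_op, depth=2)
```

After this derivation step, the stacked model would be trained from scratch through the standard scheme, as the abstract notes.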