Paper Title
Learnable Spectro-temporal Receptive Fields for Robust Voice Type Discrimination
Authors
Abstract
Voice Type Discrimination (VTD) refers to discriminating, within a recording, between regions where speech was produced by speakers physically in proximity to the recording device ("Live Speech") and regions containing speech and other types of audio that were played back, such as traffic noise and television broadcasts ("Distractor Audio"). In this work, we propose a deep-learning-based VTD system that features an initial layer of learnable spectro-temporal receptive fields (STRFs). Our approach is also shown to provide very strong performance on a similar spoofing detection task in the ASVspoof 2019 challenge. We evaluate our approach on a new standardized VTD database that was collected to support research in this area. In particular, we study the effect of using learnable STRFs compared to static STRFs or unconstrained kernels. We also show that, for spoofing detection in the presence of VTD distractor noise, our system consistently improves on a competitive baseline system across a wide range of signal-to-noise ratios.