Paper Title
Cleanformer: A multichannel array configuration-invariant neural enhancement frontend for ASR in smart speakers
Paper Authors
Paper Abstract
This work introduces the Cleanformer, a streaming, multichannel, neural-network-based enhancement frontend for automatic speech recognition (ASR). The model has a conformer-based architecture that takes as input a single channel each of the raw and enhanced signals, and uses self-attention to derive a time-frequency mask. The enhanced input is generated by a multichannel adaptive noise cancellation algorithm known as Speech Cleaner, which makes use of noise context to derive its filter taps. The time-frequency mask is applied to the noisy input to produce enhanced output features for ASR. Detailed evaluations are presented on simulated and re-recorded datasets with speech-based and non-speech-based noise, showing significant reductions in word error rate (WER) when using a large-scale state-of-the-art ASR model. The Cleanformer is also shown to significantly outperform enhancement using a beamformer with ideal steering. The enhancement model is agnostic to the number of microphones and the array configuration and can therefore be used with different microphone arrays without retraining. Performance is demonstrated to improve with more microphones, up to four, with each additional microphone providing a smaller marginal benefit. Specifically, at an SNR of -6 dB, relative WER improvements of about 80\% are shown in both noise conditions.
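The core enhancement step described in the abstract is masking: the model predicts a real-valued time-frequency mask and multiplies it element-wise with the noisy input spectrogram to suppress noise-dominated bins. The minimal sketch below illustrates that step only; the random mask is a stand-in for the conformer's prediction, and `apply_tf_mask` is a hypothetical helper name, not part of the paper's code.

```python
import numpy as np

def apply_tf_mask(noisy_stft: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Apply a real-valued time-frequency mask to a complex noisy STFT.

    noisy_stft: complex array, shape (frames, freq_bins).
    mask: real array in [0, 1] with the same shape. In the Cleanformer
    this mask is predicted by a conformer attending over the raw and
    Speech-Cleaner-enhanced channels; here it is a placeholder.
    """
    return mask * noisy_stft

# Toy data standing in for a real STFT and a predicted mask.
rng = np.random.default_rng(0)
frames, bins = 10, 257
noisy = rng.standard_normal((frames, bins)) + 1j * rng.standard_normal((frames, bins))
mask = rng.uniform(0.0, 1.0, size=(frames, bins))

enhanced = apply_tf_mask(noisy, mask)
assert enhanced.shape == noisy.shape
```

Because the mask lies in [0, 1], masking can only attenuate each time-frequency bin, never amplify it; the enhanced magnitude spectrogram (or features derived from it) is then fed to the ASR model.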