Paper title
Audio-visual multi-channel speech separation, dereverberation and recognition
Paper authors
Paper abstract
Despite the rapid advance of automatic speech recognition (ASR) technologies, accurate recognition of cocktail party speech characterised by the interference from overlapping speakers, background noise and room reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, audio-visual speech enhancement techniques have been developed, although predominantly targeting overlapping speech separation and recognition tasks. In this paper, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all three stages of the system is proposed. The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches based on DNN-WPE and spectral mapping respectively. The learning cost function mismatch between the separation and dereverberation models and their integration with the back-end recognition system is minimised using fine-tuning on the MSE and LF-MMI criteria. Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline audio-visual multi-channel speech separation and recognition system containing no dereverberation module by a statistically significant word error rate (WER) reduction of 2.06% absolute (8.77% relative).
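As a quick sanity check of the reported results, the baseline WER is not stated in the abstract but can be inferred from the two reduction figures: a 2.06% absolute reduction corresponding to an 8.77% relative reduction implies a baseline of roughly 23.5% WER. The sketch below verifies this arithmetic; the variable names are illustrative, not from the paper.

```python
# Recover the (unstated) baseline WER implied by the abstract's figures:
# a 2.06-point absolute reduction equals an 8.77% relative reduction,
# so baseline_wer * 0.0877 = 2.06.
absolute_reduction = 2.06    # percentage points
relative_reduction = 0.0877  # fraction of the baseline WER

baseline_wer = absolute_reduction / relative_reduction  # inferred baseline, ~23.49%
proposed_wer = baseline_wer - absolute_reduction        # proposed system, ~21.43%

print(round(baseline_wer, 2), round(proposed_wer, 2))
```

Both reduction figures are consistent only for this baseline; the check confirms the two numbers in the abstract describe the same improvement.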