How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

Paper title

How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

Authors

George Sterpu, Christian Saam, Naomi Harte

Abstract

Audio-Visual Speech Recognition (AVSR) seeks to model, and thereby exploit, the dynamic relationship between a human voice and the corresponding mouth movements. A recently proposed multimodal fusion strategy, AV Align, based on state-of-the-art sequence to sequence neural networks, attempts to model this relationship by explicitly aligning the acoustic and visual representations of speech. This study investigates the inner workings of AV Align and visualises the audio-visual alignment patterns. Our experiments are performed on two of the largest publicly available AVSR datasets, TCD-TIMIT and LRS2. We find that AV Align learns to align acoustic and visual representations of speech at the frame level on TCD-TIMIT in a generally monotonic pattern. We also determine the cause of initially seeing no improvement over audio-only speech recognition on the more challenging LRS2. We propose a regularisation method which involves predicting lip-related Action Units from visual representations. Our regularisation method leads to better exploitation of the visual modality, with performance improvements between 7% and 30% depending on the noise level. Furthermore, we show that the alternative Watch, Listen, Attend, and Spell network is affected by the same problem as AV Align, and that our proposed approach can effectively help it learn visual representations. Our findings validate the suitability of the regularisation method to AVSR and encourage researchers to rethink the multimodal convergence problem when having one dominant modality.
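
The abstract describes the regularisation only at a high level: an auxiliary task that predicts lip-related Action Units from the visual representations. One common way to express such an auxiliary task is a weighted sum of the recognition loss and the auxiliary prediction loss. The sketch below illustrates that general idea and is not the authors' implementation. The function names, the mean-squared-error form of the Action Unit loss, and the `au_weight` parameter are all assumptions made for illustration.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target_index])


def mean_squared_error(predicted, target):
    """Average squared difference between predicted and reference values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def regularised_loss(asr_probs, asr_target, au_predicted, au_target, au_weight=1.0):
    """Recognition loss plus a weighted auxiliary Action Unit loss.

    Hypothetical formulation: the abstract does not state the loss form
    or the weighting, so both are assumptions of this sketch.
    """
    asr_loss = cross_entropy(asr_probs, asr_target)
    au_loss = mean_squared_error(au_predicted, au_target)
    return asr_loss + au_weight * au_loss


# Example: one decoding step over a two-symbol vocabulary,
# with two lip-related Action Unit intensities as auxiliary targets.
loss = regularised_loss(
    asr_probs=[0.5, 0.5],
    asr_target=0,
    au_predicted=[1.0, 1.0],
    au_target=[0.0, 1.0],
    au_weight=2.0,
)
```

In a formulation like this, the auxiliary term gives the visual front-end a training signal that does not depend on the audio stream, which is one plausible reading of how the regularisation encourages the network to learn visual representations when audio is the dominant modality.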
