Paper Title

Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition?

Paper Authors

Abhinav Shukla, Stavros Petridis, Maja Pantic

Paper Abstract

Self-supervised learning has attracted plenty of recent research interest. However, most works for self-supervision in speech are typically unimodal and there has been limited work that studies the interaction between audio and visual modalities for cross-modal self-supervision. This work (1) investigates visual self-supervision via face reconstruction to guide the learning of audio representations; (2) proposes an audio-only self-supervision approach for speech representation learning; (3) shows that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features that are more robust in noisy conditions; (4) shows that self-supervised pretraining can outperform fully supervised training and is especially useful to prevent overfitting on smaller-sized datasets. We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition. We outperform existing self-supervised methods for all tested downstream tasks. Our results demonstrate the potential of visual self-supervision for audio feature learning and suggest that joint visual and audio self-supervision leads to more informative audio representations for speech and emotion recognition.
