Paper Title
Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text
Paper Authors
Paper Abstract
In this paper, we propose a novel speech emotion recognition model called the Cross Attention Network (CAN) that uses aligned audio and text signals as inputs. It is inspired by the fact that humans recognize speech as a combination of simultaneously produced acoustic and textual signals. First, our method segments the audio signal and the underlying text into an equal number of steps in an aligned way, so that the same time step of each sequential signal covers the same time span. Together with this technique, we apply cross attention to aggregate the sequential information from the aligned signals. In cross attention, each modality is first aggregated independently by applying a global attention mechanism to it. Then, the attention weights of each modality are applied directly to the other modality in a crossed way, so that the CAN gathers audio and text information from the same time steps according to each modality. In experiments conducted on the standard IEMOCAP dataset, our model relatively outperforms the state-of-the-art systems by 2.66% and 3.18% in terms of weighted and unweighted accuracy, respectively.
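
The abstract describes the crossed use of attention weights but gives no implementation details. The following is a minimal sketch, assuming a PyTorch implementation in which global attention produces one scalar weight per aligned time step and the weights of each modality are reused on the other modality; all class, layer, and parameter names (CrossAttentionSketch, audio_scorer, text_scorer) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionSketch(nn.Module):
    """Illustrative cross attention over pre-aligned audio/text sequences.

    Both inputs are assumed to be segmented so that time step t of the audio
    sequence covers the same time span as time step t of the text sequence.
    """

    def __init__(self, audio_dim: int, text_dim: int):
        super().__init__()
        # Per-modality global attention scorers: one scalar score per step.
        self.audio_scorer = nn.Linear(audio_dim, 1)
        self.text_scorer = nn.Linear(text_dim, 1)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T, audio_dim); text: (batch, T, text_dim); same T.
        audio_weights = F.softmax(self.audio_scorer(audio), dim=1)  # (B, T, 1)
        text_weights = F.softmax(self.text_scorer(text), dim=1)     # (B, T, 1)

        # Self-attended summaries: global attention within each modality.
        audio_self = (audio_weights * audio).sum(dim=1)
        text_self = (text_weights * text).sum(dim=1)

        # Cross-attended summaries: the weights computed on one modality are
        # applied to the aligned time steps of the other modality.
        audio_cross = (text_weights * audio).sum(dim=1)
        text_cross = (audio_weights * text).sum(dim=1)

        # Concatenate all summaries for a downstream emotion classifier.
        return torch.cat([audio_self, audio_cross, text_self, text_cross], dim=-1)


# Hypothetical usage: 4 utterances aligned into 50 steps per modality.
model = CrossAttentionSketch(audio_dim=128, text_dim=300)
audio = torch.randn(4, 50, 128)
text = torch.randn(4, 50, 300)
features = model(audio, text)  # shape: (4, 128 + 128 + 300 + 300)
```

The key design point this sketch tries to reflect is that the alignment step makes the two sequences share a common time axis, which is what allows attention weights learned on one modality to be meaningfully reused on the other.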