Paper Title

A Comparative Study of Data Augmentation Techniques for Deep Learning Based Emotion Recognition

Authors

Ravi Shankar, Abdouh Harouna Kenfack, Arjun Somayazulu, Archana Venkataraman

Abstract

Automated emotion recognition in speech is a long-standing problem. While early work on emotion recognition relied on hand-crafted features and simple classifiers, the field has now embraced end-to-end feature learning and classification using deep neural networks. In parallel to these models, researchers have proposed several data augmentation techniques to increase the size and variability of existing labeled datasets. Despite many seminal contributions in the field, we still have a poor understanding of the interplay between the network architecture and the choice of data augmentation. Moreover, only a handful of studies demonstrate the generalizability of a particular model across multiple datasets, which is a prerequisite for robust real-world performance. In this paper, we conduct a comprehensive evaluation of popular deep learning approaches for emotion recognition. To eliminate bias, we fix the model architectures and optimization hyperparameters using the VESUS dataset and then use repeated 5-fold cross validation to evaluate the performance on the IEMOCAP and CREMA-D datasets. Our results demonstrate that long-range dependencies in the speech signal are critical for emotion recognition and that speed/rate augmentation offers the most robust performance gain across models.
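The speed/rate augmentation that the abstract identifies as the most robust can be sketched as simple waveform resampling: playing the utterance back faster or slower before feature extraction. The snippet below is a minimal illustration using linear interpolation, not the authors' actual pipeline; the function name and the perturbation rates (0.9, 1.0, 1.1) are placeholder choices, and production systems typically use a proper polyphase resampler or a SoX-style speed effect instead.

```python
import numpy as np

def speed_augment(signal: np.ndarray, rate: float) -> np.ndarray:
    """Resample a 1-D waveform to simulate speed perturbation.

    rate > 1.0 speeds the utterance up (shorter output);
    rate < 1.0 slows it down. Linear interpolation is used
    here purely for clarity.
    """
    n_out = int(round(len(signal) / rate))
    # Fractional sample positions in the original signal.
    positions = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(positions, np.arange(len(signal)), signal)

# Typical usage: expand the training set by perturbing each
# utterance at a few fixed rates (a 1 s dummy tone stands in
# for real speech here).
audio = np.sin(2 * np.pi * 5 * np.linspace(0.0, 1.0, 16000))
augmented = [speed_augment(audio, r) for r in (0.9, 1.0, 1.1)]
```

Note that this kind of resampling shifts pitch along with duration; augmentation libraries often offer a separate tempo-only effect when pitch should be preserved.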
