Paper Title

A study on cross-corpus speech emotion recognition and data augmentation

Authors

Norbert Braunschweiler, Rama Doddipatla, Simon Keizer, Svetlana Stoyanchev

Abstract

Models that can handle a wide range of speakers and acoustic conditions are essential in speech emotion recognition (SER). Often, these models tend to show mixed results when presented with speakers or acoustic conditions that were not seen during training. This paper investigates the impact of cross-corpus data complementation and data augmentation on the performance of SER models in matched (test set from the same corpus) and mismatched (test set from a different corpus) conditions. Investigations using six emotional speech corpora that include single and multiple speakers as well as variations in emotion style (acted, elicited, natural) and recording conditions are presented. Observations show that, as expected, models trained on single corpora perform best in matched conditions, while performance decreases by 10-40% in mismatched conditions, depending on corpus-specific features. Models trained on mixed corpora can be more stable in mismatched contexts, with performance reductions ranging from 1 to 8% when compared with single-corpus models in matched conditions. Data augmentation yields additional gains of up to 4% and seems to benefit mismatched conditions more than matched ones.
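The waveform-level data augmentation discussed in the abstract can be illustrated with two transforms commonly used in SER pipelines: additive noise at a target signal-to-noise ratio and speed perturbation. This is a minimal NumPy sketch of generic techniques, not the paper's actual augmentation pipeline; the function names and parameters are illustrative assumptions.

```python
import numpy as np

def add_noise(wave, snr_db, rng):
    # Add Gaussian noise scaled so the signal-to-noise ratio is snr_db (dB).
    # (Illustrative; real pipelines often mix in recorded noise instead.)
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def speed_perturb(wave, factor):
    # Change playback speed by resampling with linear interpolation;
    # factor > 1 shortens the signal (faster speech).
    n_out = int(len(wave) / factor)
    idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(idx, np.arange(len(wave)), wave)

rng = np.random.default_rng(0)
# 1 second of a 440 Hz tone at 16 kHz stands in for a speech utterance.
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = add_noise(wave, snr_db=10, rng=rng)
fast = speed_perturb(wave, factor=1.1)
print(len(wave), len(fast))  # the perturbed signal is shorter
```

In practice such transforms are applied on the fly during training so each epoch sees differently corrupted copies, which is one plausible way the reported gains in mismatched conditions could arise.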
