Paper title
CPED: A Large-Scale Chinese Personalized and Emotional Dialogue Dataset for Conversational AI
Paper authors
Paper abstract
Human language expression is based on the subjective construal of the situation rather than on objective truth conditions, which means that speakers' personalities and emotions, after cognitive processing, have an important influence on conversation. However, most existing datasets for conversational AI ignore human personalities and emotions, or consider only part of them. Although large-scale pre-trained language models are widely used, dialogue systems still find it difficult to understand speakers' personalities and emotions. To consider both personalities and emotions in conversation generation, we propose CPED, a large-scale Chinese personalized and emotional dialogue dataset, which consists of multi-source knowledge related to empathy and personal characteristics. This knowledge covers gender, Big Five personality traits, 13 emotions, 19 dialogue acts, and 10 scenes. CPED contains more than 12K dialogues of 392 speakers from 40 TV shows. We release the textual dataset with audio features and video features in accordance with copyright claims, privacy issues, and the terms of service of video platforms. We provide a detailed description of the CPED construction process and introduce three tasks for conversational AI: personality recognition, emotion recognition in conversations, and personalized and emotional conversation generation. Finally, we provide baseline systems for these tasks and consider the function of speakers' personalities and emotions in conversation. Our motivation is to propose a dataset that can be widely adopted by the NLP community as a new open benchmark for conversational AI research. The full dataset is available at https://github.com/scutcyr/CPED.