Czech Dataset for Cross-lingual Subjectivity Classification

Paper title

Czech Dataset for Cross-lingual Subjectivity Classification

Authors

Pavel Přibáň, Josef Steinberger

Abstract

In this paper, we introduce a new Czech subjectivity dataset of 10k manually annotated subjective and objective sentences from movie reviews and descriptions. Our prime motivation is to provide a reliable dataset that can be used with the existing English dataset as a benchmark to test the ability of pre-trained multilingual models to transfer knowledge from Czech to English and vice versa. Two annotators annotated the dataset, reaching a Cohen's κ inter-annotator agreement of 0.83. To the best of our knowledge, this is the first subjectivity dataset for the Czech language. We also created an additional dataset that consists of 200k automatically labeled sentences. Both datasets are freely available for research purposes. Furthermore, we fine-tune five pre-trained BERT-like models to set a monolingual baseline for the new dataset, achieving an accuracy of 93.56%. We also fine-tune models on the existing English dataset and obtain results on par with the current state of the art. Finally, we perform zero-shot cross-lingual subjectivity classification between Czech and English to verify the usability of our dataset as a cross-lingual benchmark. We compare and discuss the cross-lingual and monolingual results and the ability of multilingual models to transfer knowledge between languages.
