BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate Speech in Different Social Contexts

Authors

Nauros Romim, Mosahed Ahmed, Md. Saiful Islam, Arnab Sen Sharma, Hriteshwar Talukder, Mohammad Ruhul Amin

Abstract

Social media platforms and online streaming services have spawned a new breed of Hate Speech (HS). Due to the massive amount of user-generated content on these sites, modern machine learning techniques have been found to be a feasible and cost-effective way to tackle this problem. However, training generalizable models requires linguistically diverse datasets covering the different social contexts in which offensive language is typically used. In this paper, we identify the shortcomings of existing Bangla HS datasets and introduce BD-SHS, a large manually labeled dataset that includes HS in different social contexts. The labeling criteria were prepared following a hierarchical annotation process, which is, to the best of our knowledge, the first of its kind for Bangla HS. The dataset includes more than 50,200 offensive comments crawled from online social networking sites and is at least 60% larger than any existing Bangla HS dataset. We present benchmark results for our dataset by training different NLP models, the best of which achieves an F1-score of 91.0%. In our experiments, we found that a word embedding trained exclusively on 1.47 million comments from social media and streaming sites consistently resulted in better modeling of HS detection than other pre-trained embeddings. Our dataset and all accompanying code are publicly available at github.com/naurosromim/hate-speech-dataset-for-Bengali-social-media.
