Paper Title
Emojis as Anchors to Detect Arabic Offensive Language and Hate Speech
Paper Authors
Paper Abstract
We introduce a generic, language-independent method to collect a large percentage of offensive and hate tweets regardless of their topics or genres. We harness the extralinguistic information embedded in emojis to collect a large number of offensive tweets. We apply the proposed method to Arabic tweets and compare it with English tweets, analysing key cultural differences. We observe consistent usage of these emojis to signal offensiveness across different timespans on Twitter. We manually annotate and publicly release the largest Arabic dataset for offensive, fine-grained hate speech, vulgar, and violent content. Furthermore, we benchmark the dataset for detecting offensiveness and hate speech using different transformer architectures and perform an in-depth linguistic analysis. We evaluate our models on external datasets: a Twitter dataset collected using a completely different method, and a multi-platform dataset containing comments from Twitter, YouTube, and Facebook, to assess generalization capability. Competitive results on these datasets suggest that the data collected using our method captures universal characteristics of offensive language. Our findings also highlight the common words used in offensive communications, common targets for hate speech, and specific patterns in violent tweets, and pinpoint common classification errors that can be attributed to limitations of NLP models. We observe that even state-of-the-art transformer models may fail to account for culture, background, and context, or to understand nuances present in real-world data, such as sarcasm.
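To make the emoji-anchoring idea concrete, below is a minimal Python sketch of topic-agnostic candidate collection, assuming a hypothetical seed set SEED_EMOJIS and a simple iterable of tweet dicts; the authors' actual seed emojis, Twitter API setup, and filtering pipeline are not specified in the abstract.

```python
# Minimal sketch: collect candidate offensive tweets by anchoring on emojis.
# SEED_EMOJIS is an illustrative placeholder, not the paper's actual list.
SEED_EMOJIS = {"🖕", "🤬", "💩", "🐷", "👞"}

def contains_seed_emoji(text: str) -> bool:
    """Return True if the tweet text contains any anchor emoji."""
    return any(emoji in text for emoji in SEED_EMOJIS)

def collect_candidates(tweets):
    """Yield tweets likely to be offensive, regardless of topic or genre,
    by keeping only those containing a seed emoji."""
    for tweet in tweets:
        if contains_seed_emoji(tweet["text"]):
            yield tweet

# Usage with a toy stream of tweet dicts:
stream = [{"text": "يوم جميل 🌞"}, {"text": "some insult 🖕"}]
candidates = list(collect_candidates(stream))
print(len(candidates))  # -> 1
```

Because the anchor is an emoji rather than a keyword list, this filter is language-independent: the same seed set applies to Arabic and English streams alike, which is what enables the cross-cultural comparison described above.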
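For the benchmarking step, the following is a minimal sketch of offensiveness classification with a transformer encoder, assuming the Hugging Face transformers library; the checkpoint name aubmindlab/bert-base-arabertv02 and the binary label head are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: score tweets with a transformer sequence classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "aubmindlab/bert-base-arabertv02"  # assumed Arabic BERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Two labels for illustration: offensive vs. not offensive. The freshly
# initialized head must be fine-tuned on labeled data before predictions
# are meaningful.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def classify(texts):
    """Return 0/1 predictions for a batch of tweet texts."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return logits.argmax(dim=-1).tolist()
```

Evaluating such a fine-tuned model on external datasets collected by other methods, as the abstract describes, is what supports the claim that emoji-anchored data captures universal characteristics of offensive language.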