Paper Title
Reducing Unintended Identity Bias in Russian Hate Speech Detection
Paper Authors
Paper Abstract
Toxicity has become a grave problem for many online communities and is growing across many languages, including Russian. Hate speech creates an environment of intimidation and discrimination, and may even incite real-world violence. Researchers and social platforms alike have focused on developing models to detect toxicity in online communication. A common problem with these models is bias towards certain words (e.g. woman, black, jew) that are not toxic themselves but act as triggers for the classifier due to model weaknesses. In this paper, we describe our efforts towards classifying hate speech in Russian and propose simple techniques for reducing unintended bias, such as using terms and words related to protected identities as context for generating training data with language models, and applying word dropout to such words.
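The two debiasing techniques named in the abstract can be illustrated with a short sketch. Below is a minimal Python example, assuming a hypothetical IDENTITY_TERMS lexicon and, for the data-generation step, a Hugging Face text-generation pipeline with a placeholder model; the actual lexicon, model, and hyperparameters used by the authors are not given in the abstract.

```python
import random
from transformers import pipeline  # assumes the transformers library is installed

# Hypothetical lexicon of protected-identity terms; the paper's actual
# list is not reproduced in the abstract.
IDENTITY_TERMS = {"woman", "black", "jew"}

def identity_word_dropout(tokens, p_drop=0.5, rng=random):
    """Randomly remove identity-related tokens so the classifier cannot
    treat their mere presence as a signal of toxicity."""
    return [t for t in tokens
            if t.lower() not in IDENTITY_TERMS or rng.random() >= p_drop]

def generate_identity_contexts(term, n=3, model="gpt2"):
    """Prompt a causal language model with an identity term as context to
    synthesize additional (presumably non-toxic) training sentences.
    The model id here is a placeholder; a Russian-language model would be
    used in practice."""
    generator = pipeline("text-generation", model=model)
    outputs = generator(term, max_new_tokens=30, do_sample=True,
                        num_return_sequences=n)
    return [o["generated_text"] for o in outputs]

# Word dropout: the identity term may vanish from a non-toxic sentence,
# while the rest is kept as a training instance.
print(identity_word_dropout("the woman wrote a great review".split()))
```

The design intuition is that dropping identity terms forces the model to rely on the surrounding context, while LM-generated sentences containing those terms enlarge the pool of non-toxic usages seen during training.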