Paper Title

Discovering and Categorising Language Biases in Reddit

Authors

Ferrer, Xavier, van Nuenen, Tom, Such, Jose M., Criado, Natalia

Abstract

We present a data-driven approach using word embeddings to discover and categorise language biases on the discussion platform Reddit. As spaces for isolated user communities, platforms such as Reddit are increasingly connected to issues of racism, sexism and other forms of discrimination. Hence, there is a need to monitor the language of these groups. One of the most promising AI approaches to trace linguistic biases in large textual datasets involves word embeddings, which transform text into high-dimensional dense vectors and capture semantic relations between words. Yet, previous studies require predefined sets of potential biases to study, e.g., whether gender is more or less associated with particular types of jobs. This makes these approaches unfit to deal with smaller and community-centric datasets such as those on Reddit, which contain smaller vocabularies and slang, as well as biases that may be particular to that community. This paper proposes a data-driven approach to automatically discover language biases encoded in the vocabulary of online discourse communities on Reddit. In our approach, protected attributes are connected to evaluative words found in the data, which are then categorised through a semantic analysis system. We verify the effectiveness of our method by comparing the biases we discover in the Google News dataset with those found in previous literature. We then successfully discover gender bias, religion bias, and ethnic bias in different Reddit communities. We conclude by discussing potential application scenarios and limitations of this data-driven bias discovery method.
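The core idea of connecting protected attributes to evaluative words via embeddings can be sketched with cosine similarity: score each evaluative word by how much closer its vector sits to one attribute word set than to another. This is an illustrative toy, not the paper's actual pipeline; the 3-dimensional vectors and word sets below are hand-made stand-ins for real word embeddings trained on a community's text.

```python
# Illustrative sketch of embedding-based bias scoring (not the paper's
# exact method). Toy 3-d vectors stand in for trained word embeddings.
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def bias_score(vectors, word, set_a, set_b):
    """Mean similarity to attribute set A minus mean similarity to set B.
    Positive -> the word leans towards A; negative -> towards B."""
    sim_a = np.mean([cos(vectors[word], vectors[w]) for w in set_a])
    sim_b = np.mean([cos(vectors[word], vectors[w]) for w in set_b])
    return sim_a - sim_b

# Hand-crafted embeddings for demonstration only.
vectors = {
    "he":        np.array([1.0, 0.1, 0.0]),
    "him":       np.array([0.9, 0.2, 0.0]),
    "she":       np.array([0.1, 1.0, 0.0]),
    "her":       np.array([0.2, 0.9, 0.0]),
    "strong":    np.array([0.8, 0.3, 0.1]),   # leans towards the male set
    "beautiful": np.array([0.3, 0.8, 0.1]),   # leans towards the female set
}

male, female = ["he", "him"], ["she", "her"]
evaluative = ["strong", "beautiful"]

# Rank evaluative words from most male-associated to most female-associated.
ranked = sorted(evaluative,
                key=lambda w: bias_score(vectors, w, male, female),
                reverse=True)
print(ranked)
```

In a real setting the evaluative candidates would come from the data itself (e.g., adjectives frequently co-occurring with the attribute words), which is what makes the approach data-driven rather than dependent on a predefined bias test.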
