Paper Title
Contextualizing Hate Speech Classifiers with Post-hoc Explanation
Paper Authors

Paper Abstract
Hate speech classifiers trained on imbalanced datasets struggle to determine if group identifiers like "gay" or "black" are used in offensive or prejudiced ways. Such biases manifest in false positives when these identifiers are present, due to models' inability to learn the contexts which constitute a hateful usage of identifiers. We extract SOC post-hoc explanations from fine-tuned BERT classifiers to efficiently detect bias towards identity terms. Then, we propose a novel regularization technique based on these explanations that encourages models to learn from the context of group identifiers in addition to the identifiers themselves. Our approach improved over baselines in limiting false positives on out-of-domain data while maintaining or improving in-domain performance. Project page: https://inklab.usc.edu/contextualize-hate-speech/.
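To make the regularization idea concrete, here is a minimal sketch of explanation-based regularization in PyTorch. It is not the authors' code: it uses a toy mean-pooling classifier in place of a fine-tuned BERT, and plain occlusion (masking the identifier tokens) as a stand-in for the full SOC explanation, which additionally marginalizes over sampled contexts. Names such as `ToyClassifier`, `MASK_ID`, and the weight `lam` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MASK_ID = 0  # assumed id of the mask/pad token used for occlusion


class ToyClassifier(nn.Module):
    """Tiny stand-in for a fine-tuned BERT hate speech classifier."""
    def __init__(self, vocab_size=1000, dim=32, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        return self.fc(self.emb(token_ids).mean(dim=1))  # (batch, n_classes)


def occlusion_importance(model, token_ids, identifier_mask, hate_class=1):
    """Importance of group-identifier tokens, measured as the drop in the
    hate-class logit when those tokens are masked out. SOC in the paper
    additionally averages over sampled replacement contexts."""
    full_logit = model(token_ids)[:, hate_class]
    occluded = token_ids.masked_fill(identifier_mask.bool(), MASK_ID)
    occluded_logit = model(occluded)[:, hate_class]
    return full_logit - occluded_logit


def regularized_loss(model, token_ids, labels, identifier_mask, lam=0.1):
    # Standard classification loss ...
    ce = F.cross_entropy(model(token_ids), labels)
    # ... plus a penalty on importance attributed to the identifiers alone,
    # encouraging the model to rely on their surrounding context instead.
    phi = occlusion_importance(model, token_ids, identifier_mask)
    return ce + lam * (phi ** 2).mean()


# Usage on a fake batch: two 6-token inputs with an identifier at position 2.
model = ToyClassifier()
tokens = torch.randint(1, 1000, (2, 6))
ident_mask = torch.zeros(2, 6, dtype=torch.long)
ident_mask[:, 2] = 1
labels = torch.tensor([0, 1])
loss = regularized_loss(model, tokens, labels, ident_mask)
loss.backward()
```

The design point is that the regularizer penalizes how much of the hate-class score is explained by the identifier tokens themselves, so minimizing the combined loss pushes the classifier to ground its decision in the context of terms like "gay" or "black" rather than in their mere presence.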