Paper Title

QutNocturnal@HASOC'19: CNN for Hate Speech and Offensive Content Identification in Hindi Language

Authors

Md Abul Bashar, Richi Nayak

Abstract

We describe our top-ranked solution to Task 1 for Hindi in the HASOC contest organised at FIRE 2019. The task is to identify hate speech and offensive language in Hindi. More specifically, it is a binary classification problem in which a system must classify tweets into two classes: (a) Hate and Offensive (HOF) and (b) Not Hate or Offensive (NOT). In contrast to the popular idea of pretraining word vectors (a.k.a. word embeddings) on a large corpus from a general domain such as Wikipedia, we used a relatively small collection of relevant tweets (i.e. random and sarcasm tweets in Hindi and Hinglish) for pretraining. We trained a Convolutional Neural Network (CNN) on top of the pretrained word vectors. This approach ranked first among all teams for this task. Our approach could easily be adapted to other applications where the goal is to predict the class of a text when the provided context is limited.
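To make the pipeline in the abstract concrete, the following is a minimal sketch of the forward pass of a CNN text classifier over pretrained word vectors: embedding lookup, convolution over word windows, ReLU, max-over-time pooling, and a sigmoid output. It is not the authors' implementation. The function names, the toy vocabulary, the filter widths, and every numeric value are assumptions made for illustration, and the training step is omitted entirely.

```python
import math


def conv_max_pool(embedded, filt, bias):
    """Slide one filter over consecutive word windows, apply ReLU,
    and keep the largest activation (max-over-time pooling)."""
    width = len(filt)  # number of consecutive words the filter covers
    best = 0.0         # ReLU output is never below zero
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        activation = bias + sum(
            w * x
            for filt_row, vec in zip(filt, window)
            for w, x in zip(filt_row, vec)
        )
        best = max(best, activation)
    return best


def predict_hof_probability(tokens, embeddings, filters, biases,
                            out_weights, out_bias):
    """Return a score in (0, 1) for the HOF class of one tokenised tweet."""
    dim = len(next(iter(embeddings.values())))
    unknown = [0.0] * dim  # assumption: out-of-vocabulary words map to zeros
    embedded = [embeddings.get(tok, unknown) for tok in tokens]
    # Pad very short texts so that the widest filter still fits once.
    widest = max(len(f) for f in filters)
    while len(embedded) < widest:
        embedded.append(unknown)
    features = [conv_max_pool(embedded, f, b) for f, b in zip(filters, biases)]
    logit = out_bias + sum(w * f for w, f in zip(out_weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical 2-dimensional "pretrained" vectors; in the setting described
# above these would come from pretraining on a collection of relevant tweets.
toy_embeddings = {
    "word_a": [1.0, 0.0],
    "word_b": [0.0, 1.0],
    "word_c": [1.0, 1.0],
}
toy_filters = [
    [[0.5, -0.5]],              # one filter of width 1
    [[1.0, 0.0], [0.0, 1.0]],   # one filter of width 2
]
toy_biases = [0.0, -0.5]
toy_out_weights = [0.8, -0.3]   # untrained, illustrative values
toy_out_bias = 0.1

score = predict_hof_probability(
    ["word_a", "word_b", "word_c"],
    toy_embeddings, toy_filters, toy_biases, toy_out_weights, toy_out_bias,
)
label = "HOF" if score >= 0.5 else "NOT"
```

In practice the filters and output weights would be learned from the labelled tweets, and a deep learning library would replace the explicit loops. The sketch only shows how a fixed set of pretrained word vectors feeds the convolutional layer.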
