Paper title
A simple language-agnostic yet very strong baseline system for hate speech and offensive content identification
Authors
Abstract
The SATLab team proposes a system for automatically identifying hate speech and offensive content in tweets. It is based on a classical supervised algorithm fed only with character n-grams and is thus completely language-agnostic. After optimization of the feature weighting and the classifier parameters, it reached, in the multilingual HASOC 2021 challenge, a medium performance level in English, the language for which it is easy to develop deep learning approaches relying on many external linguistic resources, but a far better level for the two less-resourced languages, Hindi and Marathi. It even ranks first when performance is averaged over the three tasks in these languages, outperforming many deep learning approaches. These results suggest that it is an interesting reference level for evaluating the benefits of more complex approaches, such as deep learning or the use of complementary resources.
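To make the idea of a classifier fed only with character n-grams concrete, here is a minimal sketch in plain Python. The abstract does not name the supervised algorithm or the weighting scheme, so the nearest-centroid classifier, the `1 + log(count)` weighting, the n-gram range, the function names, and the toy training texts are all assumptions chosen for illustration, not the SATLab system itself.

```python
from collections import Counter
import math


def char_ngrams(text, n_min=1, n_max=3):
    """Count every character n-gram with n_min <= n <= n_max.

    The text is lowercased and padded with one space on each side so that
    word boundaries at the start and end are visible to the n-grams.
    No tokenizer or lexicon is used, which keeps this language-agnostic.
    """
    padded = f" {text.lower()} "
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def log_weight(counts):
    """Dampen raw counts with 1 + log(count) (an assumed weighting choice)."""
    return {gram: 1.0 + math.log(count) for gram, count in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(texts, labels):
    """Sum the weighted n-gram vectors of each class into one centroid."""
    centroids = {}
    for text, label in zip(texts, labels):
        centroid = centroids.setdefault(label, Counter())
        for gram, weight in log_weight(char_ngrams(text)).items():
            centroid[gram] += weight
    return centroids


def predict(centroids, text):
    """Return the label whose centroid is most similar to the text."""
    vector = log_weight(char_ngrams(text))
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Toy data, invented for this sketch (not HASOC data).
train_texts = ["you are awful", "awful awful person",
               "have a nice day", "nice work today"]
train_labels = ["OFF", "OFF", "NOT", "NOT"]
centroids = train_centroids(train_texts, train_labels)
```

Calling `predict(centroids, "so awful")` compares the weighted n-gram vector of the new text against each class centroid. In a realistic setting, the n-gram range and the weighting function are the kind of settings that would be tuned on held-out data, in the spirit of the optimization the abstract mentions.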