Paper Title
ADIMA: Abuse Detection In Multilingual Audio
Paper Authors
Abstract
Abusive content detection in spoken text can be addressed by performing Automatic Speech Recognition (ASR) and leveraging advancements in natural language processing. However, ASR models introduce latency and often perform sub-optimally on profane words, which are underrepresented in training corpora and are not spoken clearly or completely. Exploration of this problem entirely in the audio domain has largely been limited by the lack of audio datasets. To address these challenges, we propose ADIMA, a novel, linguistically diverse, ethically sourced, expert-annotated, and well-balanced multilingual profanity detection audio dataset comprising 11,775 audio samples in 10 Indic languages, spanning 65 hours and spoken by 6,446 unique users. Through quantitative experiments across monolingual and cross-lingual zero-shot settings, we take the first step in democratizing audio-based content moderation in Indic languages and set forth our dataset to pave the way for future work.