Paper Title
Building Low-Resource NER Models Using Non-Speaker Annotation
Paper Authors
Paper Abstract
In low-resource natural language processing (NLP), the key problems are a lack of target-language training data and a lack of native speakers to create it. Cross-lingual methods have had notable success in addressing these concerns, but in certain common circumstances, such as insufficient pre-training corpora or languages far from the source language, their performance suffers. In this work we propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations, provided by annotators with no prior experience in the target language. We recruit 30 participants in a carefully controlled annotation experiment with Indonesian, Russian, and Hindi. We show that the use of NS annotators produces results that are consistently on par with or better than cross-lingual methods built on modern contextual representations, and has the potential to outperform them with additional effort. We conclude with observations of common annotation patterns and recommended implementation practices, and motivate how NS annotations can be used in addition to prior methods for improved performance. For more details, see http://cogcomp.org/page/publication_view/941
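
As background for the comparison described in the abstract, the following minimal sketch (not code from the paper; the example sentences, entity types, and function names are illustrative assumptions) shows how NER annotations in BIO format are typically scored with span-level precision, recall, and F1, the standard way the output of annotators or models is evaluated against gold labels.

# Minimal sketch: span-level NER evaluation of BIO-tagged annotations.
# All data below is invented for illustration; span boundary handling is simplified.

def extract_spans(tags):
    """Return a set of (start, end, type) entity spans from a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):      # sentinel "O" closes a trailing span
        label = tag[2:] if "-" in tag else None
        if tag.startswith("I-") and etype == label:
            continue                                  # current span continues
        if start is not None:                         # close the open span
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            start, etype = i, label                   # open a new span (stray I- tolerated)
    return spans

def span_f1(gold_sequences, pred_sequences):
    """Micro-averaged span precision, recall, and F1 over parallel tag sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

if __name__ == "__main__":
    # Hypothetical gold labels vs. a non-speaker annotator's labels for one sentence.
    gold = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
    ns   = [["B-PER", "I-PER", "O", "O",     "O"]]
    print("P/R/F1: %.2f / %.2f / %.2f" % span_f1(gold, ns))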