Paper Title

IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

Authors

Fajri Koto, Afshin Rahimi, Jey Han Lau, Timothy Baldwin

Abstract

Although the Indonesian language is spoken by almost 200 million people and the 10th most spoken language in the world, it is under-represented in NLP research. Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization. In this work, we release the IndoLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse. We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM, in addition to benchmarking it against existing resources. Our experiments show that IndoBERT achieves state-of-the-art performance over most of the tasks in IndoLEM.
