Paper Title

IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

Authors

Fajri Koto, Afshin Rahimi, Jey Han Lau, Timothy Baldwin

Abstract

Although the Indonesian language is spoken by almost 200 million people and the 10th most spoken language in the world, it is under-represented in NLP research. Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization. In this work, we release the IndoLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse. We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM, in addition to benchmarking it against existing resources. Our experiments show that IndoBERT achieves state-of-the-art performance over most of the tasks in IndoLEM.
