Paper Title

RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use

Paper Authors

Pieter Delobelle, Thomas Winters, Bettina Berendt

Paper Abstract

Large transformer-based language models, e.g. BERT and GPT-3, outperform previous architectures on most natural language processing tasks. Such language models are first pre-trained on gigantic corpora of text and later used as a base model for fine-tuning on a particular task. Since the pre-training step is usually not repeated, base models are not up-to-date with the latest information. In this paper, we update RobBERT, a RoBERTa-based state-of-the-art Dutch language model, which was trained in 2019. First, the tokenizer of RobBERT is updated to include new high-frequency tokens present in the latest Dutch OSCAR corpus, e.g. corona-related words. Then we further pre-train the RobBERT model using this dataset. To evaluate if our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens. We found that for certain language tasks this update results in a significant performance increase. These results highlight the benefit of continually updating a language model to account for evolving language use.
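The abstract describes a two-step update: extend the tokenizer with new high-frequency tokens from a recent corpus, then continue pre-training so the new embeddings align with the existing vocabulary. The snippet below is a minimal sketch of that idea using the Hugging Face transformers API; it is not the authors' released code (the paper's actual procedure may differ, e.g. in how the tokenizer is rebuilt), and the example token list and checkpoint name are assumptions for illustration.

```python
# Minimal sketch: extend an existing tokenizer with new high-frequency tokens,
# resize the embedding matrix, and prepare the model for further masked-LM
# pre-training. Not the authors' exact procedure.
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Start from the original RobBERT checkpoint (assumed model id).
checkpoint = "pdelobelle/robbert-v2-dutch-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical list of new high-frequency tokens mined from a recent OSCAR dump,
# e.g. corona-related vocabulary.
new_tokens = ["coronamaatregelen", "mondkapje", "lockdown"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding table so the new token ids get (randomly initialised) vectors;
# further pre-training on the new corpus then aligns them with the rest of the vocabulary.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```

After this step, the model would be further pre-trained on the new corpus with the standard masked-language-modelling objective before being evaluated as a plug-in replacement.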
