Paper Title
KR-BERT: A Small-Scale Korean-Specific Language Model
Paper Authors
Paper Abstract
Since the appearance of BERT, recent works including XLNet and RoBERTa have utilized sentence embedding models pre-trained on large corpora with very large numbers of parameters. Because such models require large-scale hardware and huge amounts of data, they take a long time to pre-train. It is therefore important to attempt to build smaller models that perform comparably. In this paper, we trained a Korean-specific model, KR-BERT, using a smaller vocabulary and dataset. Since Korean is a morphologically rich, low-resource language written in a non-Latin alphabet, it is also important to capture language-specific linguistic phenomena that the Multilingual BERT model misses. We tested several tokenizers, including our BidirectionalWordPiece Tokenizer, and adjusted the minimal span of tokens for tokenization, ranging from the sub-character level to the character level, to construct a better vocabulary for our model. With those adjustments, our KR-BERT model performed comparably to, and in some cases better than, other existing pre-trained models, while using a corpus about 1/10 the size.
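The difference between character-level and sub-character-level units can be illustrated with a minimal Python sketch (not the authors' tokenizer): Unicode NFD normalization decomposes each precomposed Hangul syllable into its conjoining jamo, which is one way to obtain sub-character units.

# Minimal sketch, assuming NFD decomposition as a stand-in for sub-character tokenization.
import unicodedata

def char_tokens(text: str) -> list[str]:
    # Character level: one token per Hangul syllable.
    return list(text)

def subchar_tokens(text: str) -> list[str]:
    # Sub-character level: each syllable decomposed into its jamo.
    return list(unicodedata.normalize("NFD", text))

word = "한국어"  # "Korean (language)"
print(char_tokens(word))     # ['한', '국', '어']
print(subchar_tokens(word))  # ['ᄒ', 'ᅡ', 'ᆫ', 'ᄀ', 'ᅮ', 'ᆨ', 'ᄋ', 'ᅥ']

A WordPiece-style vocabulary can then be built over either level of units; choosing the minimal span affects vocabulary size and how well rare word forms are covered.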