Paper Title
Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies
Paper Authors
Paper Abstract
When we transfer a pretrained language model to a new language, many axes of variation change at once. To disentangle the impact of different factors like syntactic similarity and vocabulary similarity, we propose a set of controlled transfer studies: we systematically transform the language of the GLUE benchmark, altering one axis of crosslingual variation at a time, and then measure the resulting drops in a pretrained model's downstream performance. We find that models can largely recover from syntactic-style shifts, but cannot recover from vocabulary misalignment and embedding matrix re-initialization, even with continued pretraining on 15 million tokens. Moreover, good-quality tokenizers in the transfer language do not make vocabulary alignment easier. Our experiments provide insights into the factors of cross-lingual transfer that researchers should most focus on when designing language transfer scenarios.
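To make two of the perturbation axes concrete, here is a minimal sketch (not the authors' released code) in PyTorch with Hugging Face Transformers: simulating vocabulary misalignment by mapping every token ID through a fixed random permutation, and re-initializing the embedding matrix before continued pretraining. The choice of `roberta-base`, the random seed, and the `permute_ids` helper are illustrative assumptions.

```python
# Illustrative sketch of two controlled perturbations described in the
# abstract: (1) vocabulary misalignment via a fixed random permutation of
# token IDs, (2) embedding matrix re-initialization. Assumes a RoBERTa-style
# model; not the paper's actual implementation.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
vocab_size = model.config.vocab_size

# (1) Vocabulary misalignment: a fixed permutation of token IDs. Each token
# keeps its distributional role in the corpus but no longer lines up with
# its pretrained embedding row. (A careful setup would exempt special
# tokens such as <s>, </s>, and <pad> from the permutation.)
g = torch.Generator().manual_seed(0)
perm = torch.randperm(vocab_size, generator=g)

def permute_ids(input_ids: torch.Tensor) -> torch.Tensor:
    """Map each token ID in the batch through the fixed permutation."""
    return perm[input_ids]

# (2) Embedding re-initialization: overwrite the pretrained input embeddings
# with a freshly initialized matrix of the same shape, so only the
# transformer body's knowledge survives the transfer.
emb = model.get_input_embeddings()
torch.nn.init.normal_(emb.weight.data, mean=0.0,
                      std=model.config.initializer_range)

# Example: perturb a batch before continued pretraining or fine-tuning.
batch = tokenizer("Transfer learning is hard.", return_tensors="pt")
batch["input_ids"] = permute_ids(batch["input_ids"])
```

Applying either perturbation and then continuing pretraining (the abstract mentions 15 million tokens) gives a controlled measure of how much of the downstream GLUE drop each axis accounts for.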