Paper Title

Uncontrolled Lexical Exposure Leads to Overestimation of Compositional Generalization in Pretrained Models

Authors

Najoung Kim, Tal Linzen, Paul Smolensky

Abstract

Human linguistic capacity is often characterized by compositionality and the generalization it enables -- human learners can produce and comprehend novel complex expressions by composing known parts. Several benchmarks exploit distributional control across training and test to gauge compositional generalization, where certain lexical items only occur in limited contexts during training. While recent work using these benchmarks suggests that pretrained models achieve impressive generalization performance, we argue that exposure to pretraining data may break the aforementioned distributional control. Using the COGS benchmark of Kim and Linzen (2020), we test two modified evaluation setups that control for this issue: (1) substituting context-controlled lexical items with novel character sequences, and (2) substituting them with special tokens represented by novel embeddings. We find that both of these setups lead to lower generalization performance in T5 (Raffel et al., 2020), suggesting that previously reported results have been overestimated due to uncontrolled lexical exposure during pretraining. The performance degradation is more extreme with novel embeddings, and the degradation increases with the amount of pretraining data, highlighting an interesting case of inverse scaling.
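
For concreteness, below is a minimal sketch of the two substitution setups, assuming the HuggingFace transformers API and an off-the-shelf T5 checkpoint standing in for one fine-tuned on COGS; the helper replace_lexical_item, the nonce form "wuggax", and the special token "<nov_noun_0>" are hypothetical placeholders for illustration, not the authors' actual implementation.

```python
# Sketch of the two modified evaluation setups (not the authors' released
# code). Assumes a T5 checkpoint fine-tuned on COGS; the helper name, the
# nonce string "wuggax", and the token "<nov_noun_0>" are hypothetical.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")


def replace_lexical_item(sentence: str, old: str, new: str) -> str:
    """Swap a context-controlled lexical item for a replacement form."""
    return sentence.replace(old, new)


example = "A hedgehog ate the cake ."

# Setup (1): a novel character sequence, unlikely to have been seen
# during pretraining, stands in for the controlled lexical item.
char_seq_input = replace_lexical_item(example, "hedgehog", "wuggax")

# Setup (2): a dedicated special token whose embedding row is freshly
# initialized, so no pretrained lexical knowledge can leak in.
tokenizer.add_special_tokens({"additional_special_tokens": ["<nov_noun_0>"]})
model.resize_token_embeddings(len(tokenizer))  # adds a new embedding row
special_tok_input = replace_lexical_item(example, "hedgehog", "<nov_noun_0>")

for text in (char_seq_input, special_tok_input):
    batch = tokenizer(text, return_tensors="pt")
    pred = model.generate(**batch, max_new_tokens=128)
    print(tokenizer.decode(pred[0], skip_special_tokens=True))
```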
