Paper Title

Language model acceptability judgements are not always robust to context

Authors

Koustuv Sinha, Jon Gauthier, Aaron Mueller, Kanishka Misra, Keren Fuentes, Roger Levy, Adina Williams

Abstract

Targeted syntactic evaluations of language models ask whether models show stable preferences for syntactically acceptable content over minimal-pair unacceptable inputs. Most targeted syntactic evaluation datasets ask models to make these judgements with just a single context-free sentence as input. This does not match language models' training regime, in which input sentences are always highly contextualized by the surrounding corpus. This mismatch raises an important question: how robust are models' syntactic judgements in different contexts? In this paper, we investigate the stability of language models' performance on targeted syntactic evaluations as we vary properties of the input context: the length of the context, the types of syntactic phenomena it contains, and whether or not there are violations of grammaticality. We find that model judgements are generally robust when placed in randomly sampled linguistic contexts. However, they are substantially unstable for contexts containing syntactic structures matching those in the critical test content. Among all tested models (GPT-2 and five variants of OPT), we significantly improve models' judgements by providing contexts with matching syntactic structures, and conversely significantly worsen them using unacceptable contexts with matching but violated syntactic structures. This effect is amplified by the length of the context, except for unrelated inputs. We show that these changes in model performance are not explainable by simple features matching the context and the test inputs, such as lexical overlap and dependency overlap. This sensitivity to highly specific syntactic features of the context can only be explained by the models' implicit in-context learning abilities.
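To make the evaluation setup concrete, below is a minimal sketch (not the authors' released code) of a targeted syntactic evaluation with an optional prepended context, using the Hugging Face transformers GPT-2 checkpoint. The minimal pair and the context sentence are illustrative assumptions rather than items from the paper's datasets; the model is said to prefer the acceptable sentence if it assigns it a higher log-probability than its unacceptable counterpart.

```python
# Minimal sketch of a targeted syntactic evaluation with an optional
# prepended context. Assumes the "gpt2" checkpoint and illustrative example
# sentences; not the paper's official evaluation pipeline.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str, context: str = "") -> float:
    """Total log-probability of `sentence`, optionally conditioned on a prepended `context`."""
    ctx_ids = tokenizer.encode(context) if context else []
    sent_ids = tokenizer.encode((" " if context else "") + sentence)
    input_ids = torch.tensor([ctx_ids + sent_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    # logits[0, i] predicts token i + 1, so align predictions with the next tokens.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    token_scores = log_probs[torch.arange(targets.size(0)), targets]
    # Keep only the scores of the test-sentence tokens (skip the context tokens).
    start = max(len(ctx_ids) - 1, 0)
    return token_scores[start:].sum().item()

# Illustrative minimal pair from a subject-verb agreement paradigm.
acceptable = "The keys to the cabinet are on the table."
unacceptable = "The keys to the cabinet is on the table."

# Judge the pair with no context, then with an acceptable context sentence
# containing a matching agreement structure prepended.
for context in ["", "The authors of the book were praised by the critics."]:
    good = sentence_logprob(acceptable, context)
    bad = sentence_logprob(unacceptable, context)
    label = "with context" if context else "no context"
    print(f"{label}: prefers acceptable sentence -> {good > bad}")
```

Only the test-sentence tokens are scored, so the comparison between the two members of a minimal pair is not biased by the length or probability of the prepended context itself.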
