Precise Task Formalization Matters in Winograd Schema Evaluations

Paper Title

Precise Task Formalization Matters in Winograd Schema Evaluations

Authors

Haokun Liu, William Huang, Dhara A. Mungra, Samuel R. Bowman

Abstract

Performance on the Winograd Schema Challenge (WSC), a respected English commonsense reasoning benchmark, recently rocketed from chance accuracy to 89% on the SuperGLUE leaderboard, with relatively little corroborating evidence of a correspondingly large improvement in reasoning ability. We hypothesize that much of this improvement comes from recent changes in task formalization---the combination of input specification, loss function, and reuse of pretrained parameters---by users of the dataset, rather than improvements in the pretrained model's reasoning ability. We perform an ablation on two Winograd Schema datasets that interpolates between the formalizations used before and after this surge, and find (i) framing the task as multiple choice improves performance by 2-6 points and (ii) several additional techniques, including the reuse of a pretrained language modeling head, can mitigate the model's extreme sensitivity to hyperparameters. We urge future benchmark creators to impose additional structure to minimize the impact of formalization decisions on reported results.
