Paper Title
Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data
Paper Authors
Paper Abstract
A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks---datasets collected from crowdworkers to create an evaluation task---while still failing on out-of-domain examples for the same task. Recent work has explored the use of counterfactually-augmented data---data built by minimally editing a set of seed examples to yield counterfactual labels---to augment training data associated with these benchmarks and build more robust classifiers that generalize better. However, Khashabi et al. (2020) find that this type of augmentation yields little benefit on reading comprehension tasks when controlling for dataset size and cost of collection. We build upon this work by using English natural language inference data to test model generalization and robustness and find that models trained on a counterfactually-augmented SNLI dataset do not generalize better than unaugmented datasets of similar size and that counterfactual augmentation can hurt performance, yielding models that are less robust to challenge examples. Counterfactual augmentation of natural language understanding data through standard crowdsourcing techniques does not appear to be an effective way of collecting training data and further innovation is required to make this general line of work viable.