Paper Title
TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task
Paper Authors
Paper Abstract
TACRED (Zhang et al., 2017) is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE). However, even with recent advances in unsupervised pre-training and knowledge-enhanced neural RE, models still show a high error rate. In this paper, we investigate the questions: Have we reached a performance ceiling, or is there still room for improvement? And how do crowd annotations, the dataset, and models contribute to this error rate? To answer these questions, we first validate the most challenging 5K examples in the development and test sets using trained annotators. We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled. On the relabeled test set, the average F1 score of a large set of baseline models improves from 62.1 to 70.1. After validation, we analyze misclassifications on the challenging instances, categorize them into linguistically motivated error groups, and verify the resulting error hypotheses on three state-of-the-art RE models. We show that two groups of ambiguous relations are responsible for most of the remaining errors and that models may adopt shallow heuristics on the dataset when entities are not masked.
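The abstract's closing observation concerns entity masking, where the subject and object mentions in a TACRED sentence are replaced by their NER-type placeholders so that a model cannot memorize specific entity strings. The snippet below is a minimal illustrative sketch of that preprocessing idea, not code from the paper; the placeholder format (e.g. [SUBJ-PERSON]) and the function name mask_entities are assumptions for illustration only.

```python
# Minimal sketch (assumed, not from the paper): TACRED-style entity masking.
# The subject and object token spans are collapsed into single type tokens,
# which removes the lexical cues that can support shallow heuristics.

from typing import List, Tuple


def mask_entities(tokens: List[str],
                  subj_span: Tuple[int, int], subj_type: str,
                  obj_span: Tuple[int, int], obj_type: str) -> List[str]:
    """Replace the subject/object spans (inclusive token indices) with
    placeholder tokens such as [SUBJ-PERSON] / [OBJ-CITY]."""
    masked = []
    i = 0
    while i < len(tokens):
        if i == subj_span[0]:
            masked.append(f"[SUBJ-{subj_type}]")
            i = subj_span[1] + 1
        elif i == obj_span[0]:
            masked.append(f"[OBJ-{obj_type}]")
            i = obj_span[1] + 1
        else:
            masked.append(tokens[i])
            i += 1
    return masked


# Hypothetical example sentence for the relation per:city_of_birth.
tokens = ["Barack", "Obama", "was", "born", "in", "Honolulu", "."]
print(mask_entities(tokens, (0, 1), "PERSON", (5, 5), "CITY"))
# -> ['[SUBJ-PERSON]', 'was', 'born', 'in', '[OBJ-CITY]', '.']
```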