Paper Title
Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks
Paper Authors
Paper Abstract
Generative models have been widely applied to solve extractive tasks, where parts of the input are extracted to form the desired output, and have achieved significant success. For example, in extractive question answering (QA), generative models have consistently yielded state-of-the-art results. In this work, we identify an issue of tokenization inconsistency that is commonly neglected when training these models. This issue damages the extractive nature of these tasks when the input and output are tokenized inconsistently by the tokenizer, and thus leads to performance drops as well as hallucination. We propose a simple yet effective fix for this issue and conduct a case study on extractive QA. We show that, with consistent tokenization, the model performs better on both in-domain and out-of-domain datasets, with a notable average gain of +1.7 F2 when a BART model is trained on SQuAD and evaluated on eight QA datasets. Further, the model converges faster and becomes less likely to generate out-of-context answers. With these findings, we would like to call for more attention to how tokenization should be done when solving extractive tasks, and we recommend applying consistent tokenization during training.
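The inconsistency the abstract describes can be illustrated with a minimal sketch. The toy tokenizer below is hypothetical (not the paper's implementation); it only mimics one relevant behavior of byte-level BPE vocabularies such as BART's, where a word preceded by a space maps to a different token ("Ġword") than the same word at the start of a string ("word"). Tokenizing an extracted answer standalone therefore yields a token sequence that is no longer a span of the tokenized input, whereas tokenizing it as it appears in context keeps the task extractive at the token level:

```python
def toy_tokenize(text):
    # Toy space-aware tokenizer: each non-initial word keeps its leading
    # space, marked with "Ġ" as in GPT-2/BART byte-level BPE vocabularies.
    words = text.split(" ")
    return [words[0]] + ["Ġ" + w for w in words[1:]]

def is_token_span(sub, seq):
    # True if `sub` occurs as a contiguous sub-list of `seq`.
    n = len(sub)
    return any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))

context = "The model was trained on SQuAD"
answer = "trained on SQuAD"  # an extractive answer: a substring of the context

ctx_toks = toy_tokenize(context)            # [..., "Ġtrained", "Ġon", "ĠSQuAD"]

# Inconsistent: tokenizing the answer standalone drops its leading space,
# so its first token ("trained") differs from the in-context token ("Ġtrained")
# and the target is no longer a span of the input tokens.
ans_inconsistent = toy_tokenize(answer)     # ["trained", "Ġon", "ĠSQuAD"]

# Consistent: tokenize the answer as it appears in context (every word here
# follows a space), restoring the span property.
ans_consistent = ["Ġ" + w for w in answer.split(" ")]

print(is_token_span(ans_inconsistent, ctx_toks))  # False
print(is_token_span(ans_consistent, ctx_toks))    # True
```

Real byte-level BPE tokenizers expose the same distinction (e.g. via an option controlling whether a prefix space is assumed), which is why the training target should be tokenized with the same context-dependent form it has inside the input.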