Paper Title

KnowDA: All-in-One Knowledge Mixture Model for Data Augmentation in Low-Resource NLP

Authors

Yufei Wang, Jiayi Zheng, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, Daxin Jiang

Abstract

This paper focuses on data augmentation for low-resource NLP tasks where the training set is limited. Existing solutions either leverage task-independent heuristic rules (e.g., Synonym Replacement) or fine-tune general-purpose pre-trained language models (e.g., GPT2) using the limited training instances to produce new synthetic data. Consequently, they have trivial task-specific knowledge and are limited to yielding low-quality synthetic data. To combat this issue, we propose the Knowledge Mixture Data Augmentation Model (KnowDA), a Seq2Seq language model pre-trained on a mixture of diverse NLP tasks under a novel framework of Knowledge Mixture Training (KoMT). The goal of KoMT is to condense diverse NLP task-specific knowledge into the single KnowDA model (i.e., all-in-one) so that KnowDA can utilize this knowledge to quickly grasp the inherent synthesis law of the target task from limited training instances. Specifically, KoMT reformulates input examples from various heterogeneous NLP tasks into a unified text-to-text format, and employs denoising training objectives at different granularities to learn to reconstruct partial or complete samples. To the best of our knowledge, this is the first attempt to apply 100+ NLP multi-task training to data augmentation. Extensive experiments show that i) the synthetic data produced by KnowDA improves the performance of strong pre-trained language models (i.e., Bert, ALBert and Deberta) by a large margin on the low-resource NLP benchmarks FewGLUE, CoNLL'03, and WikiAnn; ii) KnowDA successfully transfers task knowledge to NLP tasks whose types are both seen and unseen in KoMT.
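The abstract's description of KoMT (reformulating heterogeneous task examples into a unified text-to-text format, then applying denoising objectives at different granularities) can be made concrete with a small sketch. The serialization format, the function names serialize_example and mask_spans, and the 0.3 mask ratio below are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: the exact serialization and masking scheme of KoMT
# is defined in the paper body; the names and ratios here are assumptions.
import random

def serialize_example(task_name, fields, label):
    """Flatten a labeled example from an arbitrary task into one text-to-text string."""
    parts = [f"task: {task_name}"]
    parts += [f"{key}: {value}" for key, value in fields.items()]
    parts.append(f"label: {label}")
    return " | ".join(parts)

def mask_spans(text, mask_ratio=0.3, seed=0):
    """Coarse-grained denoising: hide one contiguous span of the serialized example
    and keep it as the reconstruction target; finer or coarser variants (single
    tokens, whole examples) follow the same pattern."""
    rng = random.Random(seed)
    words = text.split()
    span_len = max(1, int(len(words) * mask_ratio))
    start = rng.randrange(0, len(words) - span_len + 1)
    corrupted = words[:start] + ["<mask>"] + words[start + span_len:]
    target = words[start:start + span_len]
    return " ".join(corrupted), " ".join(target)

# A sentiment-classification instance becomes one Seq2Seq training pair.
source = serialize_example("sst2", {"sentence": "the movie was surprisingly good"}, "positive")
model_input, model_target = mask_spans(source)
print(model_input)   # serialized text with a <mask> placeholder
print(model_target)  # the span the Seq2Seq model must reconstruct
```

In this toy example, a classification instance is turned into a corrupted/target pair that a Seq2Seq model could be trained to reconstruct, which is the general shape of the multi-task denoising pre-training the abstract describes.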
