Paper Title
Sequence-Level Mixed Sample Data Augmentation
Paper Authors
Paper Abstract
Despite their empirical success, neural networks still have difficulty capturing compositional aspects of natural language. This work proposes a simple data augmentation approach to encourage compositional behavior in neural models for sequence-to-sequence problems. Our approach, SeqMix, creates new synthetic examples by softly combining input/output sequences from the training set. We connect this approach to existing techniques such as SwitchOut and word dropout, and show that these techniques are all approximating variants of a single objective. SeqMix consistently yields approximately 1.0 BLEU improvement on five different translation datasets over strong Transformer baselines. On tasks that require strong compositional generalization such as SCAN and semantic parsing, SeqMix also offers further improvements.
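The core idea of "softly combining input/output sequences" can be illustrated with a minimal mixup-style sketch: two token-id sequences are converted to one-hot distributions over the vocabulary and interpolated with a coefficient λ. This is only an illustrative sketch under simplifying assumptions (equal-length sequences, a toy vocabulary, and the hypothetical helpers `one_hot` and `seqmix_pair`), not the paper's actual implementation.

```python
import numpy as np

def one_hot(seq, vocab_size):
    """Convert a list of token ids to a (len, vocab_size) one-hot matrix."""
    m = np.zeros((len(seq), vocab_size))
    m[np.arange(len(seq)), seq] = 1.0
    return m

def seqmix_pair(seq_a, seq_b, vocab_size, lam):
    """Softly combine two equal-length token-id sequences into one
    sequence of mixed token distributions (mixup-style interpolation).
    `seqmix_pair` is a hypothetical name for illustration only."""
    return lam * one_hot(seq_a, vocab_size) + (1 - lam) * one_hot(seq_b, vocab_size)

# Toy usage: two length-3 sequences over a 5-token vocabulary.
# In mixup-style methods, lam is typically sampled from a Beta distribution,
# e.g. lam = np.random.beta(0.5, 0.5); a fixed value is used here for clarity.
lam = 0.3
mixed = seqmix_pair([1, 2, 3], [4, 0, 3], vocab_size=5, lam=lam)
print(mixed.shape)   # each of the 3 rows is a distribution over 5 tokens
```

Each row of `mixed` sums to 1, so the result can serve as a soft target (or soft input embedding weights) for a sequence-to-sequence model in place of a hard token sequence.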