DoCoGen: Domain Counterfactual Generation for Low Resource Domain Adaptation

Paper title

DoCoGen: Domain Counterfactual Generation for Low Resource Domain Adaptation

Authors

Nitay Calderon, Eyal Ben-David, Amir Feder, Roi Reichart

Abstract

Natural language processing (NLP) algorithms have become very successful, but they still struggle when applied to out-of-distribution examples. In this paper we propose a controllable generation approach in order to deal with this domain adaptation (DA) challenge. Given an input text example, our DoCoGen algorithm generates a domain-counterfactual textual example (D-con) - that is similar to the original in all aspects, including the task label, but its domain is changed to a desired one. Importantly, DoCoGen is trained using only unlabeled examples from multiple domains - no NLP task labels or parallel pairs of textual examples and their domain-counterfactuals are required. We show that DoCoGen can generate coherent counterfactuals consisting of multiple sentences. We use the D-cons generated by DoCoGen to augment a sentiment classifier and a multi-label intent classifier in 20 and 78 DA setups, respectively, where source-domain labeled data is scarce. Our model outperforms strong baselines and improves the accuracy of a state-of-the-art unsupervised DA algorithm.
