Paper title
DoCoGen: Domain Counterfactual Generation for Low Resource Domain Adaptation
Authors
Abstract
Natural language processing (NLP) algorithms have become very successful, but they still struggle when applied to out-of-distribution examples. In this paper, we propose a controllable generation approach to address this domain adaptation (DA) challenge. Given an input text example, our DoCoGen algorithm generates a domain-counterfactual textual example (D-con) that is similar to the original in all aspects, including the task label, but whose domain is changed to a desired one. Importantly, DoCoGen is trained using only unlabeled examples from multiple domains; no NLP task labels or parallel pairs of textual examples and their domain-counterfactuals are required. We show that DoCoGen can generate coherent counterfactuals consisting of multiple sentences. We use the D-cons generated by DoCoGen to augment a sentiment classifier and a multi-label intent classifier in 20 and 78 DA setups, respectively, where source-domain labeled data is scarce. Our model outperforms strong baselines and improves the accuracy of a state-of-the-art unsupervised DA algorithm.
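The augmentation step described in the abstract can be illustrated with a minimal sketch. It is not the DoCoGen implementation: the `Example` record, the `augment_with_dcons` helper, and the `toy_generator` stand-in are all hypothetical names introduced here, and the stand-in merely tags the text with the target domain rather than rewriting it. The sketch only shows the data flow, in which each scarce labeled source example yields one counterfactual per other domain and keeps its original task label.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    """A labeled text example (hypothetical record type for this sketch)."""
    text: str
    label: str
    domain: str


# Assumed generator interface: (text, target_domain) -> counterfactual text.
Generator = Callable[[str, str], str]


def augment_with_dcons(
    source: List[Example],
    target_domains: List[str],
    generate_dcon: Generator,
) -> List[Example]:
    """Return the source examples plus one D-con per example and other domain.

    The task label is copied unchanged, reflecting the stated property that a
    D-con keeps the label of the original and changes only its domain.
    """
    augmented = list(source)
    for example in source:
        for domain in target_domains:
            if domain == example.domain:
                continue  # a counterfactual must move to a different domain
            augmented.append(
                Example(
                    text=generate_dcon(example.text, domain),
                    label=example.label,
                    domain=domain,
                )
            )
    return augmented


def toy_generator(text: str, target_domain: str) -> str:
    """Placeholder only: a trained generator would rewrite domain-specific terms."""
    return f"[{target_domain}] {text}"


if __name__ == "__main__":
    labeled = [Example("The crew was kind and helpful.", "positive", "airline")]
    for item in augment_with_dcons(labeled, ["airline", "kitchen"], toy_generator):
        print(item)
```

In practice `toy_generator` would be replaced by a trained generation model, and the augmented list would be passed to whatever classifier training routine is in use.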