Paper Title

Synthetic Pre-Training Tasks for Neural Machine Translation

Authors

Zexue He, Graeme Blackwood, Rameswar Panda, Julian McAuley, Rogerio Feris

Abstract

Pre-training models with large crawled corpora can lead to issues such as toxicity and bias, as well as copyright and privacy concerns. A promising way of alleviating such concerns is to conduct pre-training with synthetic tasks and data, since no real-world information is ingested by the model. Our goal in this paper is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources, particularly in the context of neural machine translation. We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge, including: 1) generating obfuscated data from a large parallel corpus, 2) concatenating phrase pairs extracted from a small word-aligned corpus, and 3) generating synthetic parallel data without real human language corpora. Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data. We hope the findings from our comprehensive empirical analysis will shed light on what matters for NMT pre-training, as well as pave the way for the development of more efficient and less toxic models.
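To make the third idea concrete, the sketch below illustrates one way parallel data could be generated without any real human language: random "source" sentences are drawn from a synthetic vocabulary, and "target" sentences are produced by a fixed token mapping plus local reordering, so a sequence-to-sequence model can still learn alignment and reordering structure during pre-training. This is a minimal, hypothetical illustration rather than the authors' actual procedure; all names and parameters (VOCAB_SIZE, token_map, window, sample_source, transform) are illustrative assumptions.

```python
import random

# Illustrative sketch (NOT the paper's exact method): build synthetic parallel
# data with no real human language. Source sentences are random sequences over
# a synthetic vocabulary; target sentences are a deterministic transformation
# (token substitution + local reordering), preserving learnable alignment structure.

VOCAB_SIZE = 1000  # size of the synthetic vocabulary (arbitrary illustrative choice)
random.seed(0)

# Fixed one-to-one "translation" mapping between synthetic source and target tokens.
token_map = {f"s{i}": f"t{i}" for i in range(VOCAB_SIZE)}

def sample_source(min_len=5, max_len=20):
    """Sample a random synthetic 'source sentence'."""
    length = random.randint(min_len, max_len)
    return [f"s{random.randrange(VOCAB_SIZE)}" for _ in range(length)]

def transform(source, window=3):
    """Map tokens to the target vocabulary and reorder locally within fixed windows."""
    mapped = [token_map[tok] for tok in source]
    target = []
    for i in range(0, len(mapped), window):
        chunk = mapped[i:i + window]
        random.shuffle(chunk)  # local permutation mimics word-order divergence
        target.extend(chunk)
    return target

def make_synthetic_pair():
    """Return one (source, target) synthetic sentence pair as whitespace-joined strings."""
    src = sample_source()
    tgt = transform(src)
    return " ".join(src), " ".join(tgt)

if __name__ == "__main__":
    for _ in range(3):
        src, tgt = make_synthetic_pair()
        print(src, "|||", tgt)
```

A pre-training corpus built this way contains no real-world text at all, which is what makes the toxicity, bias, copyright, and privacy concerns raised in the abstract moot for the pre-training stage.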
