Paper Title
On the Role of Parallel Data in Cross-lingual Transfer Learning
Paper Authors
Paper Abstract
While prior work has established that the use of parallel data is conducive to cross-lingual learning, it is unclear whether the improvements come from the data itself or from the modeling of parallel interactions. To explore this, we examine the use of unsupervised machine translation to generate synthetic parallel data, and compare it to supervised machine translation and gold parallel data. We find that even model-generated parallel data can be useful for downstream tasks, in both a general setting (continued pretraining) and a task-specific setting (translate-train), although our best results are still obtained using real parallel data. Our findings suggest that existing multilingual models do not exploit the full potential of monolingual data, and prompt the community to reconsider the traditional categorization of cross-lingual learning approaches.
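As an illustration only, and not the paper's actual pipeline, the following is a minimal sketch of the translate-train setting mentioned in the abstract: task training data in the source language is machine-translated into the target language before fine-tuning. The MT model (Helsinki-NLP/opus-mt-en-de) and the toy NLI example are placeholder assumptions; the paper instead compares unsupervised MT, supervised MT, and gold parallel data as translation sources.

```python
# Sketch of translate-train: translate the task training set into the target
# language, then fine-tune the downstream model on the translated data.
# All names below are illustrative assumptions, not the authors' setup.
from transformers import pipeline

# Off-the-shelf English-to-German MT model used here purely as a stand-in
# for whichever translation system (unsupervised, supervised, or gold data)
# is providing the parallel signal.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

# Toy English training example; in practice this would be the full task
# training set, e.g. an NLI corpus.
english_train = [
    {"premise": "A man is playing a guitar.",
     "hypothesis": "A person is making music.",
     "label": "entailment"},
]

# Translate every text field so the downstream classifier can be fine-tuned
# on (synthetic) target-language data rather than relying on zero-shot transfer.
translated_train = [
    {"premise": translator(ex["premise"])[0]["translation_text"],
     "hypothesis": translator(ex["hypothesis"])[0]["translation_text"],
     "label": ex["label"]}
    for ex in english_train
]

print(translated_train)
```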