Paper Title
Lite Training Strategies for Portuguese-English and English-Portuguese Translation
Paper Authors
Paper Abstract
Despite the widespread adoption of deep learning for machine translation, it is still expensive to develop high-quality translation models. In this work, we investigate the use of pre-trained models, such as T5, for Portuguese-English and English-Portuguese translation tasks using low-cost hardware. We explore the use of Portuguese and English pre-trained language models and propose an adaptation of the English tokenizer to represent Portuguese characters, such as the diaeresis and the acute and grave accents. We compare our models to the Google Translate API and MarianMT on a subset of the ParaCrawl dataset, as well as to the winning submission to the WMT19 Biomedical Translation Shared Task. We also describe our submission to the WMT20 Biomedical Translation Shared Task. Our results show that our models achieve performance competitive with state-of-the-art models while being trained on modest hardware (a single 8GB gaming GPU for nine days). Our data, models, and code are available at https://github.com/unicamp-dl/Lite-T5-Translation.
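The tokenizer adaptation mentioned in the abstract can be illustrated with a minimal sketch. The toy vocabulary and the `extend_vocab` helper below are hypothetical stand-ins for illustration only; the paper's actual implementation adapts the T5 SentencePiece tokenizer.

```python
# Minimal sketch: extend an English-centric vocabulary with the accented
# characters needed to represent Portuguese. Hypothetical example, not the
# authors' actual T5/SentencePiece code.

# Diacritics used in Portuguese that an English-only vocabulary may lack:
# acute (á, é, í, ó, ú), grave (à), circumflex (â, ê, ô),
# tilde (ã, õ), cedilla (ç), and diaeresis (ü).
PORTUGUESE_CHARS = "áéíóúàâêôãõçü"

def extend_vocab(vocab: dict, chars: str) -> dict:
    """Return a copy of `vocab` with any missing characters added as new tokens."""
    extended = dict(vocab)
    for ch in chars:
        if ch not in extended:
            extended[ch] = len(extended)  # assign the next free token id
    return extended

english_vocab = {"the": 0, "of": 1, "a": 2}  # toy stand-in for a real vocabulary
pt_vocab = extend_vocab(english_vocab, PORTUGUESE_CHARS)
print(len(pt_vocab))  # 3 original tokens + 13 accented characters = 16
```

In a real setting the same idea is applied to the pre-trained tokenizer's vocabulary, and the model's embedding matrix is resized to accommodate the new entries.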