Title

Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer

Authors

Kateřina Macková, Milan Straka

Abstract

Reading comprehension is a well-studied task, with huge training datasets available in English. This work focuses on building reading comprehension systems for Czech without requiring any manually annotated Czech training data. First, we automatically translated the SQuAD 1.1 and SQuAD 2.0 datasets to Czech to create training and development data, which we release at http://hdl.handle.net/11234/1-3249. We then trained and evaluated several BERT and XLM-RoBERTa baseline models. However, our main focus lies in cross-lingual transfer models. We report that an XLM-RoBERTa model trained on English data and evaluated on Czech achieves very competitive performance, only approximately 2 percentage points worse than a model trained on the translated Czech data. This result is remarkable, given that the model saw no Czech data during training. The cross-lingual transfer approach is very flexible and provides reading comprehension in any language for which enough monolingual raw text is available.
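The zero-shot cross-lingual transfer setup the abstract describes can be illustrated with a minimal sketch: an XLM-RoBERTa model fine-tuned only on English SQuAD-style data is applied directly to a Czech question and context. This is not the authors' released code, and the checkpoint name is an assumption (any XLM-RoBERTa model fine-tuned on English SQuAD would serve the same role).

```python
# Minimal sketch of zero-shot cross-lingual transfer: an XLM-RoBERTa model
# fine-tuned only on English SQuAD data answers a Czech question directly.
# The checkpoint name is an assumption, not the authors' released model.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="deepset/xlm-roberta-base-squad2",  # assumed public SQuAD checkpoint
)

# Czech context and question; the model saw no Czech QA pairs in training.
context = (
    "Praha je hlavní město České republiky a žije v ní "
    "přibližně 1,3 milionu obyvatel."
)
result = qa(question="Kolik obyvatel žije v Praze?", context=context)
print(result["answer"], result["score"])  # predicted answer span and confidence
```

Because XLM-RoBERTa was pretrained on monolingual raw text in many languages (including Czech), the QA head fine-tuned on English transfers to Czech inputs without any Czech supervision, which is what makes the approach applicable to any language with sufficient raw text.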
