Paper Title
Bilingual Text Extraction as Reading Comprehension
Paper Authors
Paper Abstract
In this paper, we propose a method to extract bilingual texts automatically from noisy parallel corpora by framing the problem as a token-level span prediction, such as SQuAD-style Reading Comprehension. To extract a span of the target document that is a translation of a given source sentence (span), we use either QANet or multilingual BERT. QANet can be trained for a specific parallel corpus from scratch, while multilingual BERT can utilize pre-trained multilingual representations. For the span prediction method using QANet, we introduce a total optimization method using integer linear programming to achieve consistency in the predicted parallel spans. We conduct a parallel sentence extraction experiment using simulated noisy parallel corpora with two language pairs (En-Fr and En-Ja) and find that the proposed method using QANet achieves significantly better accuracy than a baseline method using two bi-directional RNN encoders, particularly for distant language pairs (En-Ja). We also conduct a sentence alignment experiment using En-Ja newspaper articles and find that the proposed method using multilingual BERT achieves significantly better accuracy than a baseline method using a bilingual dictionary and dynamic programming.