Paper Title

How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?

Authors

Ali Araabi, Christof Monz, Vlad Niculae

Abstract

Neural Machine Translation (NMT) is an open vocabulary problem. As a result, dealing with words not occurring during training (a.k.a. out-of-vocabulary (OOV) words) has long been a fundamental challenge for NMT systems. The predominant method to tackle this problem is Byte Pair Encoding (BPE), which splits words, including OOV words, into sub-word segments. BPE has achieved impressive results for a wide range of translation tasks in terms of automatic evaluation metrics. While it is often assumed that by using BPE, NMT systems are capable of handling OOV words, the effectiveness of BPE in translating OOV words has not been explicitly measured. In this paper, we study to what extent BPE is successful in translating OOV words at the word level. We analyze the translation quality of OOV words based on word type, number of segments, cross-attention weights, and the frequency of segment n-grams in the training data. Our experiments show that while careful BPE settings seem to be fairly useful in translating OOV words across datasets, a considerable percentage of OOV words are translated incorrectly. Furthermore, we highlight the slightly higher effectiveness of BPE in translating OOV words for special cases, such as named entities and when the languages involved are linguistically close to each other.
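
The abstract describes BPE as splitting words, including OOV words, into sub-word segments. The sketch below illustrates the basic idea: merge operations are learned from training-word frequencies and then applied to an unseen (OOV) word. The toy corpus, merge count, and function names (learn_bpe, segment) are illustrative assumptions, not the paper's actual pipeline, which would typically use a library such as subword-nmt or SentencePiece.

```python
# Minimal sketch of Byte Pair Encoding (BPE) subword segmentation.
# Toy data and function names are illustrative assumptions only.
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merge operations from a word-frequency dictionary."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the most frequent pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Apply learned merges, in learned order, to a (possibly OOV) word."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

if __name__ == "__main__":
    # Toy training counts; "lowest" is treated as an unseen (OOV) word below.
    train_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges = learn_bpe(train_counts, num_merges=10)
    print(segment("lowest", merges))  # e.g. ['low', 'est</w>'], depending on merges
```

Even though "lowest" never appears in the toy training data, it is decomposed into segments seen during training, which is the mechanism the paper evaluates for OOV translation quality.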
