Paper Title

Machine Translation of Mathematical Text

Authors

Aditya Ohri, Tanya Schmah

Abstract

We have implemented a machine translation system, the PolyMath Translator, for LaTeX documents containing mathematical text. The current implementation translates English LaTeX to French LaTeX, attaining a BLEU score of 53.5 on a held-out test corpus of mathematical sentences. It produces LaTeX documents that can be compiled to PDF without further editing. The system first converts the body of an input LaTeX document into English sentences containing math tokens, using the pandoc universal document converter to parse LaTeX input. We have trained a Transformer-based translator model, using OpenNMT, on a combined corpus containing a small proportion of domain-specific sentences. Our full system uses both this Transformer model and Google Translate, the latter being used as a backup to better handle linguistic features that do not appear in our training dataset. If the Transformer model does not have confidence in its translation, as determined by a high perplexity score, then we use Google Translate with a custom glossary. This backup was used 26% of the time on our test corpus of mathematical sentences. The PolyMath Translator is available as a web service at www.polymathtrans.ai.
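The fallback rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the placeholder format (`MATH0`, `MATH1`, ...), the threshold value, and the `primary` and `backup` callables are all hypothetical stand-ins for the Transformer model and for Google Translate with a custom glossary. Perplexity is taken here as the exponential of the negative mean token log-probability, which is one common definition; the abstract does not say how the score is computed.

```python
import math
import re
from typing import Callable, List, Tuple

# Hypothetical threshold: the abstract does not state the value actually used.
PERPLEXITY_THRESHOLD = 4.0

# Assumption: only inline $...$ math is handled in this sketch.
MATH_PATTERN = re.compile(r"\$[^$]+\$")


def mask_math(sentence: str) -> Tuple[str, List[str]]:
    """Replace each inline $...$ span with a placeholder token MATH0, MATH1, ..."""
    spans: List[str] = []

    def _substitute(match: "re.Match[str]") -> str:
        spans.append(match.group(0))
        return f"MATH{len(spans) - 1}"

    return MATH_PATTERN.sub(_substitute, sentence), spans


def unmask_math(sentence: str, spans: List[str]) -> str:
    """Put the original LaTeX back in place of the placeholder tokens."""

    def _restore(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        # Leave unknown placeholders untouched rather than failing.
        return spans[index] if index < len(spans) else match.group(0)

    return re.sub(r"MATH(\d+)", _restore, sentence)


def perplexity(token_log_probs: List[float]) -> float:
    """Exponential of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def translate(
    sentence: str,
    primary: Callable[[str], Tuple[str, List[float]]],  # stand-in for the Transformer
    backup: Callable[[str], str],  # stand-in for the glossary-backed service
    threshold: float = PERPLEXITY_THRESHOLD,
) -> Tuple[str, bool]:
    """Translate one sentence; return (translation, whether the backup was used)."""
    masked, spans = mask_math(sentence)
    text, log_probs = primary(masked)
    used_backup = perplexity(log_probs) > threshold
    if used_backup:
        # High perplexity is read as low confidence, so defer to the backup.
        text = backup(masked)
    return unmask_math(text, spans), used_backup
```

Masking the math before translation keeps the LaTeX out of reach of either translator, so the formulas are restored byte for byte afterwards. Averaging `used_backup` over a test corpus would give the kind of fallback rate the abstract reports.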
