Paper Title
Fast Vocabulary Projection Method via Clustering for Multilingual Machine Translation on GPU
Paper Authors
Paper Abstract
Multilingual Neural Machine Translation has shown great success with transformer models. Deploying these models is challenging because they usually require large vocabulary (vocab) sizes to cover various languages, which limits the speed of predicting output tokens in the last vocab projection layer. To alleviate these challenges, this paper proposes a fast vocabulary projection method via clustering that can be used for multilingual transformers on GPUs. First, we split the vocab search space offline into disjoint clusters based on the hidden context vectors of the decoder output, which yields much smaller vocab columns for the vocab projection. Second, at inference time, the proposed method predicts the cluster and the candidate active tokens for each hidden context vector at the vocab projection. This paper also includes an analysis of different ways of building these clusters in multilingual settings. Our results show end-to-end speed gains of up to 25% in float16 GPU inference while maintaining the BLEU score and only slightly increasing memory cost. The proposed method speeds up the vocab projection step itself by up to 2.6x. We also conduct an extensive human evaluation to verify that the proposed method preserves the quality of the translations produced by the original model.
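The two-step scheme described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the paper analyzes several ways of building the clusters from hidden context vectors, whereas this toy version simply clusters the rows of the output projection matrix `W` with plain k-means; all function names (`build_clusters`, `fast_projection`) are illustrative.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain k-means over the rows of X, with farthest-point initialization."""
    centroids = [X[0].astype(float)]
    for _ in range(k - 1):
        # Next centroid: the row farthest from all chosen centroids.
        d = np.min([((X - c) ** 2).sum(1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()].astype(float))
    centroids = np.array(centroids)
    for _ in range(iters):
        # Assign every row to its nearest centroid, then recompute centroids.
        assign = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = X[assign == c].mean(0)
    return centroids, assign

def build_clusters(W, k):
    """Offline step: partition the vocab (rows of the projection matrix W)
    into k disjoint clusters of token ids."""
    centroids, assign = kmeans(W, k)
    token_ids = [np.flatnonzero(assign == c) for c in range(k)]
    return centroids, token_ids

def fast_projection(h, W, centroids, token_ids):
    """Inference step: route the hidden context vector h to its nearest
    cluster, then run the output projection only over that cluster's
    (much smaller) slice of W instead of the full vocab."""
    c = ((centroids - h) ** 2).sum(-1).argmin()
    cand = token_ids[c]
    logits = W[cand] @ h  # projection over candidate tokens only
    return int(cand[logits.argmax()])  # token id in the full vocab
```

The speedup comes from the restricted matrix-vector product: each step multiplies `h` against roughly `|V|/k` rows instead of all `|V|`, at the cost of a small cluster-routing step and the risk of missing the true argmax when it falls outside the predicted cluster.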