Paper Title

Transformer++

Paper Authors

Prakhar Thapak, Prodip Hore

Paper Abstract

Recent advancements in attention mechanisms have replaced recurrent neural networks and their variants for machine translation tasks. The Transformer, using the attention mechanism alone, achieved state-of-the-art results in sequence modeling. Neural machine translation based on the attention mechanism is parallelizable and handles long-range dependencies among words in a sentence more effectively than recurrent neural networks. One of the key concepts in attention is to learn three matrices, query, key, and value, where global dependencies among words are learned by linearly projecting word embeddings through these matrices. Multiple query, key, and value matrices can be learned simultaneously, each focusing on a different subspace of the embedding dimension; this is called multi-head attention in the Transformer. We argue that certain dependencies among words are better learned through an intermediate context than by directly modeling word-word dependencies. This can happen due to the nature of certain dependencies, or due to a lack of patterns that makes them difficult to model globally with multi-head self-attention. In this work, we propose a new way of learning dependencies through an intermediate context in multi-head attention using convolution. This new form of multi-head attention, combined with the traditional form, achieves better results than the Transformer on the WMT 2014 English-to-German and English-to-French translation tasks. We also introduce a framework to learn POS tagging and NER information during encoder training, which further improves results, achieving a new state of the art of 32.1 BLEU (1.4 BLEU better than the previous best) on the WMT 2014 English-to-German task and 44.6 BLEU (1.1 BLEU better than the previous best) on the WMT 2014 English-to-French translation task. We call this Transformer++.
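The sketch below is only an illustration of the mechanism the abstract describes, not the paper's exact architecture: it shows standard multi-head attention with learned query, key, and value projections, plus a hypothetical convolution branch that builds an intermediate local context from which keys and values are derived. The class name, kernel size, and the placement of the convolution are assumptions for illustration; PyTorch is assumed as the framework.

# Minimal sketch (assumed PyTorch): multi-head attention with a hypothetical
# convolution-based intermediate context, as motivated by the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadWithConvContext(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, kernel_size: int = 3):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Query, key, value projections (split into heads later).
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Hypothetical context branch: a grouped 1D convolution over the
        # sequence summarizes local neighborhoods before attention.
        self.context_conv = nn.Conv1d(
            d_model, d_model, kernel_size,
            padding=kernel_size // 2, groups=n_heads,
        )
        self.out_proj = nn.Linear(d_model, d_model)

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        context = self.context_conv(x.transpose(1, 2)).transpose(1, 2)
        q = self._split_heads(self.q_proj(x))
        # Keys and values come from the convolved context rather than
        # directly from the word embeddings (illustrative choice).
        k = self._split_heads(self.k_proj(context))
        v = self._split_heads(self.v_proj(context))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(x.size(0), x.size(1), -1)
        return self.out_proj(out)

if __name__ == "__main__":
    layer = MultiHeadWithConvContext()
    tokens = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
    print(layer(tokens).shape)         # torch.Size([2, 10, 512])

In the paper's formulation this convolutional form of attention is used alongside, not instead of, the traditional heads; the sketch keeps only one branch to stay short.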
