Paper Title

Universal Sentence Representation Learning with Conditional Masked Language Model

Paper Authors

Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, Eric Darve

Paper Abstract

This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences. Our English CMLM model achieves state-of-the-art performance on SentEval, even outperforming models learned using supervised signals. As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains. We find that a multilingual CMLM model co-trained with bitext retrieval (BR) and natural language inference (NLI) tasks outperforms the previous state-of-the-art multilingual models by a large margin, e.g. 10% improvement upon baseline models on cross-lingual semantic search. We explore the same language bias of the learned representations, and propose a simple, post-training and model agnostic approach to remove the language identifying information from the representation while still retaining sentence semantics.
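To make the conditioning idea in the abstract concrete, below is a minimal PyTorch sketch of a conditional masked language model: one encoder produces a single vector for an adjacent sentence A, and the MLM over the masked sentence B attends to that vector while reconstructing the masked tokens. The module sizes, mean pooling, the linear projection, and prepending the sentence vector as an extra "token" are illustrative assumptions for this sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class ConditionalMLM(nn.Module):
    """Sketch: an MLM over masked sentence B, conditioned on a vector encoding sentence A."""

    def __init__(self, vocab_size=30522, hidden=256, n_layers=4, n_heads=4, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        # Two separate Transformer stacks: one encodes sentence A into a vector,
        # the other runs the conditional MLM over the masked sentence B.
        self.sent_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, n_heads, 4 * hidden, batch_first=True), n_layers)
        self.mlm_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, n_heads, 4 * hidden, batch_first=True), n_layers)
        self.cond_proj = nn.Linear(hidden, hidden)    # projects A's vector into the MLM's space
        self.mlm_head = nn.Linear(hidden, vocab_size)

    def _embed(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        return self.tok_emb(ids) + self.pos_emb(pos)

    def encode(self, ids_a):
        # Mean-pooled encoder output serves as the sentence vector; after training,
        # this is the representation used for downstream tasks such as SentEval.
        return self.cond_proj(self.sent_encoder(self._embed(ids_a)).mean(dim=1))

    def forward(self, ids_a, ids_b_masked):
        sent_vec = self.encode(ids_a)                      # (B, H)
        emb_b = self._embed(ids_b_masked)                  # (B, Lb, H)
        # Prepend the sentence vector as an extra position so every token in B
        # can attend to it while filling in the masked tokens.
        h = self.mlm_encoder(torch.cat([sent_vec.unsqueeze(1), emb_b], dim=1))
        return self.mlm_head(h[:, 1:, :])                  # vocabulary logits for B's positions
```

Training would follow standard MLM practice: compute a cross-entropy loss on the logits at the masked positions of sentence B only, so that the sentence vector from A must carry the contextual information needed to fill them in.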
