Paper Title
Learning Numeral Embeddings
Paper Authors
Paper Abstract
Word embedding is an essential building block for deep learning methods for natural language processing. Although word embedding has been extensively studied over the years, the problem of how to effectively embed numerals, a special subset of words, is still underexplored. Existing word embedding methods do not learn numeral embeddings well because there are an infinite number of numerals and their individual appearances in training corpora are highly scarce. In this paper, we propose two novel numeral embedding methods that can handle the out-of-vocabulary (OOV) problem for numerals. We first induce a finite set of prototype numerals using either a self-organizing map or a Gaussian mixture model. We then represent the embedding of a numeral as a weighted average of the prototype numeral embeddings. Numeral embeddings represented in this manner can be plugged into existing word embedding learning approaches, such as skip-gram, for training. We evaluated our methods and showed their effectiveness on four intrinsic and extrinsic tasks: word similarity, embedding numeracy, numeral prediction, and sequence labeling.
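The sketch below illustrates the prototype-based idea from the abstract in a minimal form: fit a Gaussian mixture model over numerals to obtain a finite set of prototypes, then embed any numeral, including an unseen (OOV) one, as the posterior-weighted average of per-prototype embedding vectors. It is an illustrative assumption-laden sketch, not the paper's implementation: the function name `embed_numeral`, the signed-log preprocessing, the use of GMM posteriors as weights, and the random prototype vectors (which in practice would be trained with skip-gram) are all details not specified by the abstract.

```python
# Minimal sketch of prototype-based numeral embedding (assumptions noted above).
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy set of numerals; real training would extract these from a corpus.
numerals = np.array([0.5, 2, 3, 10, 12, 100, 150, 1000, 2500, 1e6])

# Work in a signed-log space so prototypes span several orders of magnitude
# (an assumption; the paper may use a different transformation).
def squash(x):
    return np.sign(x) * np.log1p(np.abs(x))

X = squash(numerals).reshape(-1, 1)

# Step 1: induce a small, finite set of prototype numerals with a GMM.
K, DIM = 4, 8  # number of prototypes and embedding dimension (arbitrary here)
gmm = GaussianMixture(n_components=K, random_state=0).fit(X)

# One embedding vector per prototype; placeholders here, but these would be
# the parameters updated by skip-gram training.
prototype_vectors = np.random.randn(K, DIM)

# Step 2: embed a numeral as the posterior-weighted average of prototype vectors.
def embed_numeral(value):
    weights = gmm.predict_proba(squash(np.array([[value]])))[0]  # shape (K,)
    return weights @ prototype_vectors                           # shape (DIM,)

# Any numeral, seen or unseen, now receives an embedding of fixed dimension.
print(embed_numeral(42).shape)     # (8,)
print(embed_numeral(7.5e4).shape)  # (8,)
```

Because the weights depend only on the numeral's value relative to the prototypes, this construction sidesteps the OOV problem: no finite vocabulary of numerals is needed, and nearby values receive similar embeddings.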