Acoustic Neighbor Embeddings

Paper Title

Acoustic Neighbor Embeddings

Authors

Jeon, Woojay

Abstract

This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings, where speech or text of arbitrary length is mapped to a vector space of fixed, reduced dimensions by adapting stochastic neighbor embedding (SNE) to sequential inputs. The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences. Two encoder neural networks are trained: an acoustic encoder that accepts speech signals in the form of frame-wise subword posterior probabilities obtained from an acoustic model, and a text encoder that accepts text in the form of subword transcriptions. Compared to a triplet loss criterion, the proposed method is shown to have more effective gradients for neural network training. Experimentally, it also gives more accurate results with low-dimensional embeddings when the two encoder networks are used in tandem in a word (name) recognition task, and when the text encoder network is used standalone in an approximate phonetic matching task. In particular, in an isolated name recognition task depending solely on Euclidean nearest-neighbor search between the proposed embedding vectors, the recognition accuracy is identical to that of conventional finite state transducer (FST)-based decoding, using test data with up to 1 million names in the vocabulary and 40 dimensions in the embeddings.
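
The recognition step described in the abstract, a Euclidean nearest-neighbor search between embedding vectors, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the encoders are not reproduced here, so random 40-dimensional vectors stand in for text-encoder outputs, a slightly perturbed copy of one of them stands in for an acoustic-encoder output, and the names `nearest_name`, `vocab_embeddings`, and `query` are hypothetical.

```python
import numpy as np


def nearest_name(query, vocab_embeddings, names):
    """Return the name whose embedding is closest to `query` in Euclidean distance."""
    diffs = vocab_embeddings - query        # shape (N, D), broadcast over rows
    sq_dists = np.sum(diffs * diffs, axis=1)  # squared Euclidean distances, shape (N,)
    return names[int(np.argmin(sq_dists))]    # argmin is unchanged by the square root


# Assumption: random vectors stand in for the text encoder's outputs.
rng = np.random.default_rng(0)
names = ["alice", "bob", "carol", "dave"]
vocab_embeddings = rng.normal(size=(len(names), 40))

# Assumption: an acoustic embedding of a spoken "carol" lands near its text embedding.
query = vocab_embeddings[2] + 0.01 * rng.normal(size=40)

result = nearest_name(query, vocab_embeddings, names)
print(result)
```

A brute-force scan like this costs one distance computation per vocabulary entry; for a vocabulary on the order of the 1 million names mentioned in the abstract, an indexed or approximate nearest-neighbor search would be a natural substitute, though the abstract does not say which search method was used.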
