Frequency-Directional Attention Model for Multilingual Automatic Speech Recognition

Paper Title

Frequency-Directional Attention Model for Multilingual Automatic Speech Recognition

Authors

Akihiro Dobashi, Chee Siang Leow, Hiromitsu Nishizaki

Abstract

This paper proposes a frequency-directional attention model that transforms speech features for end-to-end (E2E) automatic speech recognition (ASR). The idea rests on the hypothesis that the phoneme system of each language has its own characteristic frequency bands when its phonemes are uttered. By transforming the input Mel filter bank features with an attention model along the frequency direction, a feature transformation suited to ASR in each language can be expected. This paper introduces a Transformer encoder as the frequency-directional attention model. We evaluated the proposed method on a multilingual E2E ASR system for six different languages and found that introducing the frequency-directional attention mechanism achieved, on average, 5.3 points higher accuracy than the ASR model for each language. Furthermore, visualization of the attention weights of the proposed method suggested that acoustic features can be transformed in a way that takes the frequency characteristics of each language into account.
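
To make the idea of attention along the frequency direction more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that, within each frame, every Mel filter bank bin is treated as one token, and it shows only a single self-attention head; a full Transformer encoder would also include multiple heads, residual connections, layer normalization, and a feed-forward sublayer. The function names, parameter names, dimensions, and the random untrained weights are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def init_params(n_mels, d_model, seed=0):
    """Random, untrained parameters. Names and sizes are hypothetical."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        "w_in": rng.standard_normal(d_model) * scale,           # scalar -> d_model
        "pos": rng.standard_normal((n_mels, d_model)) * scale,  # one embedding per Mel bin
        "w_q": rng.standard_normal((d_model, d_model)) * scale,
        "w_k": rng.standard_normal((d_model, d_model)) * scale,
        "w_v": rng.standard_normal((d_model, d_model)) * scale,
        "w_out": rng.standard_normal(d_model) * scale,          # d_model -> scalar
    }


def frequency_attention(fbank, params):
    """Single-head self-attention over Mel bins, applied frame by frame.

    fbank: array of shape (T, F), T frames by F Mel filter bank bins.
    Returns the transformed features (T, F) and attention weights (T, F, F).
    """
    # Assumption: each Mel bin in a frame becomes one token of size d_model.
    tokens = fbank[..., None] * params["w_in"] + params["pos"]  # (T, F, d)
    q = tokens @ params["w_q"]
    k = tokens @ params["w_k"]
    v = tokens @ params["w_v"]
    d_model = q.shape[-1]
    # Scaled dot-product scores between every pair of frequency bins.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_model)        # (T, F, F)
    weights = softmax(scores, axis=-1)
    context = weights @ v                                       # (T, F, d)
    transformed = context @ params["w_out"]                     # (T, F)
    return transformed, weights


# Usage with random stand-in data: 5 frames of 8-bin filter bank features.
features = np.random.default_rng(1).standard_normal((5, 8))
output, attention = frequency_attention(features, init_params(n_mels=8, d_model=4))
print(output.shape, attention.shape)
```

The returned attention weights have one F-by-F matrix per frame, which is the kind of quantity one could average and plot per language to inspect which frequency bands attend to which.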
