Paper Title
Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition
Authors
Abstract
Code-switching (CS) occurs when a speaker alternates between words of two or more languages within a single sentence or across sentences. Automatic speech recognition (ASR) of CS speech must therefore handle two or more languages simultaneously. In this study, we propose a Transformer-based architecture with two symmetric language-specific encoders that capture individual language attributes, thereby improving the acoustic representation of each language. These representations are combined using a language-specific multi-head attention mechanism in the decoder module. Each encoder and its corresponding attention module in the decoder are pre-trained on a large monolingual corpus to alleviate the impact of limited CS training data. We call such a network a multi-encoder-decoder (MED) architecture. Experiments on the SEAME corpus show that the proposed MED architecture achieves 10.2% and 10.8% relative error rate reduction on the CS evaluation sets with Mandarin and English as the matrix language, respectively.
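To make the decoder-side combination concrete, here is a minimal NumPy sketch of one decoder step attending over two language-specific encoder outputs. This is an illustration only: single-head attention stands in for multi-head, and the contexts are merged by concatenation plus a toy projection, since the abstract does not specify the exact merge operation. All names (`med_decoder_step`, `enc_zh`, `enc_en`) are hypothetical.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def med_decoder_step(dec_state, enc_zh, enc_en):
    # Language-specific cross-attention over each encoder's output...
    ctx_zh = attention(dec_state, enc_zh, enc_zh)  # Mandarin context
    ctx_en = attention(dec_state, enc_en, enc_en)  # English context
    # ...then a simple combination (assumed: concatenate + project).
    d = dec_state.shape[-1]
    W = np.eye(2 * d, d)  # toy projection matrix; learned in practice
    return np.concatenate([ctx_zh, ctx_en], axis=-1) @ W

rng = np.random.default_rng(0)
d = 8
dec = rng.standard_normal((1, d))      # one decoder query vector
enc_zh = rng.standard_normal((5, d))   # Mandarin-specific encoder frames
enc_en = rng.standard_normal((5, d))   # English-specific encoder frames
out = med_decoder_step(dec, enc_zh, enc_en)
print(out.shape)  # (1, 8)
```

In the full MED model, each `attention` call above would be a multi-head attention module pre-trained jointly with its matching monolingual encoder, which is what lets the limited CS data be used only for fine-tuning the combined network.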