Paper Title

How Phonotactics Affect Multilingual and Zero-shot ASR Performance

Authors

Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

Abstract

The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system's performance, and retaining only the target language's phonotactic data in LM training is preferable.
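The abstract's main practical finding is that the phone-level ("phonotactic") LM in the hybrid system works best when trained only on the target language's data. The sketch below is not the authors' code; it is a minimal illustration, under the assumption of a toy add-one-smoothed phone bigram LM, of what training and scoring such a phonotactic LM on target-language phone sequences could look like. Function names such as train_phone_bigram and the toy transcripts are hypothetical.

```python
from collections import defaultdict
import math

def train_phone_bigram(phone_seqs, smoothing=1.0):
    """Estimate add-one-smoothed bigram probabilities over phone tokens."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in phone_seqs:
        tokens = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1.0
    vocab_size = len(vocab)

    def logprob(prev, cur):
        # Smoothed conditional probability P(cur | prev).
        numerator = counts[prev][cur] + smoothing
        denominator = sum(counts[prev].values()) + smoothing * vocab_size
        return math.log(numerator / denominator)

    return logprob

def sequence_logprob(logprob, seq):
    """Total log-probability of a phone sequence under the bigram LM."""
    tokens = ["<s>"] + list(seq) + ["</s>"]
    return sum(logprob(p, c) for p, c in zip(tokens, tokens[1:]))

# Toy usage with hypothetical target-language phone transcriptions (IPA-like tokens).
target_lang_phones = [["k", "a", "t"], ["t", "a", "k"], ["k", "a", "k", "a"]]
lm = train_phone_bigram(target_lang_phones)
print(sequence_logprob(lm, ["k", "a", "t"]))  # familiar transitions -> higher score
print(sequence_logprob(lm, ["t", "k", "t"]))  # unseen transition t->k -> lower score
```

In the paper's hybrid setup, a phonotactic LM of this kind is combined with a separately trained acoustic model during decoding; that explicit AM/LM factorization is precisely what the encoder-decoder baseline lacks, and it is what lets the authors attribute zero-shot degradation to pronunciation versus phonotactic mismatch.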
