Paper Title
Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition
Paper Authors
Paper Abstract
Recently, there has been increasing interest in two-pass streaming end-to-end speech recognition (ASR), which adds a 2nd-pass rescoring model on top of a conventional 1st-pass streaming ASR model to improve recognition accuracy while keeping latency low. One of the latest 2nd-pass rescoring models, the Transformer Rescorer, takes the n-best initial outputs and audio embeddings from the 1st-pass model and then chooses the best output by re-scoring the n-best initial outputs. However, training this Transformer Rescorer requires expensive paired audio-text training data because the model uses audio embeddings as input. In this work, we present a Joint Audio/Text training method for the Transformer Rescorer to leverage unpaired text-only data, which is relatively cheaper than paired audio-text data. We evaluate the Transformer Rescorer with our Joint Audio/Text training on the LibriSpeech dataset as well as our large-scale in-house dataset, and show that our training method can significantly improve word error rate (WER) compared to the standard Transformer Rescorer without requiring any extra model parameters or additional latency.
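The sketch below, a minimal PyTorch illustration rather than the authors' implementation, shows the rescoring step described in the abstract: a small Transformer decoder attends to the 1st-pass audio embeddings, scores each n-best hypothesis by the sum of its token log-probabilities, and keeps the highest-scoring one. The model sizes, vocabulary, and scoring rule are illustrative assumptions.

```python
# Minimal sketch of a 2nd-pass Transformer Rescorer (assumed architecture,
# not the paper's exact configuration).
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # assumed subword vocabulary size
D_MODEL = 256       # assumed embedding / attention width


class TransformerRescorer(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, d_model=D_MODEL):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def score(self, hyp_tokens, audio_emb):
        """Sum of token log-probabilities of each hypothesis, conditioned on
        the 1st-pass audio embeddings through cross-attention."""
        inputs = hyp_tokens[:, :-1]   # teacher-forced inputs
        targets = hyp_tokens[:, 1:]   # tokens to be scored
        tgt = self.token_emb(inputs)
        # Causal mask so position t only sees tokens < t.
        seq_len = inputs.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        hidden = self.decoder(tgt, audio_emb, tgt_mask=causal)
        log_probs = self.out_proj(hidden).log_softmax(dim=-1)
        token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        return token_lp.sum(dim=-1)   # one score per hypothesis


# Toy usage: re-rank 4 hypothetical n-best outputs for one utterance.
rescorer = TransformerRescorer()
audio_emb = torch.randn(4, 50, D_MODEL)         # audio embeddings tiled per hypothesis
n_best = torch.randint(1, VOCAB_SIZE, (4, 12))  # 4 hypotheses, 12 tokens each
scores = rescorer.score(n_best, audio_emb)
print("re-ranked best hypothesis index:", scores.argmax().item())
```

Because the scorer only needs audio embeddings through cross-attention, the joint audio/text idea in the abstract amounts to also training the same decoder on unpaired text, so no extra parameters or inference-time latency are introduced.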