Paper Title
4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders
Paper Authors
Paper Abstract
The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models. Since each of these network architectures has pros and cons, a typical use case is to switch between these separate models depending on the application requirements, resulting in the increased overhead of maintaining all models. Several methods for integrating these complementary models to mitigate the overhead issue have been proposed; however, if we integrate more models, we will further benefit from their complementary properties and realize broader applications with a single system. This paper proposes four-decoder joint modeling (4D) of CTC, attention, RNN-T, and mask-predict, which has the following three advantages: 1) The four decoders are jointly trained so that they can be easily switched depending on the application scenario. 2) Joint training may bring model regularization and improve model robustness thanks to their complementary properties. 3) Novel one-pass joint decoding methods using CTC, attention, and RNN-T further improve performance. The experimental results showed that the proposed model consistently reduced WER.
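The joint training described in the abstract is typically realized as a weighted interpolation of the per-decoder losses over a shared encoder. A minimal sketch of such an objective follows; the weight names and values here are illustrative assumptions, not the paper's actual settings:

```python
def joint_loss(l_ctc, l_att, l_rnnt, l_mask,
               w_ctc=0.25, w_att=0.25, w_rnnt=0.25, w_mask=0.25):
    """Combine the four decoder losses into one training objective.

    Each l_* is the scalar loss from one decoder branch (CTC, attention,
    RNN-T, mask-predict) computed on the shared encoder output. The
    interpolation weights are hypothetical placeholders; the paper's
    weighting scheme may differ.
    """
    return w_ctc * l_ctc + w_att * l_att + w_rnnt * l_rnnt + w_mask * l_mask

# Example: equal weighting of four per-batch losses
total = joint_loss(2.0, 1.5, 1.8, 2.2)  # 0.25 * (2.0 + 1.5 + 1.8 + 2.2) ≈ 1.875
```

In a real system each `l_*` would come from its own loss module (e.g. a CTC loss and a transducer loss on the same encoder states), so a single backward pass regularizes the shared encoder with all four objectives at once.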