Paper Title
Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition
Paper Authors
Paper Abstract
In this paper, we present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified. We propose a dynamic chunk-based attention strategy to allow arbitrary right context length. At inference time, the CTC decoder generates n-best hypotheses in a streaming way. The inference latency can be easily controlled by simply changing the chunk size. The CTC hypotheses are then rescored by the attention decoder to get the final result. This efficient rescoring process introduces very little sentence-level latency. Our experiments on the open 170-hour AISHELL-1 dataset show that the proposed method can unify the streaming and non-streaming models simply and efficiently. On the AISHELL-1 test set, our unified model achieves a 5.60% relative character error rate (CER) reduction in non-streaming ASR compared to a standard non-streaming transformer. The same model achieves 5.42% CER with 640 ms latency in a streaming ASR system.
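To make the chunk-based attention strategy concrete, below is a minimal sketch in PyTorch of how such a chunk attention mask could be constructed and how a chunk size might be drawn dynamically during training. This is an illustration of the idea described in the abstract, not the authors' implementation: the function names (chunk_attention_mask, sample_dynamic_chunk_size), the 50% full-context probability, and the 1-25 frame chunk range are assumptions made here for the example.

```python
# Sketch of a chunk-based self-attention mask: each frame attends to all
# frames in its own chunk and in every previous chunk, so the encoder's
# lookahead (and thus latency) is bounded by the chunk size.
import torch


def chunk_attention_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    """Return a (num_frames, num_frames) boolean mask; True = attendable."""
    mask = torch.zeros(num_frames, num_frames, dtype=torch.bool)
    for t in range(num_frames):
        # End (exclusive) of the chunk that frame t belongs to.
        chunk_end = (t // chunk_size + 1) * chunk_size
        mask[t, : min(chunk_end, num_frames)] = True
    return mask


def sample_dynamic_chunk_size(num_frames: int) -> int:
    """Hypothetical dynamic-chunk training schedule: with 50% probability use
    full context (non-streaming behaviour), otherwise draw a small chunk
    (streaming behaviour), so a single model is trained for both modes."""
    if torch.rand(1).item() < 0.5:
        return num_frames                              # full-context training
    return int(torch.randint(1, 26, (1,)).item())      # streaming chunk, 1..25 frames


if __name__ == "__main__":
    mask = chunk_attention_mask(num_frames=8, chunk_size=4)
    print(mask.int())
    # Frames 0-3 attend only to frames 0-3; frames 4-7 attend to frames 0-7.
```

With chunk_size equal to the utterance length the mask is all ones and the encoder sees full context (non-streaming); with a small chunk, the right context per frame is bounded by the chunk, which is what lets the same model trade latency for accuracy at inference time simply by changing the chunk size.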