Paper Title
Subword Regularization: An Analysis of Scalability and Generalization for End-to-End Automatic Speech Recognition
Paper Authors
Paper Abstract
Subwords are the most widely used output units in end-to-end speech recognition. They combine the best of two worlds by modeling the majority of frequent words directly, while at the same time allowing open-vocabulary speech recognition by backing off to shorter units or characters to construct words unseen during training. However, mapping text to subwords is ambiguous, and often multiple segmentation variants are possible. Yet, many systems are trained using only the most likely segmentation. Recent research suggests that sampling subword segmentations during training acts as a regularizer for neural machine translation and speech recognition models, leading to performance improvements. In this work, we conduct a principled investigation of the regularizing effect of the subword segmentation sampling method for a streaming end-to-end speech recognition task. In particular, we evaluate the contribution of subword regularization as a function of training dataset size. Our results suggest that subword regularization provides a consistent relative word-error-rate reduction of 2-8%, even in a large-scale setting with datasets of up to 20k hours. Further, we analyze the effect of subword regularization on the recognition of unseen words and its implications for beam diversity.
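To make the segmentation-sampling idea concrete, the sketch below uses the SentencePiece unigram model (Kudo, 2018), a common implementation of subword regularization; the file names, vocabulary size, and sampling hyperparameters are illustrative assumptions, not the setup used in this paper.

```python
# Minimal sketch of subword segmentation sampling with SentencePiece.
# All paths and hyperparameters here are hypothetical.
import sentencepiece as spm

# Train a small unigram subword model on a text corpus (hypothetical file).
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # one transcript per line
    model_prefix="spm_unigram",
    vocab_size=1000,
    model_type="unigram",        # the unigram LM enables segmentation sampling
)

sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")
text = "subword regularization improves speech recognition"

# Deterministic, most likely segmentation: what many systems train on.
print(sp.encode(text, out_type=str))

# Sampled segmentations: a potentially different tokenization on each call,
# which acts as a regularizer when training targets are generated on the fly.
for _ in range(3):
    print(sp.encode(text, out_type=str,
                    enable_sampling=True,
                    nbest_size=-1,   # sample over all segmentation hypotheses
                    alpha=0.1))      # smoothing; smaller values sample more diversely
```

In a training pipeline, the sampled call would replace the deterministic one when converting each transcript to target token IDs, so the model sees varied segmentations of the same text across epochs.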