Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition

Paper Title

Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition

Authors

Mengzhe Geng, Shansong Liu, Jianwei Yu, Xurong Xie, Shoukang Hu, Zi Ye, Zengrui Jin, Xunying Liu, Helen Meng

Abstract

Automatic recognition of disordered speech remains a highly challenging task to date. Sources of variability commonly found in normal speech, including accent, age or gender, when further compounded with the underlying causes of speech impairment and varying severity levels, create large diversity among speakers. To this end, speaker adaptation techniques play a vital role in current speech recognition systems. Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of speech spectrum are proposed to facilitate both accurate speech intelligibility assessment and auxiliary feature based speaker adaptation of state-of-the-art hybrid DNN and end-to-end disordered speech recognition systems. Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-Vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation. Learning hidden unit contribution (LHUC) based speaker adaptation was further applied. The final speaker adapted system using the proposed spectral basis embedding features gave an overall WER of 25.6% on the UASpeech test set of 16 dysarthric speakers.
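The SVD step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the speech spectrum is available as a frequency-by-time matrix. The function name `spectro_temporal_bases`, the number of retained bases, and the synthetic input are illustrative assumptions and are not taken from the paper. The abstract does not say how the bases are turned into deep embedding features, so that stage is not shown.

```python
import numpy as np


def spectro_temporal_bases(spectrogram, num_bases=2):
    """Split a (freq x time) matrix into spectral and temporal bases via SVD.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    # Thin SVD: spectrogram = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(spectrogram, full_matrices=False)
    spectral_basis = U[:, :num_bases]      # shape: (num_freq_bins, num_bases)
    temporal_basis = Vt[:num_bases, :].T   # shape: (num_frames, num_bases)
    return spectral_basis, s[:num_bases], temporal_basis


# Synthetic stand-in for a 40-bin, 120-frame magnitude spectrogram (assumption).
rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((40, 120)))

U_k, s_k, V_k = spectro_temporal_bases(spec, num_bases=2)
print(U_k.shape, s_k.shape, V_k.shape)
```

In this sketch the left singular vectors summarize structure along the frequency axis and the right singular vectors summarize structure along the time axis, which is one plausible reading of "spectral" and "temporal" subspace bases.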
