Speaker adaptation for Wav2vec2 based dysarthric ASR

Paper title

Speaker adaptation for Wav2vec2 based dysarthric ASR

Authors

Baskar, Murali Karthick; Herzig, Tim; Nguyen, Diana; Diez, Mireia; Polzehl, Tim; Burget, Lukáš; Černocký, Jan "Honza"

Abstract

Dysarthric speech recognition poses major challenges due to the lack of training data and the heavy mismatch in speaker characteristics. Recent ASR systems have benefited from readily available pretrained models such as wav2vec2 to improve recognition performance. Speaker adaptation using fMLLR and xvectors has provided major gains for dysarthric speech with very little adaptation data. However, the integration of wav2vec2 with fMLLR features or xvectors during wav2vec2 fine-tuning is yet to be explored. In this work, we propose a simple adaptation network for fine-tuning wav2vec2 using fMLLR features. The adaptation network is also flexible enough to handle other speaker-adaptive features such as xvectors. Experimental analysis shows steady improvements using our proposed approach across all impairment severity levels, and the approach attains 57.72% WER for high severity on the UASpeech dataset. We also performed experiments on a German dataset to substantiate the consistency of our proposed approach across diverse domains.
