Paper Title

Bayesian Learning for Deep Neural Network Adaptation

Authors

Xurong Xie, Xunying Liu, Tan Lee, Lan Wang

Abstract

A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences. Speaker adaptation techniques play a vital role in reducing this mismatch. Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness. When the amount of speaker-level data is limited, speaker adaptation is prone to overfitting and poor generalization. To address this issue, this paper proposes a full Bayesian learning-based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty given limited speaker-specific adaptation data. This framework is investigated in three forms of model-based DNN adaptation techniques: Bayesian learning of hidden unit contributions (BLHUC), Bayesian parameterized activation functions (BPAct), and Bayesian hidden unit bias vectors (BHUB). In all three methods, deterministic SD parameters are replaced by latent variable posterior distributions for each speaker, whose parameters are efficiently estimated using a variational inference-based approach. Experiments on LF-MMI TDNN/CNN-TDNN systems trained on the 300-hour speed-perturbed Switchboard corpus suggest that the proposed Bayesian adaptation approaches consistently outperform deterministic adaptation on the NIST Hub5'00 and RT03 evaluation sets. When using only the first five utterances from each speaker as adaptation data, significant word error rate reductions of up to 1.4% absolute (7.2% relative) were obtained on the CallHome subset. The efficacy of the proposed Bayesian adaptation techniques is further demonstrated in a comparison against the state-of-the-art performance obtained on the same task using the most recent systems reported in the literature.
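To make the idea of replacing a deterministic SD parameter with a per-speaker posterior more concrete, the following is a minimal NumPy sketch of one way a BLHUC-style scaling could look. It is not the authors' implementation. The abstract does not specify these details, so the following are all assumptions made for illustration: a diagonal Gaussian posterior, a standard normal prior, the `2 * sigmoid(r)` amplitude function commonly used in LHUC, reparameterized sampling, and every function name.

```python
import numpy as np

def lhuc_scale(r):
    """Map unconstrained SD parameters to amplitudes in (0, 2) (assumed LHUC form)."""
    return 2.0 / (1.0 + np.exp(-r))

def sample_sd_params(mu, log_sigma, rng):
    """Reparameterized draw r = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def kl_to_prior(mu, log_sigma, prior_mu=0.0, prior_sigma=1.0):
    """KL(N(mu, sigma^2) || N(prior_mu, prior_sigma^2)), summed over hidden units."""
    sigma2 = np.exp(2.0 * log_sigma)
    return np.sum(
        np.log(prior_sigma) - log_sigma
        + (sigma2 + (mu - prior_mu) ** 2) / (2.0 * prior_sigma ** 2)
        - 0.5
    )

def adapted_hidden(hidden, mu, log_sigma, rng=None):
    """Scale hidden activations; sample r if rng is given, else use the posterior mean."""
    r = mu if rng is None else sample_sd_params(mu, log_sigma, rng)
    return hidden * lhuc_scale(r)
```

In a variational setup of this kind, the per-speaker `mu` and `log_sigma` would typically be estimated by maximizing a lower bound made up of the expected data log-likelihood under sampled `r` minus `kl_to_prior(mu, log_sigma)`. The KL term pulls the posterior back toward the prior when a speaker has only a few utterances, which is the intuition behind the robustness to limited adaptation data described above.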
