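DENT-DDSP: Data-efficient noisy speech generator using differentiable digital signal processors for explicit distortion modelling and noise-robust speech recognition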

Paper title

DENT-DDSP: Data-efficient noisy speech generator using differentiable digital signal processors for explicit distortion modelling and noise-robust speech recognition

Authors

Guo, Z., Chen, C., Chng, E. S.

Abstract

The performance of automatic speech recognition (ASR) systems degrades drastically under noisy conditions. Explicit distortion modelling (EDM), as a feature compensation step, can enhance ASR systems under such conditions by simulating in-domain noisy speech from its clean counterpart. Yet existing distortion models are either non-trainable or unexplainable, and they often lack controllability and generalization ability. In this paper, we propose DENT-DDSP, a fully explainable and controllable model for EDM. DENT-DDSP utilizes novel differentiable digital signal processing (DDSP) components and requires only 10 seconds of training data to achieve high fidelity. Experiments show that the noisy data simulated by DENT-DDSP achieves the highest simulation fidelity among the compared baseline models in terms of multi-scale spectral loss (MSSL). Moreover, to validate whether the data simulated by DENT-DDSP can replace scarce in-domain noisy data in noise-robust ASR tasks, several downstream ASR models with the same architecture are trained on the simulated data and on the real data. The model trained on the noisy data simulated by DENT-DDSP performs similarly to the benchmark, with a 2.7% difference in word error rate (WER). The code of the model is released online.
