Paper Title
Feature Learning and Ensemble Pre-Tasks Based Self-Supervised Speech Denoising and Dereverberation
Paper Authors
Paper Abstract
Self-supervised learning (SSL) has achieved great success in monaural speech enhancement, yet the accuracy of target speech estimation, particularly for unseen speakers, remains inadequate with existing pre-tasks. Because the speech signal carries multi-faceted information, including speaker identity, paralinguistics, and spoken content, learning a latent representation for speech enhancement is a challenging task. In this paper, we study the effectiveness of each feature commonly used in speech enhancement and exploit feature combinations in the SSL setting. We also propose an ensemble training strategy: the latent representation of the clean speech signal is learned while the dereverberated mask and the estimated ratio mask are exploited to denoise and dereverberate the mixture. Latent representation learning and mask estimation are treated as two pre-tasks in the training stage. In addition, to study the relative effectiveness of the pre-tasks, we compare different training routines and further refine the performance. The NOISEX and DAPS corpora are used to evaluate the efficacy of the proposed method, which outperforms state-of-the-art methods.
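To make the two pre-tasks described in the abstract concrete, the following is a minimal PyTorch sketch of an ensemble objective that (1) learns a latent representation of clean speech with an autoencoder and (2) estimates a ratio mask and a dereverberation-related mask from the mixture spectrum. All module names, layer sizes, the way the masks are applied, and the loss weighting are illustrative assumptions, not the authors' exact architecture or training routine.

import torch
import torch.nn as nn

N_FREQ = 257  # number of STFT frequency bins (assumed)


class CleanSpeechAutoencoder(nn.Module):
    """Pre-task 1: learn a latent representation of the clean magnitude spectrum."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(N_FREQ, 512), nn.ReLU(),
                                     nn.Linear(512, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                     nn.Linear(512, N_FREQ))

    def forward(self, clean_mag):
        z = self.encoder(clean_mag)
        return z, self.decoder(z)


class MaskEstimator(nn.Module):
    """Pre-task 2: estimate a ratio mask (denoising) and a dereverberation mask
    from the noisy-reverberant mixture."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(N_FREQ, 512), nn.ReLU())
        self.ratio_mask = nn.Sequential(nn.Linear(512, N_FREQ), nn.Sigmoid())
        self.dereverb_mask = nn.Sequential(nn.Linear(512, N_FREQ), nn.Sigmoid())

    def forward(self, mix_mag):
        h = self.backbone(mix_mag)
        return self.ratio_mask(h), self.dereverb_mask(h)


def ensemble_loss(ae, masker, mix_mag, clean_mag, alpha=1.0, beta=1.0):
    """Combine the two pre-task objectives; alpha/beta weights are assumptions."""
    _, recon = ae(clean_mag)  # the latent could also feed a consistency term
    loss_repr = nn.functional.mse_loss(recon, clean_mag)

    ratio, dereverb = masker(mix_mag)
    enhanced = mix_mag * ratio * dereverb     # apply both masks to the mixture
    loss_mask = nn.functional.mse_loss(enhanced, clean_mag)

    return alpha * loss_repr + beta * loss_mask


if __name__ == "__main__":
    ae, masker = CleanSpeechAutoencoder(), MaskEstimator()
    opt = torch.optim.Adam(list(ae.parameters()) + list(masker.parameters()), lr=1e-3)
    mix = torch.rand(8, N_FREQ)    # dummy mixture magnitudes
    clean = torch.rand(8, N_FREQ)  # dummy clean magnitudes
    loss = ensemble_loss(ae, masker, mix, clean)
    loss.backward()
    opt.step()
    print(f"combined pre-task loss: {loss.item():.4f}")

The joint loss here simply sums the two pre-task objectives; the paper's comparison of different training routines (e.g., training the pre-tasks jointly or sequentially) would correspond to changing when each term is optimized, which is not shown in this sketch.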