Paper title
Robust End-to-end Speaker Diarization with Generic Neural Clustering
Paper authors
Paper abstract
End-to-end speaker diarization approaches have shown exceptional performance compared with traditional modular approaches. To further improve the performance of end-to-end speaker diarization on real speech recordings, recent works have integrated unsupervised clustering algorithms with end-to-end neural diarization models. However, these methods have a number of drawbacks: 1) the unsupervised clustering algorithms cannot leverage the supervision from the available datasets; 2) the K-means-based unsupervised algorithms that have been explored often suffer from the constraint violation problem; 3) there is an unavoidable mismatch between supervised training and unsupervised inference. In this paper, a robust generic neural clustering approach is proposed that can be integrated with any chunk-level predictor to form a fully supervised end-to-end speaker diarization model. In addition, by leveraging the sequence modelling ability of a recurrent neural network, the proposed neural clustering approach can dynamically estimate the number of speakers during inference. Experiments show that, when integrated with an attractor-based chunk-level predictor, the proposed neural clustering approach can yield a better Diarization Error Rate (DER) than constrained K-means-based clustering approaches under mismatched conditions.
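The abstract names the constraint violation problem without defining it. One plausible reading, assumed here and not stated in the abstract, is that speaker embeddings produced within the same chunk belong to different speakers, so a clustering step should never give two of them the same global label. The sketch below counts how many chunks break that rule. The function name `count_constraint_violations` and the toy inputs are hypothetical and only illustrate the idea.

```python
from collections import Counter


def count_constraint_violations(chunk_ids, cluster_labels):
    """Count chunks in which two or more local speakers share a cluster label.

    Assumption (not from the abstract): embeddings from the same chunk
    must be assigned to different clusters.
    """
    # Count how often each (chunk, cluster) pair occurs.
    pair_counts = Counter(zip(chunk_ids, cluster_labels))
    # A chunk violates the constraint if any of its pairs occurs more than once.
    violating_chunks = {chunk for (chunk, _), n in pair_counts.items() if n > 1}
    return len(violating_chunks)


# Hypothetical toy output of an unconstrained clustering step:
# three chunks, two local speaker embeddings per chunk.
chunk_ids = [0, 0, 1, 1, 2, 2]
cluster_labels = [0, 1, 0, 0, 1, 0]

violations = count_constraint_violations(chunk_ids, cluster_labels)
print(violations)
```

In this toy input, both embeddings of chunk 1 receive label 0, so under the stated assumption that chunk counts as a violation.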