Paper Title
Unsupervised Mismatch Localization in Cross-Modal Sequential Data with Application to Mispronunciations Localization
Paper Authors
Paper Abstract
Content mismatch usually occurs when data from one modality is translated to another, e.g., language learners producing mispronunciations (errors in speech) when reading a sentence (target text) aloud. However, most existing alignment algorithms assume that the content involved in the two modalities is perfectly matched, making it difficult to locate such mismatches between speech and text. In this work, we develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal sequential data, especially for speech-text sequences. More specifically, we propose a hierarchical Bayesian deep learning model, dubbed the mismatch localization variational autoencoder (ML-VAE), which decomposes the generative process of speech into hierarchically structured latent variables that indicate the relationship between the two modalities. Training such a model is challenging because of the complex dependencies among the discrete latent variables involved. To address this challenge, we propose a novel and effective training procedure that alternates between estimating the hard assignments of the discrete latent variables over a specifically designed mismatch localization finite-state acceptor (ML-FSA) and updating the parameters of the neural networks. Focusing on the mismatch localization problem for speech and text, our experimental results show that ML-VAE successfully locates mismatches between text and speech without requiring human annotations for model training.
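To make the alternating training procedure concrete, here is a minimal sketch in PyTorch. Everything in it is an illustrative assumption, not the paper's actual model: `ToyMLVAE`, `hard_assign`, and all hyperparameters are hypothetical stand-ins, the hierarchical latent structure is collapsed into a single per-frame discrete state, and the ML-FSA decoding is replaced by a per-frame argmax. The sketch only shows the two-step alternation the abstract describes: fix the network and hard-assign the discrete latent variables, then fix the assignments and take a gradient step on the network parameters.

```python
import torch
import torch.nn as nn

class ToyMLVAE(nn.Module):
    """Toy stand-in for ML-VAE: an encoder scores one discrete alignment
    state per speech frame, and a decoder reconstructs each frame
    conditioned on that state."""
    def __init__(self, feat_dim=40, num_states=8, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.scorer = nn.Linear(hidden, num_states)   # frame-level state scores
        self.state_emb = nn.Embedding(num_states, hidden)
        self.decoder = nn.Linear(2 * hidden, feat_dim)

    def encode(self, speech):
        h, _ = self.encoder(speech)        # (B, T, hidden)
        return h, self.scorer(h)           # encoder states, per-frame scores

    def decode(self, h, states):
        z = self.state_emb(states)                      # (B, T, hidden)
        return self.decoder(torch.cat([h, z], dim=-1))  # reconstructed frames

def hard_assign(scores):
    # Placeholder for ML-FSA decoding: the paper constrains the best state
    # path with a mismatch-localization finite-state acceptor; this toy
    # version just takes the per-frame argmax and ignores all transition
    # constraints.
    return scores.argmax(dim=-1)

model = ToyMLVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
speech = torch.randn(2, 50, 40)  # fake batch: 2 utterances, 50 frames, 40 features

for step in range(10):
    h, scores = model.encode(speech)
    # Step 1 (network fixed): hard-assign the discrete latent variables.
    with torch.no_grad():
        states = hard_assign(scores)
    # Step 2 (assignments fixed): gradient update of the network parameters.
    recon = model.decode(h, states)
    loss = nn.functional.mse_loss(recon, speech)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the actual method, step 1 would be a constrained best-path search over the ML-FSA rather than an unconstrained argmax, and the loss would come from the model's variational objective rather than a plain frame-level reconstruction error; the sketch only preserves the coordinate-ascent structure of alternating hard inference with parameter updates.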