End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network

Paper Title

End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network

Authors

Abhishek Niranjan, Mukesh Sharma, Sai Bharath Chandra Gutha, M Ali Basha Shaik

Abstract

Machine recognition of atypical speech, such as whispered speech, is a challenging task. We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach by proposing an enhanced transformer architecture that uses both parallel and non-parallel data. We investigate different features, such as Mel-frequency cepstral coefficients and smoothed spectral features. The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation. Further, we investigate the effectiveness of an embedded auxiliary decoder used after N encoder sub-layers, trained with a frame-level objective function for identifying source phoneme labels. We show results on the open-source wTIMIT and CHAINS datasets by measuring word error rate using an end-to-end ASR system, as well as BLEU scores for the generated speech. In addition, we propose a novel method to measure the spectral shape of the generated speech by comparing its formant distributions with those of reference speech, as a formant divergence metric. We find that the formant probability distribution of whisper-to-natural converted speech is similar to the ground-truth distribution. To the best of the authors' knowledge, this is the first time an enhanced transformer has been proposed, both with and without an auxiliary decoder, for whisper-to-natural-speech conversion and vice versa.
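The abstract does not say which divergence underlies the formant divergence metric. As a rough illustration only, the sketch below compares two sets of formant frequencies by histogramming them and computing the Jensen-Shannon divergence. The choice of Jensen-Shannon, the function names, the bin count, and the 4000 Hz upper limit are all assumptions made for this example, not details taken from the paper. The inputs are assumed to be arrays of formant values in Hz (for example, F1 per frame) already extracted by some formant tracker.

```python
import numpy as np

def formant_histogram(formants_hz, bins):
    """Turn formant values (Hz) into a discrete probability distribution."""
    counts, _ = np.histogram(formants_hz, bins=bins)
    probs = counts.astype(float) + 1e-10  # small constant avoids log(0)
    return probs / probs.sum()

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = 0.5 * (p + q)
    kl_pm = float(np.sum(p * np.log2(p / m)))
    kl_qm = float(np.sum(q * np.log2(q / m)))
    return 0.5 * kl_pm + 0.5 * kl_qm

def formant_divergence(converted_hz, reference_hz, n_bins=50, f_max=4000.0):
    """Hypothetical metric: JS divergence between two formant histograms.

    n_bins and f_max are illustrative defaults, not values from the paper.
    """
    bins = np.linspace(0.0, f_max, n_bins + 1)
    p = formant_histogram(converted_hz, bins)
    q = formant_histogram(reference_hz, bins)
    return js_divergence(p, q)
```

Under these assumptions, a value near 0 would indicate that converted and reference speech have similar formant distributions, which is the kind of similarity the abstract reports.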
