Paper Title

TTS-Guided Training for Accent Conversion Without Parallel Data

Paper Authors

Yi Zhou, Zhizheng Wu, Mingyang Zhang, Xiaohai Tian, Haizhou Li

Paper Abstract

Accent Conversion (AC) seeks to change the accent of speech from one (source) to another (target) while preserving the speech content and speaker identity. However, many AC approaches rely on source-target parallel speech data. We propose a novel accent conversion framework without the need of parallel data. Specifically, a text-to-speech (TTS) system is first pretrained with target-accented speech data. This TTS model and its hidden representations are expected to be associated only with the target accent. Then, a speech encoder is trained to convert the accent of the speech under the supervision of the pretrained TTS model. In doing so, the source-accented speech and its corresponding transcription are forwarded to the speech encoder and the pretrained TTS, respectively. The output of the speech encoder is optimized to be the same as the text embedding in the TTS system. At run-time, the speech encoder is combined with the pretrained TTS decoder to convert the source-accented speech toward the target. In the experiments, we converted English with two source accents (Chinese and Indian) to the target accent (American/British/Canadian). Both objective metrics and subjective listening tests successfully validate that, without any parallel data, the proposed approach generates speech samples that are close to the target accent with high speech quality.
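
The training objective described in the abstract can be illustrated with a minimal sketch: a speech encoder is trained so that its outputs match the text-encoder embeddings of a TTS model pretrained on target-accented speech, and at run-time those outputs drive the pretrained TTS decoder. The PyTorch code below is an illustrative assumption rather than the authors' implementation; the module names (SpeechEncoder, tts_text_encoder), dimensions, and the frame-level MSE loss are placeholder choices, and the sketch glosses over how speech frames are aligned in length with text embeddings.

```python
# A minimal sketch (not the authors' code) of TTS-guided training for
# accent conversion: the speech encoder is optimized so its output matches
# the text embeddings produced by a frozen, target-accent-pretrained TTS.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Maps source-accented acoustic features to the TTS text-embedding space."""
    def __init__(self, feat_dim=80, hidden_dim=256, emb_dim=512):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, emb_dim)

    def forward(self, mel):                       # mel: (B, T, feat_dim)
        out, _ = self.rnn(mel)
        return self.proj(out)                     # (B, T, emb_dim)

def training_step(speech_encoder, tts_text_encoder, mel, text_tokens, optimizer):
    """One step: align speech-encoder outputs with frozen TTS text embeddings.
    Assumes both sequences have already been brought to the same length
    (e.g., by duration-based upsampling), which this sketch glosses over."""
    with torch.no_grad():                         # TTS is pretrained and frozen
        text_emb = tts_text_encoder(text_tokens)  # (B, T, emb_dim)
    speech_emb = speech_encoder(mel)              # (B, T, emb_dim)
    loss = nn.functional.mse_loss(speech_emb, text_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Run-time (conversion): feed the speech-encoder output to the pretrained
# TTS decoder in place of the text embeddings, so the source-accented input
# is re-synthesized with the target accent.
```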
