Paper Title


Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language

Authors

Yusuke Yasuda, Tomoki Toda

Abstract


End-to-end text-to-speech synthesis (TTS) can generate highly natural synthetic speech from raw text. However, rendering correct pitch accents is still a challenging problem for end-to-end TTS. To tackle the challenge of rendering correct pitch accents in Japanese end-to-end TTS, we adopt PnG BERT, a self-supervised model pretrained in the character and phoneme domain for TTS. We investigate the effects of the features captured by PnG BERT on Japanese TTS by modifying the fine-tuning condition to determine which conditions help in inferring pitch accents. We manipulate the content of the PnG BERT features from text-oriented to speech-oriented by changing the number of layers fine-tuned during TTS training. In addition, we teach PnG BERT pitch accent information by fine-tuning it with tone prediction as an additional downstream task. Our experimental results show that the features captured by PnG BERT during pretraining contain information helpful for inferring pitch accents, and that PnG BERT outperforms a baseline Tacotron on accent correctness in a listening test.
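The abstract's key control knob is the number of PnG BERT encoder layers that are fine-tuned during TTS training: tuning few (top) layers keeps the features text-oriented, tuning many makes them speech-oriented. A minimal dependency-free sketch of that selection logic is below; the function names, the 12-layer depth, and the convention of tuning the top-most layers are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of the layer-selection scheme described in the abstract:
# only the top `num_finetuned` encoder layers are updated during TTS training,
# while the remaining lower layers stay frozen at their pretrained weights.
# The 12-layer depth used in the examples is an assumption.

def trainable_layers(total_layers: int, num_finetuned: int) -> list[int]:
    """Return indices of the encoder layers left trainable (the top ones)."""
    num_finetuned = min(num_finetuned, total_layers)
    return list(range(total_layers - num_finetuned, total_layers))

def freeze_mask(total_layers: int, num_finetuned: int) -> list[bool]:
    """Per-layer flag: True if the layer's parameters receive gradients."""
    tuned = set(trainable_layers(total_layers, num_finetuned))
    return [i in tuned for i in range(total_layers)]

# Example: fine-tune only the top 2 of 12 layers (text-oriented features).
print(trainable_layers(12, 2))   # layers 10 and 11 are trainable
print(freeze_mask(4, 1))         # only the last layer is unfrozen
```

In a real PyTorch-style setup this mask would be applied by setting `requires_grad = False` on the parameters of every frozen layer before building the optimizer.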
