Paper Title

What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS

Authors

Brooke Stephenson, Laurent Besacier, Laurent Girin, Thomas Hueber

Abstract

In incremental text-to-speech synthesis (iTTS), the synthesizer produces audio output before it has access to the entire input sentence. In this paper, we study the behavior of a neural sequence-to-sequence TTS system used in an incremental mode, i.e., when generating speech output for token n, the system has access to n + k tokens from the text sequence. We first analyze the impact of this incremental policy on the evolution of the encoder representations of token n for different values of k (the lookahead parameter). The results show that, on average, tokens travel 88% of the way to their full-context representation with a one-word lookahead and 94% with a two-word lookahead. We then use a random forest analysis to investigate which text features most influence the evolution towards the final representation. The results show that the most salient factors are related to token length. Finally, we evaluate the effect of the lookahead k at the decoder level using a MUSHRA listening test. The test results contrast with the high figures above: speech synthesized with a two-word lookahead is rated significantly lower in quality than speech synthesized with the full sentence.
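
The incremental policy described in the abstract can be illustrated with a small sketch. It only shows which part of the text is visible when token n is synthesized with lookahead k. It is not the authors' implementation: the function names `visible_prefix` and `incremental_inputs`, the whitespace tokenization, and the 1-based indexing of n are all assumptions made for illustration.

```python
from typing import List


def visible_prefix(tokens: List[str], n: int, k: int) -> List[str]:
    """Return the tokens available when generating speech for token n.

    Assumption: n is 1-indexed and the system sees the first n + k tokens,
    clipped to the sentence length (hypothetical helper, not from the paper).
    """
    if n < 1 or n > len(tokens):
        raise ValueError("n must be between 1 and len(tokens)")
    if k < 0:
        raise ValueError("k must be non-negative")
    return tokens[: min(n + k, len(tokens))]


def incremental_inputs(tokens: List[str], k: int) -> List[List[str]]:
    """List the visible prefix for every token position in the sentence."""
    return [visible_prefix(tokens, n, k) for n in range(1, len(tokens) + 1)]


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    for n, prefix in enumerate(incremental_inputs(sentence, k=1), start=1):
        print(n, prefix)
```

With k = 0 the system sees nothing beyond the current token. Once k is at least the number of remaining tokens, the prefix is the whole sentence, which corresponds to the full-context condition that the lookahead results are compared against.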
