Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0

Paper title

Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0

Authors

Zack Hodari, Catherine Lai, Simon King

Abstract

In English, prosody adds a broad range of information to segment sequences, from information structure (e.g. contrast) to stylistic variation (e.g. expression of emotion). However, when learning to control prosody in text-to-speech voices, it is not clear what exactly the control is modifying. Existing research on discrete representation learning for prosody has demonstrated high naturalness, but no analysis has been performed on what these representations capture, or if they can generate meaningfully-distinct variants of an utterance. We present a phrase-level variational autoencoder with a multi-modal prior, using the mode centres as "intonation codes". Our evaluation establishes which intonation codes are perceptually distinct, finding that the intonation codes from our multi-modal latent model were significantly more distinct than a baseline using k-means clustering. We carry out a follow-up qualitative study to determine what information the codes are carrying. Most commonly, listeners commented on the intonation codes having a statement or question style. However, many other affect-related styles were also reported, including: emotional, uncertain, surprised, sarcastic, passive aggressive, and upset.

Paper title