Paper Title

VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection

Paper Authors

Joanna Hong, Minsu Kim, Yong Man Ro

Paper Abstract

The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance in synthesizing speech from silent talking face videos. However, they have not explicitly considered the varying identity characteristics of different speakers, which pose a challenge in video-to-speech synthesis and become even more critical in unseen-speaker settings. Our approach is to separate the speech content and the visage-style from a given silent talking face video. By guiding the model to independently focus on modeling the two representations, we can obtain highly intelligible speech from the model even when the input video of an unseen subject is given. To this end, we introduce speech-visage feature selection, which separates the speech content and the speaker identity from the visual features of the input video. The disentangled representations are jointly incorporated to synthesize speech through a visage-style based synthesizer, which generates speech by coating the visage-style onto the speech content. Thus, the proposed framework brings the advantage of synthesizing speech containing the right content even from the silent talking face video of an unseen subject. We validate the effectiveness of the proposed framework on the GRID, TCD-TIMIT volunteer, and LRW datasets.
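To make the disentangle-then-synthesize idea concrete, below is a minimal PyTorch sketch of the two-branch design the abstract describes: one branch keeps the time-varying speech content, the other pools a speaker-specific visage style, and the synthesizer "coats" the style onto the content before decoding a mel-spectrogram. The module names (`SpeechVisageSelector`, `VisageStyleSynthesizer`), feature shapes, and the AdaIN-like modulation are assumptions for illustration, not the paper's exact architecture.

```python
# Hypothetical sketch of speech/visage disentanglement and style-conditioned
# synthesis, assuming per-frame 512-d visual features and 80-bin mel output.
import torch
import torch.nn as nn

class SpeechVisageSelector(nn.Module):
    """Splits per-frame visual features into content and style streams."""
    def __init__(self, dim=512):
        super().__init__()
        self.content_head = nn.Linear(dim, dim)  # frame-wise speech content
        self.style_head = nn.Linear(dim, dim)    # speaker-specific visage style

    def forward(self, visual_feats):                       # (B, T, dim)
        content = self.content_head(visual_feats)          # keep time axis
        style = self.style_head(visual_feats).mean(dim=1)  # pool over time
        return content, style

class VisageStyleSynthesizer(nn.Module):
    """Decodes a mel-spectrogram from content modulated by the visage style."""
    def __init__(self, dim=512, n_mels=80):
        super().__init__()
        self.affine = nn.Linear(dim, 2 * dim)  # style -> (scale, shift)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, content, style):
        scale, shift = self.affine(style).chunk(2, dim=-1)
        # "Coat" the visage style onto the content (AdaIN-like modulation).
        h = content * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        h, _ = self.decoder(h)
        return self.to_mel(h)                  # (B, T, n_mels)

# Usage: 75 video frames of 512-d visual features -> 75 mel frames.
feats = torch.randn(2, 75, 512)
content, style = SpeechVisageSelector()(feats)
mel = VisageStyleSynthesizer()(content, style)
print(mel.shape)  # torch.Size([2, 75, 80])
```

Pooling the style over time while leaving the content time-aligned is one simple way to bias each branch toward its intended factor; swapping styles between two clips while keeping the content stream would then illustrate the unseen-speaker behavior the abstract claims.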
