Paper Title

Speech Driven Talking Face Generation from a Single Image and an Emotion Condition

Authors

Sefik Emre Eskimez, You Zhang, Zhiyao Duan

Abstract

Visual emotion expression plays an important role in audiovisual speech communication. In this work, we propose a novel approach to rendering visual emotion expression in speech-driven talking face generation. Specifically, we design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input to render a talking face video synchronized with the speech and expressing the conditioned emotion. Objective evaluation on image quality, audiovisual synchronization, and visual emotion expression shows that the proposed system outperforms a state-of-the-art baseline system. Subjective evaluation of visual emotion expression and video realness also demonstrates the superiority of the proposed system. Furthermore, we conduct a human emotion recognition pilot study using generated videos with mismatched emotions among the audio and visual modalities. Results show that humans respond to the visual modality more significantly than the audio modality on this task.
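The abstract describes a generator conditioned on three inputs: per-frame speech features, a single face image embedding, and a categorical emotion label. As a minimal sketch of what such a conditioning interface could look like, the snippet below one-hot encodes a (hypothetical) emotion category set and tiles the static image and emotion vectors across audio frames before concatenation. The emotion list, feature dimensions, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical categorical emotion set; the paper conditions on a
# categorical label, but the exact categories are assumed here.
EMOTIONS = ["neutral", "happy", "sad", "angry", "fearful", "disgusted"]


def one_hot_emotion(label: str) -> np.ndarray:
    """Encode an emotion label as a one-hot condition vector."""
    vec = np.zeros(len(EMOTIONS), dtype=np.float32)
    vec[EMOTIONS.index(label)] = 1.0
    return vec


def condition_features(audio_feat: np.ndarray,
                       image_feat: np.ndarray,
                       emotion: str) -> np.ndarray:
    """Combine per-frame audio features, a single-image embedding,
    and the emotion condition into one generator input per frame."""
    emo = one_hot_emotion(emotion)
    frames = audio_feat.shape[0]
    # The image embedding and emotion vector are static, so broadcast
    # them across all audio frames before concatenating.
    return np.concatenate(
        [audio_feat,
         np.tile(image_feat, (frames, 1)),
         np.tile(emo, (frames, 1))],
        axis=1,
    )


# Example: 100 audio frames of 80-dim features, a 128-dim face embedding.
audio = np.random.randn(100, 80).astype(np.float32)
face = np.random.randn(128).astype(np.float32)
cond = condition_features(audio, face, "happy")
print(cond.shape)  # (100, 214)
```

Keeping the emotion condition as an explicit per-frame input is one way to let a single generator render the same speech and face with different expressions, which is the mismatched-modality setup the pilot study exploits.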
