"Notice My Speech" - Blending Speech Patterns With Multimedia

Paper Title

"Notice My Speech" -- Blending Speech Patterns With Multimedia

Authors

Sahrawat, Dhruva; Kumar, Yaman; Aggarwal, Shashwat; Yin, Yifang; Shah, Rajiv Ratn; Zimmermann, Roger

Abstract

Speech as a natural signal is composed of three parts: visemes (the visual part of speech), phonemes (the spoken part of speech), and language (the imposed structure). However, video, as a medium for delivering speech and as a multimedia construct, has mostly ignored the cognitive aspects of speech delivery. For example, video applications such as transcoding and compression have so far ignored how speech is delivered and heard. To close the gap between speech understanding and multimedia video applications, in this paper we present initial experiments that model the perception of visual speech and demonstrate a use case in video compression. In the visual speech recognition domain, meanwhile, existing studies have mostly modeled the task as a classification problem while ignoring the correlations between views, phonemes, visemes, and speech perception. This results in solutions that are further from how human perception works. To bridge this gap, we propose a view-temporal attention mechanism that models both view dependence and visemic importance in speech recognition and understanding. We conduct experiments on three public visual speech recognition datasets. The experimental results show that our proposed method outperforms existing work by 4.99% in terms of viseme error rate. Moreover, we show that there is a strong correlation between our model's understanding of multi-view speech and human perception. This characteristic benefits downstream applications such as video compression and streaming, where a significant number of less important frames can be compressed or eliminated while maximally preserving human speech understanding with a good user experience.
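
The abstract reports results in terms of viseme error rate but does not define the metric. The sketch below assumes it is computed by analogy with word error rate, that is, the edit distance between the reference and predicted viseme sequences divided by the reference length. The function names and the viseme labels (`V1`, `V2`, ...) are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def viseme_error_rate(ref, hyp):
    """Assumed definition: edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical viseme labels; one viseme is missing from the prediction.
reference = ["V1", "V2", "V3", "V4"]
predicted = ["V1", "V3", "V4"]
print(viseme_error_rate(reference, predicted))  # 0.25
```

Under this assumed definition, a lower value means the predicted viseme sequence is closer to the reference.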
