Paper Title
Pseudo-Convolutional Policy Gradient for Sequence-to-Sequence Lip-Reading
Authors
Abstract
Lip-reading aims to infer speech content from a sequence of lip movements and can be seen as a typical sequence-to-sequence (seq2seq) problem that translates an input image sequence of lip movements into a text sequence of the speech content. However, the traditional learning process of seq2seq models suffers from two problems: the exposure bias resulting from the "teacher-forcing" strategy, and the inconsistency between the discriminative optimization target (usually the cross-entropy loss) and the final evaluation metric (usually the character/word error rate). In this paper, we propose a novel pseudo-convolutional policy gradient (PCPG) based method to address these two problems. On the one hand, we introduce the evaluation metric (the character error rate in this paper) as a form of reward to optimize the model together with the original discriminative target. On the other hand, inspired by the local perception property of the convolutional operation, we perform a pseudo-convolutional operation on the reward and loss dimension, so as to take more context around each time step into account and generate a robust reward and loss for the whole optimization. Finally, we perform a thorough comparison and evaluation on both word-level and sentence-level benchmarks. The results show a significant improvement over other related methods and achieve either new state-of-the-art performance or competitive accuracy on all of these challenging benchmarks, demonstrating the advantages of our approach.
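The two ideas in the abstract, a character-error-rate reward and a local window applied along the time dimension, can be illustrated with a small sketch. This is not the paper's implementation: the per-step reward definition (the drop in CER as the hypothesis prefix grows), the helper names, and the kernel weights are all assumptions chosen for illustration, since the abstract does not specify them.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def stepwise_rewards(ref, hyp):
    """Assumed reward at step t: how much CER drops when the
    hypothesis prefix is extended by the t-th character."""
    rewards = []
    prev_cer = cer(ref, "")
    for t in range(1, len(hyp) + 1):
        cur_cer = cer(ref, hyp[:t])
        rewards.append(prev_cer - cur_cer)
        prev_cer = cur_cer
    return rewards


def windowed_sum(values, kernel):
    """Convolution-like weighted sum over a window centered on each
    time step, with zero padding at the sequence boundaries.
    Assumes an odd-length kernel; the weights are hypothetical."""
    half = len(kernel) // 2
    out = []
    for t in range(len(values)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(values):
                acc += w * values[idx]
        out.append(acc)
    return out


if __name__ == "__main__":
    rewards = stepwise_rewards("hello", "hallo")
    smoothed = windowed_sum(rewards, [0.25, 0.5, 0.25])
    print(rewards)
    print(smoothed)
```

In a policy-gradient setup, values like `smoothed` would weight the log-probabilities of the sampled characters, so that each time step is credited with the outcome of its neighbors as well as its own. How the paper combines this with the cross-entropy loss is not stated in the abstract and is left out of this sketch.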