Paper Title

Self-supervised Transformer for Deepfake Detection

Authors

Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Weiming Zhang, Nenghai Yu

Abstract

The fast evolution and widespread use of deepfake techniques in real-world scenarios require stronger generalization abilities of face forgery detectors. Some works capture features that are unrelated to method-specific artifacts, such as clues from blending boundaries and accumulated up-sampling, to strengthen generalization ability. However, the effectiveness of these methods can be easily corrupted by post-processing operations such as compression. Inspired by transfer learning, neural networks pre-trained on other large-scale face-related tasks may provide useful features for deepfake detection. For example, lip movement has been proven to be a robust, well-transferring high-level semantic feature, which can be learned from the lip-reading task. However, existing methods pre-train the lip feature extraction model in a supervised manner, which requires substantial human effort for data annotation and increases the difficulty of obtaining training data. In this paper, we propose a self-supervised, transformer-based audio-visual contrastive learning method. The proposed method learns mouth motion representations by encouraging paired video and audio representations to be close while pushing unpaired ones apart. After pre-training with our method, the model is partially fine-tuned for the deepfake detection task. Extensive experiments show that our self-supervised method performs comparably to, or even better than, its supervised pre-training counterpart.
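To make the pre-training objective concrete, below is a minimal sketch (not the authors' released code) of audio-visual contrastive learning in PyTorch: paired video and audio clips from the same batch are pulled together and unpaired combinations pushed apart with a symmetric InfoNCE-style loss. The encoder modules, embedding dimension, and temperature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVContrastive(nn.Module):
    """Contrastive pre-training over paired video/audio clips (illustrative sketch)."""

    def __init__(self, video_encoder: nn.Module, audio_encoder: nn.Module,
                 temperature: float = 0.07):
        super().__init__()
        self.video_encoder = video_encoder   # e.g. a transformer over mouth-region frames (assumed)
        self.audio_encoder = audio_encoder   # e.g. a transformer over audio spectrograms (assumed)
        self.temperature = temperature

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # Encode each modality and L2-normalize the clip-level embeddings.
        v = F.normalize(self.video_encoder(video), dim=-1)   # (B, D)
        a = F.normalize(self.audio_encoder(audio), dim=-1)   # (B, D)

        # Similarity matrix of every video against every audio clip in the batch;
        # the diagonal entries are the true (paired) combinations.
        logits = v @ a.t() / self.temperature                # (B, B)
        targets = torch.arange(v.size(0), device=v.device)

        # Symmetric InfoNCE: video-to-audio and audio-to-video directions.
        loss_v2a = F.cross_entropy(logits, targets)
        loss_a2v = F.cross_entropy(logits.t(), targets)
        return (loss_v2a + loss_a2v) / 2
```

In the paper, the visual branch operates on the mouth region and the audio branch on the corresponding speech, so the learned representation captures mouth motion; here generic encoder modules stand in for those details, and for deepfake detection the pre-trained visual branch would then be partially fine-tuned with a classification head.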
