Paper Title

Self-supervised Video Representation Learning by Pace Prediction

Paper Authors

Jiangliu Wang, Jianbo Jiao, Yun-Hui Liu

Paper Abstract

This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction. It stems from the observation that the human visual system is sensitive to video pace, e.g., slow motion, a widely used technique in filmmaking. Specifically, given a video played at its natural pace, we randomly sample training clips at different paces and ask a neural network to identify the pace of each video clip. The assumption here is that the network can only succeed in such a pace reasoning task when it understands the underlying video content and learns representative spatio-temporal features. In addition, we further introduce contrastive learning to push the model towards discriminating different paces by maximizing the agreement on similar video content. To validate the effectiveness of the proposed method, we conduct extensive experiments on action recognition and video retrieval tasks with several alternative network architectures. Experimental evaluations show that our approach achieves state-of-the-art performance for self-supervised video representation learning across different network architectures and different benchmarks. The code and pre-trained models are available at https://github.com/laura-wang/video-pace.
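
The pretext task described above lends itself to a short sketch. Below is a minimal PyTorch-style illustration, under stated assumptions, of the pace-prediction idea: sample a clip from a video at a randomly chosen pace (frame-sampling interval) and train a classifier to recognize that pace. All names here (PACES, sample_clip_at_pace, PacePredictor, pace_loss) are hypothetical and not taken from the authors' repository; the backbone can be any 3D CNN mapping a clip batch to a feature vector (e.g., C3D, R3D, or R(2+1)D).

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical pace set: a sampling interval of 1 is the natural pace;
# larger intervals correspond to faster playback of the sampled clip.
PACES = [1, 2, 3, 4]

def sample_clip_at_pace(video, clip_len=16, pace=1):
    # video: tensor of shape (C, T, H, W); take every `pace`-th frame.
    _, t, _, _ = video.shape
    span = clip_len * pace
    start = torch.randint(0, max(t - span, 1), (1,)).item()
    idx = torch.arange(start, start + span, pace).clamp(max=t - 1)
    return video[:, idx]  # (C, clip_len, H, W)

class PacePredictor(nn.Module):
    # Any 3D-CNN backbone followed by a linear pace-classification head.
    def __init__(self, backbone, feat_dim, num_paces=len(PACES)):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, num_paces)

    def forward(self, clip):  # clip: (N, C, clip_len, H, W)
        return self.head(self.backbone(clip))

def pace_loss(model, video):
    # Draw a random pace, sample a clip at that pace, and ask the
    # network to identify which pace was used.
    label = torch.randint(0, len(PACES), (1,))
    clip = sample_clip_at_pace(video, pace=PACES[label.item()]).unsqueeze(0)
    return F.cross_entropy(model(clip), label)

In the full method, this classification objective is further combined with a contrastive loss that maximizes agreement on similar video content, as described in the abstract; the sketch above covers only the pace-prediction component.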
