Paper Title

AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition

Paper Authors

Yulin Wang, Yang Yue, Xinhong Xu, Ali Hassani, Victor Kulikov, Nikita Orlov, Shiji Song, Humphrey Shi, Gao Huang

Paper Abstract

Recent research has revealed that reducing temporal redundancy and reducing spatial redundancy are both effective approaches towards efficient video recognition, e.g., allocating the majority of computation to a task-relevant subset of frames or to the most valuable image regions of each frame. However, in most existing works, either type of redundancy is typically modeled with the other absent. This paper explores the unified formulation of spatial-temporal dynamic computation on top of the recently proposed AdaFocusV2 algorithm, contributing to an improved AdaFocusV3 framework. Our method reduces the computational cost by activating the expensive high-capacity network only on some small but informative 3D video cubes. These cubes are cropped from the space formed by frame height, frame width, and video duration, while their locations are adaptively determined with a lightweight policy network on a per-sample basis. At test time, the number of cubes corresponding to each video is dynamically configured, i.e., video cubes are processed sequentially until a sufficiently reliable prediction is produced. Notably, AdaFocusV3 can be effectively trained by approximating the non-differentiable cropping operation with the interpolation of deep features. Extensive empirical results on six benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1 & V2, and Diving48) demonstrate that our model is considerably more efficient than competitive baselines.
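The abstract describes two mechanisms: a lightweight policy network that selects small 3D video cubes, which an expensive network then processes sequentially until the prediction is confident enough, and a differentiable approximation of the cropping operation (interpolation of deep features) used for training. The sketch below illustrates the first idea. All module names, network shapes, cube sizes, and the confidence threshold are illustrative assumptions based only on the abstract, not the authors' released implementation.

```python
# A minimal sketch (not the paper's code): cheap global glance -> policy picks cube
# centers -> expensive network processes cubes one by one with an early exit.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaFocusV3Sketch(nn.Module):
    def __init__(self, num_classes=200, cube_size=(8, 96, 96),
                 max_cubes=4, threshold=0.9):
        super().__init__()
        self.cube_size = cube_size    # (frames, height, width) of each 3D cube
        self.max_cubes = max_cubes    # upper bound on cubes processed per video
        self.threshold = threshold    # early-exit confidence threshold (assumed value)
        # Cheap global encoder: summarizes the whole video for the policy.
        self.global_encoder = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        # Lightweight policy head: predicts normalized (t, y, x) cube centers.
        self.policy = nn.Linear(16, max_cubes * 3)
        # Expensive high-capacity network, applied only to the small cubes.
        self.local_encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.classifier = nn.Linear(64, num_classes)

    def crop_cube(self, video, center):
        """Hard-crop one cube per sample around a normalized (t, y, x) center."""
        B, _, T, H, W = video.shape
        t, h, w = self.cube_size
        ct = (center[:, 0] * (T - t)).long()
        cy = (center[:, 1] * (H - h)).long()
        cx = (center[:, 2] * (W - w)).long()
        return torch.stack([
            video[b, :, ct[b]:ct[b] + t, cy[b]:cy[b] + h, cx[b]:cx[b] + w]
            for b in range(B)])

    @torch.no_grad()
    def forward(self, video):
        """video: (B, 3, T, H, W). Cubes are processed until the prediction is reliable."""
        B = video.size(0)
        centers = torch.sigmoid(self.policy(self.global_encoder(video)))
        centers = centers.view(B, self.max_cubes, 3)
        logits_sum = video.new_zeros(B, self.classifier.out_features)
        for k in range(self.max_cubes):
            cube = self.crop_cube(video, centers[:, k])
            logits_sum = logits_sum + self.classifier(self.local_encoder(cube))
            confidence = F.softmax(logits_sum / (k + 1), dim=1).max(dim=1).values
            if confidence.min() >= self.threshold:
                break    # sufficiently reliable prediction: stop early
        return logits_sum / (k + 1)


model = AdaFocusV3Sketch()
dummy = torch.randn(2, 3, 16, 160, 160)    # two clips: 16 frames at 160x160
print(model(dummy).shape)                  # torch.Size([2, 200])
```

Training requires gradients to flow back to the predicted cube locations, which the hard integer crop above does not allow. One way to realize the abstract's "approximating the non-differentiable cropping operation with interpolation" is a grid_sample-based soft crop, as sketched below; again, the function soft_crop_3d and its coordinate convention are assumptions for illustration, not the paper's exact formulation.

```python
# A minimal sketch of a differentiable 3D crop via interpolation (grid_sample).
import torch
import torch.nn.functional as F


def soft_crop_3d(video, center, out_size):
    """Interpolate a (t, h, w) cube around `center`.

    video:  (B, C, T, H, W) frames or deep feature maps.
    center: (B, 3) normalized (t, y, x) cube centers in [0, 1]; gradients flow
            back to `center` through grid_sample, unlike a hard integer crop.
    """
    B, C, T, H, W = video.shape
    t, h, w = out_size
    theta = video.new_zeros(B, 3, 4)        # one affine matrix per sample
    theta[:, 0, 0] = w / W                  # x (width) scale of the cube
    theta[:, 1, 1] = h / H                  # y (height) scale
    theta[:, 2, 2] = t / T                  # z (time) scale
    # Map centers from [0, 1] to grid_sample's [-1, 1] range, keeping the cube inside.
    theta[:, 0, 3] = (2 * center[:, 2] - 1) * (1 - w / W)
    theta[:, 1, 3] = (2 * center[:, 1] - 1) * (1 - h / H)
    theta[:, 2, 3] = (2 * center[:, 0] - 1) * (1 - t / T)
    grid = F.affine_grid(theta, (B, C, t, h, w), align_corners=False)
    return F.grid_sample(video, grid, mode='bilinear', align_corners=False)


video = torch.randn(2, 3, 16, 160, 160)
center = torch.rand(2, 3, requires_grad=True)    # e.g., output of the policy network
cube = soft_crop_3d(video, center, (8, 96, 96))
cube.mean().backward()
print(cube.shape, center.grad.shape)             # (2, 3, 8, 96, 96) and (2, 3)
```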
