Paper Title
Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling
Paper Authors
Paper Abstract
Transformer-based models have achieved top performance on major video recognition benchmarks. Benefiting from the self-attention mechanism, these models show a stronger ability to model long-range dependencies than CNN-based models. However, the significant computation overhead, resulting from the quadratic complexity of self-attention over a tremendous number of tokens, limits the use of existing video Transformers in applications with limited resources, such as mobile devices. In this paper, we extend Mobile-Former to Video Mobile-Former, which decouples the video architecture into a lightweight 3D-CNN for local context modeling and Transformer modules for global interaction modeling in a parallel fashion. To avoid the significant computational cost of computing self-attention among the large number of local patches in a video, we propose to use very few global tokens (e.g., 6) for the whole video in the Transformer modules, which exchange information with the 3D-CNN via a cross-attention mechanism. Through efficient global spatial-temporal modeling, Video Mobile-Former significantly improves the video recognition performance over alternative lightweight baselines and outperforms other efficient CNN-based models in the low-FLOP regime from 500M to 6G total FLOPs on various video recognition tasks. Notably, Video Mobile-Former is the first Transformer-based video model that constrains the computational budget to within 1G FLOPs.
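The efficiency idea described in the abstract, a handful of global tokens that exchange information with the flattened 3D-CNN feature map through cross-attention, can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch illustration of that token-feature exchange, not the authors' implementation; the module name `GlobalTokenCrossAttention`, the channel dimension, and the use of `nn.MultiheadAttention` are assumptions made only for this example.

```python
# Minimal sketch (assumption, not the authors' code): a few global tokens
# attend to the flattened 3D-CNN features of a whole video clip, and the
# local features attend back to the updated tokens.
import torch
import torch.nn as nn


class GlobalTokenCrossAttention(nn.Module):
    def __init__(self, dim: int = 128, num_tokens: int = 6, num_heads: int = 4):
        super().__init__()
        # Learnable global tokens shared across the whole video clip.
        self.tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))
        # Tokens query the local 3D-CNN features (local -> global).
        self.to_tokens = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Local features query the updated tokens (global -> local).
        self.to_features = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat_3d: torch.Tensor):
        # feat_3d: (B, C, T, H, W) output of a lightweight 3D-CNN block.
        b, c, t, h, w = feat_3d.shape
        # Flatten all spatio-temporal positions into a token sequence.
        local = feat_3d.flatten(2).transpose(1, 2)        # (B, T*H*W, C)
        tokens = self.tokens.expand(b, -1, -1)            # (B, num_tokens, C)

        # Cross-attention cost scales with num_tokens * (T*H*W), not with
        # the square of the number of local patches.
        tokens, _ = self.to_tokens(tokens, local, local)      # update global tokens
        local, _ = self.to_features(local, tokens, tokens)    # inject global context

        feat_3d = local.transpose(1, 2).reshape(b, c, t, h, w)
        return feat_3d, tokens


# Usage example with a dummy video feature map.
x = torch.randn(2, 128, 8, 14, 14)            # (batch, channels, frames, H, W)
block = GlobalTokenCrossAttention(dim=128)
out, tokens = block(x)
print(out.shape, tokens.shape)                # (2, 128, 8, 14, 14), (2, 6, 128)
```

Because the number of global tokens is fixed (e.g., 6), the attention cost grows linearly with the number of spatio-temporal positions rather than quadratically, which is the property the abstract relies on to keep the total budget within roughly 1G FLOPs.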