Title
End-to-End Semantic Video Transformer for Zero-Shot Action Recognition
Authors
Abstract
While video action recognition has been an active area of research for several years, zero-shot action recognition has only recently started gaining traction. In this work, we propose a novel end-to-end trained transformer model that efficiently captures long-range spatiotemporal dependencies, in contrast to existing approaches that use 3D-CNNs. Moreover, to address a common ambiguity in existing works about which classes can be considered previously unseen, we propose a new experimental setup that satisfies the zero-shot learning premise for action recognition by avoiding overlap between the training and testing classes. The proposed approach significantly outperforms the state of the art in zero-shot action recognition in terms of top-1 accuracy on the UCF-101, HMDB-51, and ActivityNet datasets. The code and proposed experimental setup are available on GitHub: https://github.com/Secure-and-Intelligent-Systems-Lab/SemanticVideoTransformer
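The abstract does not spell out the zero-shot mechanics, but the usual formulation, consistent with the "semantic" framing in the title, is to map a video to a semantic embedding space and classify it by its nearest class-label embedding, with training and testing class sets kept strictly disjoint. The following is a minimal sketch of that premise, not the paper's implementation; the function names, toy embeddings, and class lists are illustrative assumptions.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def predict_unseen_class(video_embedding: np.ndarray,
                         class_embeddings: dict[str, np.ndarray]) -> str:
    """Predict the unseen action class whose semantic (label) embedding
    is closest to the video embedding in cosine similarity."""
    return max(class_embeddings,
               key=lambda name: cosine_similarity(video_embedding,
                                                  class_embeddings[name]))


def check_disjoint_split(train_classes: set[str],
                         test_classes: set[str]) -> None:
    """Sanity check for the zero-shot premise: training and testing
    class sets must not overlap."""
    overlap = train_classes & test_classes
    if overlap:
        raise ValueError(f"Zero-shot premise violated; overlapping classes: {overlap}")


if __name__ == "__main__":
    # Hypothetical class names and 4-dim toy embeddings standing in for
    # the video encoder output and word embeddings of class labels.
    check_disjoint_split({"archery", "bowling"}, {"surfing", "diving"})
    class_embeddings = {
        "surfing": np.array([0.9, 0.1, 0.0, 0.2]),
        "diving":  np.array([0.1, 0.8, 0.3, 0.0]),
    }
    video_embedding = np.array([0.85, 0.15, 0.05, 0.1])  # stand-in encoder output
    print(predict_unseen_class(video_embedding, class_embeddings))  # -> "surfing"
```

The `check_disjoint_split` helper reflects the experimental setup the abstract proposes: if any test class also appears in training, the evaluation no longer measures zero-shot generalization.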