Title
End-to-End Semantic Video Transformer for Zero-Shot Action Recognition
Authors
Abstract
While video action recognition has been an active area of research for several years, zero-shot action recognition has only recently started gaining traction. In this work, we propose a novel end-to-end trained transformer model that efficiently captures long-range spatiotemporal dependencies, in contrast to existing approaches that use 3D-CNNs. Moreover, to address a common ambiguity in existing works about which classes can be considered previously unseen, we propose a new experimental setup that satisfies the zero-shot learning premise for action recognition by avoiding overlap between the training and testing classes. The proposed approach significantly outperforms the state of the art in zero-shot action recognition in terms of top-1 accuracy on the UCF-101, HMDB-51, and ActivityNet datasets. The code and proposed experimental setup are available on GitHub: https://github.com/Secure-and-Intelligent-Systems-Lab/SemanticVideoTransformer
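The abstract does not spell out the zero-shot mechanics, but the usual formulation, consistent with the "semantic" framing in the title, is to map a video to a semantic embedding space and classify it by its nearest class-label embedding, with training and testing class sets kept strictly disjoint. The following is a minimal sketch of that premise, not the paper's implementation; the function names, toy embeddings, and class lists are illustrative assumptions.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def predict_unseen_class(video_embedding: np.ndarray,
                         class_embeddings: dict[str, np.ndarray]) -> str:
    """Predict the unseen action class whose semantic (label) embedding
    is closest to the video embedding in cosine similarity."""
    return max(class_embeddings,
               key=lambda name: cosine_similarity(video_embedding,
                                                  class_embeddings[name]))


def check_disjoint_split(train_classes: set[str],
                         test_classes: set[str]) -> None:
    """Sanity check for the zero-shot premise: training and testing
    class sets must not overlap."""
    overlap = train_classes & test_classes
    if overlap:
        raise ValueError(f"Zero-shot premise violated; overlapping classes: {overlap}")


if __name__ == "__main__":
    # Hypothetical class names and 4-dim toy embeddings standing in for
    # the video encoder output and word embeddings of class labels.
    check_disjoint_split({"archery", "bowling"}, {"surfing", "diving"})
    class_embeddings = {
        "surfing": np.array([0.9, 0.1, 0.0, 0.2]),
        "diving":  np.array([0.1, 0.8, 0.3, 0.0]),
    }
    video_embedding = np.array([0.85, 0.15, 0.05, 0.1])  # stand-in encoder output
    print(predict_unseen_class(video_embedding, class_embeddings))  # -> "surfing"
```

The `check_disjoint_split` helper reflects the experimental setup the abstract proposes: if any test class also appears in training, the evaluation no longer measures zero-shot generalization.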