Paper Title
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition
Paper Authors
Paper Abstract
Human action recognition has recently become one of the most popular research topics in the computer vision community. Various 3D-CNN-based methods have been presented to tackle both the spatial and temporal dimensions of the video action recognition task, with competitive results. However, these methods suffer from fundamental limitations in robustness and generalization, e.g., how does the temporal ordering of video frames affect the recognition results? This work presents DirecFormer, a novel end-to-end Transformer-based Directed Attention framework for robust action recognition. The method takes a simple but novel perspective on Transformer-based approaches to learn the right order of a sequence of actions. The contributions of this work are therefore three-fold. First, we introduce the problem of ordered temporal learning to action recognition. Second, a new Directed Attention mechanism is introduced to understand and attend to human actions in the right order. Third, we introduce conditional dependency into action sequence modeling, covering both orders and classes. Compared with recent action recognition methods, the proposed approach consistently achieves state-of-the-art (SOTA) results on three standard large-scale benchmarks, i.e., Jester, Kinetics-400, and Something-Something-V2.
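The abstract does not spell out how the Directed Attention mechanism is computed, so the following is only a minimal sketch of one plausible interpretation: standard scaled dot-product self-attention over video frames, with a directional (lower-triangular) mask so that each frame attends only to itself and earlier frames, making the attention weights sensitive to temporal order. The function name `directed_attention` and the masking choice are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def directed_attention(q, k, v):
    """Illustrative order-aware self-attention over T frames.

    A lower-triangular mask restricts frame t to attend only to
    frames <= t, so shuffling the frame order changes the output.
    q, k, v: arrays of shape (T, d).
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # (T, T) similarities
    mask = np.tril(np.ones((T, T), dtype=bool))   # directional mask
    scores = np.where(mask, scores, -np.inf)      # block future frames
    w = softmax(scores, axis=-1)                  # rows sum to 1
    return w @ v, w

rng = np.random.default_rng(0)
T, d = 4, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out, w = directed_attention(q, k, v)
print(w.round(3))  # strictly upper-triangular entries are zero
```

In unmasked self-attention the weight matrix is permutation-equivariant in the frame axis, which is one way to read the abstract's concern that temporal ordering can be ignored; the mask above is one simple way to break that symmetry.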