Paper Title
IntegralAction: Pose-driven Feature Integration for Robust Human Action Recognition in Videos
Paper Authors
Paper Abstract
Most current action recognition methods rely heavily on appearance information by taking an RGB sequence of entire image regions as input. While effective in exploiting contextual information around humans, e.g., human appearance and scene category, they are easily fooled by out-of-context action videos where the context does not match the target action. In contrast, pose-based methods, which take only a sequence of human skeletons as input, suffer from inaccurate pose estimation or the inherent ambiguity of human pose. Integrating these two approaches has turned out to be non-trivial; training a model with both appearance and pose ends up with a strong bias towards appearance and does not generalize well to unseen videos. To address this problem, we propose to learn pose-driven feature integration that dynamically combines the appearance and pose streams by observing pose features on the fly. The main idea is to let the pose stream decide how much and which appearance information is used in the integration, based on whether the given pose information is reliable. We show that the proposed IntegralAction achieves highly robust performance across in-context and out-of-context action video datasets. The code is available at https://github.com/mks0601/IntegralAction_RELEASE.
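The pose-driven integration described above can be illustrated with a small sketch. This is a minimal PyTorch example of the general idea, not the authors' implementation: it assumes the pose stream emits a channel-wise sigmoid gate that scales the appearance features before fusion, and the module name (PoseDrivenGate), layer structure, and dimensions (pose_dim, app_dim, out_dim) are all hypothetical; see the repository linked above for the actual architecture.

import torch
import torch.nn as nn

class PoseDrivenGate(nn.Module):
    """Sketch of pose-driven feature integration: the pose stream
    produces a channel-wise gate that decides how much appearance
    information enters the fused representation."""

    def __init__(self, pose_dim=512, app_dim=2048, out_dim=512):
        super().__init__()
        # hypothetical layer sizes; the paper's exact dimensions may differ
        self.gate_fc = nn.Linear(pose_dim, app_dim)    # gate computed from pose features
        self.app_proj = nn.Linear(app_dim, out_dim)    # project the gated appearance features
        self.pose_proj = nn.Linear(pose_dim, out_dim)  # project the pose features

    def forward(self, pose_feat, app_feat):
        # pose_feat: (B, pose_dim) pooled pose-stream features
        # app_feat:  (B, app_dim)  pooled appearance-stream features
        gate = torch.sigmoid(self.gate_fc(pose_feat))  # per-channel weights in [0, 1]
        gated_app = gate * app_feat                    # suppress appearance channels the pose stream distrusts
        return self.pose_proj(pose_feat) + self.app_proj(gated_app)

# usage: the fused features would feed a shared action classifier
model = PoseDrivenGate()
fused = model(torch.randn(4, 512), torch.randn(4, 2048))
print(fused.shape)  # torch.Size([4, 512])

The design choice the sketch highlights is that the gate is a function of the pose features alone, so when the estimated pose is unreliable or ambiguous, the model can still fall back on contextual appearance cues, while reliable pose information lets it down-weight misleading context in out-of-context videos.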