Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning

Paper title

Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning

Authors

Yuxiao Chen, Long Zhao, Jianbo Yuan, Yu Tian, Zhaoyang Xia, Shijie Geng, Ligong Han, Dimitris N. Metaxas

Abstract

Despite the success of fully-supervised human skeleton sequence modeling, utilizing self-supervised pre-training for skeleton sequence representation learning has been an active field because acquiring task-specific skeleton annotations at large scales is difficult. Recent studies focus on learning video-level temporal and discriminative information using contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. Different from such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS), to explicitly capture spatial, short-term, and long-term temporal dependencies at frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments covering three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model in the pre-training stage has strong transfer capability for different downstream tasks.
