Paper Title

Temporal Representation Learning on Monocular Videos for 3D Human Pose Estimation

Paper Authors

Sina Honari, Victor Constantin, Helge Rhodin, Mathieu Salzmann, Pascal Fua

Paper Abstract

In this paper, we propose an unsupervised feature extraction method that captures temporal information from monocular videos: we detect and encode the subject of interest in each frame and leverage contrastive self-supervised (CSS) learning to extract rich latent vectors. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally distant frames as negative pairs, as other CSS approaches do, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying a contrastive loss only to the time-variant features, while encouraging a gradual transition on them between nearby and distant frames and also reconstructing the input, extracts rich temporal features that are well-suited to human pose estimation. Our approach reduces error by about 50% compared to standard CSS strategies, outperforms other unsupervised single-view methods, and matches the performance of multi-view techniques. When 2D poses are available, our approach can extract even richer latent features and improve 3D pose estimation accuracy, outperforming other state-of-the-art weakly supervised methods.
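The key idea described above, splitting each frame's latent vector into a time-variant and a time-invariant part and applying the contrastive loss only to the time-variant part, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch version that assumes an InfoNCE-style loss, a latent split by index, and a single negative per anchor; the paper's actual encoder, its gradual-transition term, and its reconstruction decoder are not reproduced here.

```python
import torch
import torch.nn.functional as F

def disentangled_css_loss(z_a, z_b, z_neg, split, temperature=0.1):
    """Contrastive loss restricted to the time-variant part of the latent.

    Hypothetical layout (not specified by the paper): z[:, :split] is the
    time-variant component, z[:, split:] the time-invariant one.

    z_a, z_b : latents of two nearby frames (positive pair), shape (B, D)
    z_neg    : latents of temporally distant frames (negatives), shape (B, D)
    """
    # Contrast only the time-variant components; the time-invariant half
    # is left free to encode static content and would be trained by the
    # reconstruction loss instead.
    v_a, v_b, v_n = z_a[:, :split], z_b[:, :split], z_neg[:, :split]

    # InfoNCE with one negative per anchor: the nearby-frame pair should
    # score higher than the anchor/distant-frame pair.
    sim_pos = F.cosine_similarity(v_a, v_b) / temperature  # (B,)
    sim_neg = F.cosine_similarity(v_a, v_n) / temperature  # (B,)
    logits = torch.stack([sim_pos, sim_neg], dim=1)        # (B, 2)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)  # positive pair is class 0

# Toy usage with random 128-dimensional latents, split in half.
z_a, z_b, z_neg = (torch.randn(8, 128) for _ in range(3))
loss = disentangled_css_loss(z_a, z_b, z_neg, split=64)
```

In training, this term would be combined with a reconstruction loss computed from the full latent (both components), so that the time-invariant half still carries the information needed to reproduce the input frame.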
