Paper Title

OWL (Observe, Watch, Listen): Audiovisual Temporal Context for Localizing Actions in Egocentric Videos

Authors

Merey Ramazanova, Victor Escorcia, Fabian Caba Heilbron, Chen Zhao, Bernard Ghanem

Abstract

Egocentric videos capture sequences of human activities from a first-person perspective and can provide rich multimodal signals. However, most current localization methods use third-person videos and only incorporate visual information. In this work, we take a deep look into the effectiveness of audiovisual context in detecting actions in egocentric videos and introduce a simple-yet-effective approach via Observing, Watching, and Listening (OWL). OWL leverages audiovisual information and context for egocentric temporal action localization (TAL). We validate our approach on two large-scale datasets, EPIC-Kitchens and HOMAGE. Extensive experiments demonstrate the relevance of the audiovisual temporal context. Namely, we boost the localization performance (mAP) over visual-only models by +2.23% and +3.35% in the above datasets.
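The abstract describes combining audiovisual information with temporal context for TAL. As a rough illustration of what "fusing modalities and adding temporal context" can mean at the feature level, the sketch below concatenates per-snippet visual and audio features and pools over neighboring snippets. This is a generic late-fusion baseline under assumed feature shapes, not the OWL architecture itself; all function names and dimensions here are hypothetical.

```python
import numpy as np

# Illustrative sketch only: a generic late-fusion baseline for audiovisual
# temporal action localization. This is NOT the paper's OWL model; the
# fusion scheme, window size, and feature dimensions are assumptions.

def fuse_features(visual_feats: np.ndarray, audio_feats: np.ndarray) -> np.ndarray:
    """Concatenate per-snippet visual and audio features along the channel axis.

    visual_feats: (T, Dv), audio_feats: (T, Da) -> fused: (T, Dv + Da)
    """
    assert visual_feats.shape[0] == audio_feats.shape[0]
    return np.concatenate([visual_feats, audio_feats], axis=1)

def add_temporal_context(feats: np.ndarray, window: int = 2) -> np.ndarray:
    """Average-pool each snippet with its neighbors to inject temporal context."""
    T = feats.shape[0]
    out = np.empty_like(feats)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        out[t] = feats[lo:hi].mean(axis=0)
    return out

# Toy example: 10 snippets with 8-dim visual and 4-dim audio features.
rng = np.random.default_rng(0)
v = rng.standard_normal((10, 8))
a = rng.standard_normal((10, 4))
fused = add_temporal_context(fuse_features(v, a))
print(fused.shape)  # (10, 12)
```

A downstream TAL head would then score action segments on these context-enriched fused features; the reported +2.23% / +3.35% mAP gains come from the paper's actual model, not from this toy pipeline.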
