Paper Title
Video + CLIP Baseline for Ego4D Long-term Action Anticipation
Paper Authors
Paper Abstract
In this report, we introduce our adaptation of image-text models for long-term action anticipation. Our Video + CLIP framework makes use of two large-scale pre-trained models: the paired image-text model CLIP and the SlowFast video encoder. The CLIP embedding provides a fine-grained understanding of the objects relevant to an action, whereas the SlowFast network is responsible for modeling temporal information within a video clip of a few frames. We show that the features obtained from the two encoders are complementary to each other, and thus outperform the Ego4D baseline on the task of long-term action anticipation. Our code is available at github.com/srijandas07/clip_baseline_LTA_Ego4d.
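To make the idea concrete, below is a minimal, hypothetical sketch of how per-frame CLIP embeddings and a clip-level SlowFast feature could be fused and fed to per-step verb/noun heads for long-term action anticipation. The feature dimensions, concatenation-based fusion, temporal pooling, head structure, and class counts are illustrative assumptions, not the authors' released implementation (see the repository above for that).

```python
# Hedged sketch (not the released code): fusing per-frame CLIP embeddings with a
# clip-level SlowFast feature, then predicting verb/noun labels for future steps.
# All dimensions and class counts below are placeholder assumptions.
import torch
import torch.nn as nn


class VideoCLIPFusionHead(nn.Module):
    def __init__(self, clip_dim=512, slowfast_dim=2304, hidden_dim=1024,
                 num_verbs=115, num_nouns=478, num_future_actions=20):
        super().__init__()
        # Project the concatenated CLIP + SlowFast features into a joint space.
        self.fuse = nn.Sequential(
            nn.Linear(clip_dim + slowfast_dim, hidden_dim),
            nn.ReLU(inplace=True),
        )
        # One verb and one noun classifier per anticipated future action.
        self.verb_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_verbs) for _ in range(num_future_actions)])
        self.noun_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_nouns) for _ in range(num_future_actions)])

    def forward(self, clip_feats, slowfast_feats):
        # clip_feats: (B, T, clip_dim) per-frame CLIP image embeddings.
        # slowfast_feats: (B, slowfast_dim) clip-level SlowFast feature.
        clip_pooled = clip_feats.mean(dim=1)  # simple temporal average pooling
        fused = self.fuse(torch.cat([clip_pooled, slowfast_feats], dim=-1))
        verb_logits = [head(fused) for head in self.verb_heads]
        noun_logits = [head(fused) for head in self.noun_heads]
        return verb_logits, noun_logits


if __name__ == "__main__":
    # Random tensors stand in for the outputs of the CLIP and SlowFast encoders.
    model = VideoCLIPFusionHead()
    clip_feats = torch.randn(2, 8, 512)    # e.g. 8 frames encoded by CLIP ViT-B/32
    slowfast_feats = torch.randn(2, 2304)  # e.g. SlowFast-R50 backbone feature
    verbs, nouns = model(clip_feats, slowfast_feats)
    print(len(verbs), verbs[0].shape)      # 20 torch.Size([2, 115])
```

In this sketch the complementarity claimed in the abstract is realized by simple feature concatenation before a shared projection; other fusion choices (late fusion of logits, attention over frames) would fit the same interface.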