Paper Title
Video + CLIP Baseline for Ego4D Long-term Action Anticipation
Paper Authors
Paper Abstract
In this report, we introduce our adaptation of image-text models for long-term action anticipation. Our Video + CLIP framework makes use of two large-scale pre-trained models: the paired image-text model CLIP and the SlowFast video encoder. The CLIP embedding provides a fine-grained understanding of the objects relevant to an action, whereas the SlowFast network is responsible for modeling temporal information within a video clip of a few frames. We show that the features obtained from the two encoders are complementary to each other, and thus outperform the Ego4D baseline on the task of long-term action anticipation. Our code is available at github.com/srijandas07/clip_baseline_LTA_Ego4d.
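To make the idea concrete, below is a minimal, hypothetical sketch of how per-frame CLIP embeddings and a clip-level SlowFast feature could be fused and fed to per-step verb/noun heads for long-term action anticipation. The feature dimensions, concatenation-based fusion, temporal pooling, head structure, and class counts are illustrative assumptions, not the authors' released implementation (see the repository above for that).

```python
# Hedged sketch (not the released code): fusing per-frame CLIP embeddings with a
# clip-level SlowFast feature, then predicting verb/noun labels for future steps.
# All dimensions and class counts below are placeholder assumptions.
import torch
import torch.nn as nn


class VideoCLIPFusionHead(nn.Module):
    def __init__(self, clip_dim=512, slowfast_dim=2304, hidden_dim=1024,
                 num_verbs=115, num_nouns=478, num_future_actions=20):
        super().__init__()
        # Project the concatenated CLIP + SlowFast features into a joint space.
        self.fuse = nn.Sequential(
            nn.Linear(clip_dim + slowfast_dim, hidden_dim),
            nn.ReLU(inplace=True),
        )
        # One verb and one noun classifier per anticipated future action.
        self.verb_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_verbs) for _ in range(num_future_actions)])
        self.noun_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_nouns) for _ in range(num_future_actions)])

    def forward(self, clip_feats, slowfast_feats):
        # clip_feats: (B, T, clip_dim) per-frame CLIP image embeddings.
        # slowfast_feats: (B, slowfast_dim) clip-level SlowFast feature.
        clip_pooled = clip_feats.mean(dim=1)  # simple temporal average pooling
        fused = self.fuse(torch.cat([clip_pooled, slowfast_feats], dim=-1))
        verb_logits = [head(fused) for head in self.verb_heads]
        noun_logits = [head(fused) for head in self.noun_heads]
        return verb_logits, noun_logits


if __name__ == "__main__":
    # Random tensors stand in for the outputs of the CLIP and SlowFast encoders.
    model = VideoCLIPFusionHead()
    clip_feats = torch.randn(2, 8, 512)    # e.g. 8 frames encoded by CLIP ViT-B/32
    slowfast_feats = torch.randn(2, 2304)  # e.g. SlowFast-R50 backbone feature
    verbs, nouns = model(clip_feats, slowfast_feats)
    print(len(verbs), verbs[0].shape)      # 20 torch.Size([2, 115])
```

In this sketch the complementarity claimed in the abstract is realized by simple feature concatenation before a shared projection; other fusion choices (late fusion of logits, attention over frames) would fit the same interface.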