Paper Title

Temporal Aggregate Representations for Long-Range Video Understanding

Authors

Fadime Sener, Dipika Singhania, Angela Yao

Abstract

Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that it is possible to achieve state of the art in both next action and dense anticipation with simple techniques such as max-pooling and attention. To demonstrate the anticipation capabilities of our model, we conduct experiments on Breakfast, 50Salads, and EPIC-Kitchens datasets, where we achieve state-of-the-art results. With minimal modifications, our model can also be extended for video segmentation and action recognition.
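To make the aggregation idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation): pre-extracted frame features are max-pooled over several recent temporal spans of different lengths, the per-span summaries are fused with attention across scales, and a linear layer predicts the next action. The class name `MultiGranularAggregator`, the span lengths, and the feature/class dimensions are illustrative placeholders, not values taken from the paper.

```python
# Hypothetical sketch of multi-granular temporal aggregation with
# max-pooling and attention; illustrative only, not the paper's code.
import torch
import torch.nn as nn


class MultiGranularAggregator(nn.Module):
    """Max-pools frame features over spans of different lengths and
    fuses the per-span summaries with multi-head attention."""

    def __init__(self, feat_dim=400, num_classes=48, span_lengths=(10, 20, 30)):
        super().__init__()
        self.span_lengths = span_lengths
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames):
        # frames: (batch, time, feat_dim) pre-extracted frame features
        summaries = []
        for span in self.span_lengths:
            # Max-pool over the most recent `span` frames (one temporal scale).
            recent = frames[:, -span:, :]
            summaries.append(recent.max(dim=1).values)
        spans = torch.stack(summaries, dim=1)      # (batch, n_spans, feat_dim)
        fused, _ = self.attn(spans, spans, spans)  # attention across scales
        return self.classifier(fused.mean(dim=1))  # next-action logits


if __name__ == "__main__":
    model = MultiGranularAggregator()
    clip = torch.randn(2, 60, 400)  # 2 videos, 60 observed frames each
    print(model(clip).shape)        # torch.Size([2, 48])
```

Pooling over several span lengths gives the model both short- and long-range summaries of the observed video, which is the kind of multi-scale temporal context the abstract refers to; the attention step then lets the predictor weight those scales per example.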
