Paper Title
Egocentric Video Task Translation
Paper Authors
Paper Abstract
Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another). However, in wearable cameras, the immersive egocentric perspective of a person engaging with the world around them presents an interconnected web of video understanding tasks -- hand-object manipulations, navigation through space, or human-human interactions -- that unfold continuously, driven by the person's goals. We argue that this calls for a much more unified approach. We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once. Unlike traditional transfer or multi-task learning, EgoT2's flipped design entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition. Demonstrating our model on a wide array of video tasks from Ego4D, we show its advantages over existing transfer paradigms and achieve top-ranked results on four of the Ego4D 2022 benchmark challenges.
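To make the "flipped" design concrete, below is a minimal, hypothetical PyTorch sketch of the idea described in the abstract: several task-specific backbones are kept frozen, and a single shared translator module fuses their outputs into a prediction for a target task. The class and parameter names (`TaskTranslator`, `feat_dim`, `num_classes`) and the specific layer choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TaskTranslator(nn.Module):
    """Sketch of the EgoT2 concept: frozen task-specific backbones emit
    per-task feature tokens; a shared transformer "translates" them into
    a prediction for one target task. Sizes/layers are illustrative."""

    def __init__(self, backbones, feat_dim=256, num_classes=10, num_layers=2):
        super().__init__()
        # Backbones were each optimized on a separate task and stay frozen.
        self.backbones = nn.ModuleList(backbones)
        for p in self.backbones.parameters():
            p.requires_grad = False
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.translator = nn.TransformerEncoder(layer, num_layers=num_layers)  # shared across tasks
        self.head = nn.Linear(feat_dim, num_classes)  # target-task prediction head

    def forward(self, video):
        # One feature token per task-specific model: (batch, num_tasks, feat_dim).
        tokens = torch.stack([b(video) for b in self.backbones], dim=1)
        fused = self.translator(tokens)
        # Pool over task tokens and predict for the target task.
        return self.head(fused.mean(dim=1))
```

In this sketch only the translator and the prediction head are trained, which is what lets heterogeneous, separately trained models contribute to a target task without the task competition that joint multi-task training can introduce.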