Paper Title
Human-centric Spatio-Temporal Video Grounding With Visual Transformers
Paper Authors
Paper Abstract
In this work, we introduce a novel task, Human-centric Spatio-Temporal Video Grounding (HC-STVG). Unlike existing referring-expression tasks in images or videos, HC-STVG focuses on humans: it aims to localize the spatio-temporal tube of a target person in an untrimmed video based on a given textual description. This task is useful, especially for healthcare and security-related applications, where surveillance videos can be extremely long but only a specific person during a specific period is of interest. HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization. Unfortunately, existing grounding methods cannot handle this task well. We tackle this task by proposing an effective baseline method named Spatio-Temporal Grounding with Visual Transformers (STGVT), which utilizes visual transformers to extract cross-modal representations for video-sentence matching and temporal localization. To facilitate this task, we also contribute an HC-STVG dataset consisting of 5,660 video-sentence pairs covering complex multi-person scenes. Specifically, each video lasts 20 seconds and is paired with a natural-language query sentence containing 17.25 words on average. Extensive experiments conducted on this dataset demonstrate that the newly proposed method outperforms the existing baseline methods.
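The abstract only names the components of STGVT, so the following is a minimal sketch of the general pattern it describes: tube-level video features and query-word embeddings are fused by a transformer encoder, whose outputs drive (a) a video-sentence matching score and (b) per-frame temporal-localization logits. All module names, feature dimensions, and prediction heads here are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal cross-modal transformer sketch for tube-sentence grounding.
# Assumptions: 2048-d per-frame tube features (e.g., from a CNN backbone),
# a toy vocabulary, and a [CLS]-style token for the matching score.
# Positional encodings are omitted for brevity.
import torch
import torch.nn as nn


class CrossModalGroundingSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4, vocab_size=10000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(2048, d_model)   # project tube features
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.match_head = nn.Linear(d_model, 1)      # video-sentence matching
        self.temporal_head = nn.Linear(d_model, 2)   # per-frame start/end logits

    def forward(self, tube_feats, token_ids):
        # tube_feats: (B, T, 2048) per-frame features of one candidate tube
        # token_ids:  (B, L) query-sentence token indices
        B, T, _ = tube_feats.shape
        v = self.video_proj(tube_feats)              # (B, T, d)
        w = self.word_emb(token_ids)                 # (B, L, d)
        cls = self.cls_token.expand(B, -1, -1)       # (B, 1, d)
        x = self.encoder(torch.cat([cls, v, w], dim=1))
        match_score = self.match_head(x[:, 0])       # (B, 1): tube-sentence match
        temporal = self.temporal_head(x[:, 1:T + 1]) # (B, T, 2): boundary logits
        return match_score, temporal


if __name__ == "__main__":
    model = CrossModalGroundingSketch()
    score, bounds = model(torch.randn(2, 20, 2048),
                          torch.randint(0, 10000, (2, 12)))
    print(score.shape, bounds.shape)  # (2, 1) and (2, 20, 2)
```

At inference, a sketch like this would score each candidate tube against the query and pick the best-matching tube, with the temporal head trimming it to the described time span; how STGVT generates and ranks candidate tubes is not specified in the abstract.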