Paper Title

Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding

Authors

Jiachang Hao, Haifeng Sun, Pengfei Ren, Jingyu Wang, Qi Qi, Jianxin Liao

Abstract

Temporal grounding aims to locate a target video moment that semantically corresponds to a given sentence query in an untrimmed video. However, recent works find that existing methods suffer from a severe temporal bias problem: they do not reason about target moment locations based on visual-textual semantic alignment, but instead over-rely on the temporal biases of queries in the training set. To this end, this paper proposes a novel training framework that lets grounding models use shuffled videos to address the temporal bias problem without losing grounding accuracy. The framework introduces two auxiliary tasks, cross-modal matching and temporal order discrimination, to promote grounding model training. The cross-modal matching task leverages the content consistency between shuffled and original videos to force the grounding model to mine visual content that semantically matches the query. The temporal order discrimination task leverages the difference in temporal order to strengthen the model's understanding of long-term temporal context. Extensive experiments on Charades-STA and ActivityNet Captions demonstrate that the method mitigates reliance on temporal biases and strengthens the model's generalization to different temporal distributions. Code is available at https://github.com/haojc/ShufflingVideosForTSG.
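To make the two auxiliary tasks more concrete, below is a minimal PyTorch sketch of the core ideas: shuffling preserves a video's content while destroying its temporal order, so an auxiliary discriminator can be trained to tell original from shuffled clip sequences. This is an illustrative sketch, not the authors' implementation; the names (`shuffle_clips`, `TemporalOrderDiscriminator`) and the GRU-based architecture are our own assumptions, and the actual models live in the repository linked above.

```python
# Illustrative sketch only (hypothetical names); see the linked repo for the
# authors' actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def shuffle_clips(video_feats: torch.Tensor) -> torch.Tensor:
    """Randomly permute the clip (time) axis of a (T, D) feature sequence.

    Content is preserved; only temporal order is destroyed.
    """
    perm = torch.randperm(video_feats.size(0))
    return video_feats[perm]

class TemporalOrderDiscriminator(nn.Module):
    """Binary classifier: is a clip sequence in its original temporal order?"""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # logit > 0 means "original order"

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, T, D); the final GRU state summarizes temporal structure
        _, h = self.encoder(clips)
        return self.head(h[-1]).squeeze(-1)

# Toy usage: original sequences labeled 1, shuffled copies labeled 0.
feats = torch.randn(4, 32, 512)                      # 4 videos, 32 clips each
shuffled = torch.stack([shuffle_clips(v) for v in feats])
disc = TemporalOrderDiscriminator(512)
logits = disc(torch.cat([feats, shuffled]))
labels = torch.cat([torch.ones(4), torch.zeros(4)])
order_loss = F.binary_cross_entropy_with_logits(logits, labels)
```

The cross-modal matching task would analogously score query-video semantic agreement on both the original and shuffled sequences, exploiting the same content consistency under shuffling; its exact formulation is given in the paper and the linked repository.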
