Paper Title
Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning
Paper Authors
Paper Abstract
Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence. Thanks to the semantic diversity of natural language descriptions, temporal grounding allows activity grounding beyond pre-defined classes and has received increasing attention in recent years. The semantic diversity is rooted in the principle of compositionality in linguistics, where novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, current temporal grounding datasets do not specifically test for compositional generalizability. To systematically measure the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG. Evaluating the state-of-the-art methods on our new dataset splits, we empirically find that they fail to generalize to queries with novel combinations of seen words. To tackle this challenge, we propose a variational cross-graph reasoning framework that explicitly decomposes video and language into multiple structured hierarchies and learns fine-grained semantic correspondence among them. Experiments illustrate the superior compositional generalizability of our approach. The repository of this work is at https://github.com/YYJMJC/Compositional-Temporal-Grounding.