Paper title
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Paper authors
Abstract
Temporal sentence grounding in videos (TSGV), also known as natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve the temporal moment in an untrimmed video that semantically corresponds to a language query. Connecting computer vision and natural language, TSGV has drawn significant attention from researchers in both communities. This survey attempts to summarize the fundamental concepts in TSGV and the current research status, as well as future research directions. As background, we present a common structure of functional components in TSGV in a tutorial style: from feature extraction from the raw video and language query, to answer prediction of the target moment. We then review the techniques for multimodal understanding and interaction, which are the key focus of TSGV for effective alignment between the two modalities. We construct a taxonomy of TSGV techniques and elaborate on the methods in different categories, along with their strengths and weaknesses. Lastly, we discuss issues with current TSGV research and share our insights about promising research directions.