Paper Title

X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

Paper Authors

Gorti, Satya Krishna, Vouitsis, Noel, Ma, Junwei, Golestan, Keyvan, Volkovs, Maksims, Garg, Animesh, Yu, Guangwei

Paper Abstract

In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However, videos inherently express a much wider gamut of information than texts. Instead, texts often capture sub-regions of entire videos and are most semantically similar to certain frames within videos. Therefore, for a given text, a retrieval model should focus on the text's most semantically similar video sub-regions to make a more relevant comparison. Yet, most existing works aggregate entire videos without directly considering text. Common text-agnostic aggregation schemes include mean-pooling or self-attention over the frames, but these are likely to encode misleading visual information not described in the given text. To address this, we propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video. Our core mechanism is a scaled dot-product attention for a text to attend to its most semantically similar frames. We then generate an aggregated video representation conditioned on the text's attention weights over the frames. We evaluate our method on the three benchmark datasets MSR-VTT, MSVD and LSMDC, achieving new state-of-the-art results with up to 12% relative improvement in Recall@1. Our findings thereby highlight the importance of joint text-video reasoning to extract important visual cues according to text. Full code and demo can be found at: https://layer6ai-labs.github.io/xpool/
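
To make the core mechanism concrete, below is a minimal sketch of text-conditioned frame pooling in the spirit of X-Pool: a text embedding acts as the query in a scaled dot-product attention over frame embeddings, and the attention weights produce a text-conditioned pooled video representation. The class name, projection layers, and dimensions are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of text-conditioned frame pooling (assumed layout, not the official X-Pool code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextConditionedPooling(nn.Module):
    """Pools frame embeddings into one video embedding, weighted by a text query."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(embed_dim, embed_dim)  # projects the text query
        self.k_proj = nn.Linear(embed_dim, embed_dim)  # projects frame keys
        self.v_proj = nn.Linear(embed_dim, embed_dim)  # projects frame values
        self.scale = embed_dim ** -0.5

    def forward(self, text_emb: torch.Tensor, frame_embs: torch.Tensor) -> torch.Tensor:
        # text_emb:   (batch, embed_dim)          one embedding per text query
        # frame_embs: (batch, frames, embed_dim)  one embedding per video frame
        q = self.q_proj(text_emb).unsqueeze(1)   # (batch, 1, dim)
        k = self.k_proj(frame_embs)              # (batch, frames, dim)
        v = self.v_proj(frame_embs)              # (batch, frames, dim)

        # Scaled dot-product attention: the text attends to its most similar frames.
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (batch, 1, frames)
        pooled = (attn @ v).squeeze(1)           # (batch, dim) text-conditioned video embedding
        return pooled


# Usage: the text-video similarity can then be, e.g., the cosine similarity
# between the text embedding and the text-conditioned pooled video embedding.
pool = TextConditionedPooling(embed_dim=512)
text_emb = torch.randn(4, 512)         # 4 text queries
frame_embs = torch.randn(4, 12, 512)   # 12 frames per video
video_emb = pool(text_emb, frame_embs)
sim = F.cosine_similarity(text_emb, video_emb)  # one score per text-video pair
```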
