Paper Title

Bridging Video-text Retrieval with Multiple Choice Questions

Paper Authors

Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, Xiaohu Qie, Ping Luo

Paper Abstract

Pre-training a model to learn transferable video-text representations for retrieval has attracted a lot of attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to interact videos with texts, but results in low efficiency since each text-video pair needs to be fed into the model. In this work, we enable fine-grained video-text interactions while maintaining high efficiency for retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ), where a parametric module, BridgeFormer, is trained to answer the "questions" constructed from the text features by resorting to the video features. Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to build questions, with which the video encoder can be trained to capture more regional content and temporal dynamics. In the form of questions and answers, the semantic associations between local video-text features can be properly established. BridgeFormer can be removed for downstream retrieval, rendering an efficient and flexible model with only two encoders. Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task on five datasets with different experimental setups (i.e., zero-shot and fine-tuning), including HowTo100M (one million videos). We further conduct zero-shot action recognition, which can be cast as video-to-text retrieval, and our approach also significantly surpasses its counterparts. As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
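The MCQ pretext task described in the abstract can be pictured with a small sketch: a bridge module takes the text features of a "question" (the caption with a noun or verb phrase erased), attends to the video features via cross-attention, and its pooled "answer" is contrastively matched against the embedding of the erased phrase. The sketch below is illustrative only; it assumes generic transformer-style token features and an InfoNCE-style objective, and the module names, dimensions, and pooling choice are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BridgeFormerSketch(nn.Module):
    """Illustrative bridge module: answers text-built 'questions' by
    cross-attending to video features, then pools an answer embedding."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, question_tokens, video_tokens):
        # question_tokens: (B, Lq, D) text features with a noun/verb phrase erased
        # video_tokens:    (B, Lv, D) patch/frame features from the video encoder
        answer, _ = self.cross_attn(question_tokens, video_tokens, video_tokens)
        return self.proj(answer.mean(dim=1))  # (B, D) pooled "answer"


def info_nce(a, b, temperature=0.05):
    """Symmetric contrastive loss between two batches of embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


# Toy forward pass with random tensors standing in for the encoders' outputs.
B, Lq, Lv, D = 8, 12, 32, 256
bridge = BridgeFormerSketch(dim=D)
question_feats = torch.randn(B, Lq, D)  # caption with a noun/verb phrase masked out
video_feats = torch.randn(B, Lv, D)     # output tokens of the video encoder
answer_target = torch.randn(B, D)       # embedding of the erased noun/verb phrase

answer_pred = bridge(question_feats, video_feats)
loss = info_nce(answer_pred, answer_target)
loss.backward()
```

Because the bridge module is only used to pose and answer these questions during pre-training, it can be discarded for downstream retrieval, where only the two encoders' pooled embeddings are compared, which is what keeps retrieval efficient.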
