Paper Title

Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video

Paper Authors

Bin Li, Yixuan Weng, Bin Sun, Shutao Li

Paper Abstract

Temporal answering grounding in video (TAGV) is a new task naturally derived from temporal sentence grounding in video (TSGV). Given an untrimmed video and a text question, the task aims to locate the span of the video that semantically answers the question. Existing methods tend to formulate the TAGV task as visual span-based question answering (QA), matching the span of visual frames queried by the text question. However, due to the weak correlations and large gaps between the semantic features of the textual question and the visual answer, existing methods that adopt a visual span predictor perform poorly on the TAGV task. To bridge these gaps, we propose the visual-prompt text span localizing (VPTSL) method, which introduces timestamped subtitles as a passage for text span localization against the input question, and prompts visual highlight features into a pre-trained language model (PLM) to enhance the joint semantic representations. Specifically, context-query attention is utilized to perform cross-modal interaction between the extracted textual and visual features, and the highlight features for the visual prompt are then obtained through video-text highlighting. To alleviate the semantic differences between textual and visual features, we design a text span predictor that encodes the question, the subtitles, and the prompted visual highlight features with the PLM. As a result, the TAGV task is formulated as predicting the span of subtitles that matches the visual answer. Extensive experiments on the medical instructional dataset MedVidQA show that the proposed VPTSL outperforms the state-of-the-art (SOTA) method by a large margin of 28.36% in mIoU, demonstrating the effectiveness of the proposed visual prompt and text span predictor.
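To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the three components it names: context-query attention for cross-modal interaction, video-text highlighting that scores frames to form the visual prompt, and a span head that predicts start/end positions over the subtitle tokens. This is an illustrative sketch, not the authors' implementation: the class names (`ContextQueryAttention`, `VPTSLSketch`), the feature dimensions, and in particular the simplification of concatenating the prompt with subtitle embeddings directly (a real system would feed the joint sequence through the PLM encoder) are all assumptions.

```python
# Minimal sketch of the VPTSL pipeline from the abstract (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextQueryAttention(nn.Module):
    """Cross-modal interaction between visual (context) and textual (query) features."""

    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.Linear(3 * dim, 1, bias=False)  # trilinear similarity function

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # video: (B, T, D) frame features; text: (B, L, D) question token features
        B, T, D = video.shape
        L = text.size(1)
        v = video.unsqueeze(2).expand(B, T, L, D)
        q = text.unsqueeze(1).expand(B, T, L, D)
        s = self.w(torch.cat([v, q, v * q], dim=-1)).squeeze(-1)  # (B, T, L)
        a = F.softmax(s, dim=2)                  # video-to-text attention
        b = F.softmax(s, dim=1)                  # text-to-video attention
        v2t = torch.bmm(a, text)                                   # (B, T, D)
        t2v = torch.bmm(torch.bmm(a, b.transpose(1, 2)), video)    # (B, T, D)
        return torch.cat([video, v2t, video * v2t, video * t2v], dim=-1)


class VPTSLSketch(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.cqa = ContextQueryAttention(dim)
        self.highlight_head = nn.Linear(4 * dim, 1)  # per-frame highlight score
        self.prompt_proj = nn.Linear(4 * dim, dim)   # map highlights into PLM space
        self.span_head = nn.Linear(dim, 2)           # start/end logits per token

    def forward(self, video_feats, question_feats, subtitle_token_embeds):
        # 1) Cross-modal interaction between video frames and the question.
        fused = self.cqa(video_feats, question_feats)            # (B, T, 4D)
        # 2) Video-text highlighting: score which frames answer the question.
        highlight = torch.sigmoid(self.highlight_head(fused))    # (B, T, 1)
        prompt = self.prompt_proj(fused * highlight)             # (B, T, D)
        # 3) Prepend the visual-prompt tokens to the subtitle token embeddings;
        #    a real system would run this joint sequence through the PLM encoder.
        joint = torch.cat([prompt, subtitle_token_embeds], dim=1)
        # 4) Text span prediction over subtitle positions only.
        logits = self.span_head(joint[:, video_feats.size(1):])  # (B, L_sub, 2)
        start_logits, end_logits = logits.unbind(dim=-1)
        return highlight.squeeze(-1), start_logits, end_logits


if __name__ == "__main__":
    model = VPTSLSketch(dim=768)
    video = torch.randn(2, 32, 768)       # 32 frame features per clip
    question = torch.randn(2, 16, 768)    # 16 question token embeddings
    subtitles = torch.randn(2, 128, 768)  # 128 subtitle token embeddings
    hl, s, e = model(video, question, subtitles)
    print(hl.shape, s.shape, e.shape)     # (2, 32) (2, 128) (2, 128)
```

At inference, the predicted subtitle span can be mapped back to a video segment through the subtitles' timestamps, which is how a text span answer grounds a temporal visual answer in this formulation.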
