Paper Title

Visual Commonsense-aware Representation Network for Video Captioning

Paper Authors

Pengpeng Zeng, Haonan Zhang, Lianli Gao, Xiangpeng Li, Jin Qian, Heng Tao Shen

Paper Abstract


Generating consecutive descriptions for videos, i.e., Video Captioning, requires taking full advantage of the visual representation as well as the generation process. Existing video captioning methods focus on exploring spatial-temporal representations and their relationships to produce inferences. However, such methods only exploit the superficial associations contained in the video itself, without considering the intrinsic visual commonsense knowledge present in the video dataset, which may limit their capacity for knowledge cognition and thus for reasoning out accurate descriptions. To address this problem, we propose a simple yet effective method for video captioning, called the Visual Commonsense-aware Representation Network (VCRN). Specifically, we construct a Video Dictionary, a plug-and-play component obtained by clustering all video features from the entire dataset into multiple cluster centers, without requiring additional annotation. Each center implicitly represents a visual commonsense concept in the video domain and is utilized in our proposed Visual Concept Selection (VCS) to obtain a video-related concept feature. Next, a Conceptual Integration Generation (CIG) module is proposed to enhance caption generation. Extensive experiments on three public video captioning benchmarks, MSVD, MSR-VTT, and VATEX, demonstrate that our method reaches state-of-the-art performance, indicating its effectiveness. In addition, our approach can be integrated into an existing video question answering method and improves its performance, further showing the generalization ability of our method. Source code has been released at https://github.com/zchoi/VCRN.
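To make the pipeline described above concrete, the following is a minimal sketch of the Video Dictionary construction and Visual Concept Selection steps, assuming pooled per-video feature vectors. The function names, the choice of k-means from scikit-learn, and the softmax-weighted selection are illustrative assumptions only; they are not the authors' released implementation, which is available at the repository linked in the abstract.

```python
# Minimal sketch: Video Dictionary via clustering + Visual Concept Selection.
# Assumptions: video features are pooled to one vector per video; k-means and
# softmax-weighted selection are illustrative choices, not the paper's code.
import numpy as np
from sklearn.cluster import KMeans


def build_video_dictionary(video_features: np.ndarray, num_centers: int = 512) -> np.ndarray:
    """Cluster all video features in the dataset into centers.

    Each center is treated as an implicit visual commonsense concept.
    video_features: (num_videos, feature_dim) array, one pooled vector per video.
    Returns: (num_centers, feature_dim) array of cluster centers.
    """
    kmeans = KMeans(n_clusters=num_centers, n_init=10, random_state=0)
    kmeans.fit(video_features)
    return kmeans.cluster_centers_


def select_concept_feature(video_feature: np.ndarray, dictionary: np.ndarray) -> np.ndarray:
    """Soft-select a video-related concept feature from the dictionary.

    Computes cosine similarity between the query video feature and every
    concept center, applies a softmax, and returns the weighted sum of centers.
    """
    q = video_feature / (np.linalg.norm(video_feature) + 1e-8)
    d = dictionary / (np.linalg.norm(dictionary, axis=1, keepdims=True) + 1e-8)
    sims = d @ q                                   # (num_centers,) cosine similarities
    weights = np.exp(sims) / np.exp(sims).sum()    # softmax attention weights
    return weights @ dictionary                    # (feature_dim,) concept feature


if __name__ == "__main__":
    feats = np.random.randn(1000, 256).astype(np.float32)   # toy pooled video features
    dictionary = build_video_dictionary(feats, num_centers=64)
    concept = select_concept_feature(feats[0], dictionary)
    print(dictionary.shape, concept.shape)                   # (64, 256) (256,)
```

In this reading, the dictionary is built once over the whole training set (plug-and-play, no extra annotation), and the selected concept feature would then be fused with the per-video representation inside the caption generator, which is the role the abstract assigns to the CIG module.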
