Paper Title
Discourse Analysis for Evaluating Coherence in Video Paragraph Captions
Paper Authors
Paper Abstract
Video paragraph captioning is the task of automatically generating a coherent paragraph description of the actions in a video. Previous linguistic studies have demonstrated that the coherence of a natural language text is reflected in its discourse structure and relations. However, existing video captioning methods evaluate the coherence of generated paragraphs merely by comparing them against human paragraph annotations, and fail to reason about the underlying discourse structure. At UCLA, we are currently exploring a novel discourse-based framework to evaluate the coherence of video paragraphs. Central to our approach is the discourse representation of videos, which helps in modeling the coherence of paragraphs conditioned on the coherence of videos. We also introduce DisNet, a novel dataset containing the proposed visual discourse annotations for 3000 videos and their paragraphs. Our experimental results show that the proposed framework evaluates the coherence of video paragraphs significantly better than all the baseline methods. We believe that many other multidisciplinary Artificial Intelligence problems, such as Visual Dialog and Visual Storytelling, would also greatly benefit from the proposed visual discourse framework and the DisNet dataset.