Paper Title

Visual Abductive Reasoning

Authors

Chen Liang, Wenguan Wang, Tianfei Zhou, Yi Yang

Abstract

Abductive reasoning seeks the likeliest possible explanation for partial observations. Although abduction is frequently employed in human daily reasoning, it is rarely explored in the computer vision literature. In this paper, we propose a new task and dataset, Visual Abductive Reasoning (VAR), for examining the abductive reasoning ability of machine intelligence in everyday visual situations. Given an incomplete set of visual events, AI systems are required not only to describe what is observed, but also to infer the hypothesis that best explains the visual premise. Based on our large-scale VAR dataset, we devise a strong baseline model, Reasoner (causal-and-cascaded reasoning Transformer). First, to capture the causal structure of the observations, a contextualized directional position embedding strategy is adopted in the encoder, yielding discriminative representations for the premise and hypothesis. Then, multiple decoders are cascaded to generate and progressively refine the premise and hypothesis sentences. The prediction scores of the sentences are used to guide cross-sentence information flow in the cascaded reasoning procedure. Our VAR benchmarking results show that Reasoner surpasses many well-known video-language models, while still falling far behind human performance. This work is expected to foster future efforts in the reasoning-beyond-observation paradigm.
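To make the two mechanisms named in the abstract more concrete, below is a minimal PyTorch sketch of (1) a direction-aware position embedding that encodes each observed event's signed offset from the masked hypothesis slot, and (2) a cascade of decoder stages whose per-sentence confidence scores gate how much each stage's new draft replaces the previous one. All class names, the gating rule, and the hyperparameters are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn


class DirectionalPositionEmbedding(nn.Module):
    """Adds a learned embedding of each event's signed offset from the
    hypothesis position, so events before and after it get distinct codes."""

    def __init__(self, dim: int, max_offset: int = 32):
        super().__init__()
        self.max_offset = max_offset
        self.table = nn.Embedding(2 * max_offset + 1, dim)

    def forward(self, feats: torch.Tensor, hyp_index: int) -> torch.Tensor:
        # feats: (batch, num_events, dim); hyp_index marks the masked slot.
        offsets = torch.arange(feats.size(1), device=feats.device) - hyp_index
        offsets = offsets.clamp(-self.max_offset, self.max_offset)
        return feats + self.table(offsets + self.max_offset)


class CascadedDecoder(nn.Module):
    """Stacks decoder stages; a sigmoid confidence score per sentence
    decides how much of each new draft to keep."""

    def __init__(self, dim: int, num_stages: int = 3, num_heads: int = 4):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
            for _ in range(num_stages)
        )
        self.scorer = nn.Linear(dim, 1)  # per-sentence confidence head

    def forward(self, sent_queries: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        for stage in self.stages:
            draft = stage(sent_queries, memory)
            conf = torch.sigmoid(self.scorer(draft))  # (batch, sents, 1)
            # High-confidence drafts are kept; low-confidence ones fall
            # back toward the previous stage's estimate.
            sent_queries = conf * draft + (1 - conf) * sent_queries
        return sent_queries


# Toy usage: five events of width 64, hypothesis masked at index 2.
encoder_pe = DirectionalPositionEmbedding(dim=64)
decoder = CascadedDecoder(dim=64)
events = torch.randn(1, 5, 64)
memory = encoder_pe(events, hyp_index=2)
refined = decoder(torch.randn(1, 5, 64), memory)  # (1, 5, 64)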
