Paper Title

Multimodal grid features and cell pointers for Scene Text Visual Question Answering

Authors

Lluís Gómez, Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Marçal Rusiñol, Ernest Valveny, Dimosthenis Karatzas

Abstract

This paper presents a new model for the task of scene text visual question answering, in which questions about a given image can only be answered by reading and understanding the scene text present in it. The proposed model is based on an attention mechanism that attends to multi-modal features conditioned on the question, allowing it to reason jointly about the textual and visual modalities in the scene. The output weights of this attention module over the grid of multi-modal spatial features are interpreted as the probability that a given spatial location of the image contains the answer text to the given question. Our experiments demonstrate competitive performance on two standard datasets. Furthermore, this paper provides a novel analysis of the ST-VQA dataset based on a human performance study.
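The core mechanism the abstract describes, question-conditioned attention whose softmax weights over a grid of multimodal features are read as per-cell answer probabilities, can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the additive attention form, the projection matrices `W_g`, `W_q`, `w_s`, and all dimensions are assumptions introduced for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cell_attention(grid_feats, question_vec, W_g, W_q, w_s):
    """Question-conditioned attention over a grid of multimodal features.

    grid_feats:   (H, W, D) fused visual + scene-text features, one vector per cell
    question_vec: (Dq,) encoded question
    W_g, W_q, w_s: hypothetical learned projections (D, K), (Dq, K), (K,)

    Returns an (H, W) map of attention weights, interpreted as the
    probability that each spatial cell contains the answer text.
    """
    H, W, D = grid_feats.shape
    g = grid_feats.reshape(H * W, D) @ W_g   # project each grid cell
    q = question_vec @ W_q                   # project the question
    scores = np.tanh(g + q) @ w_s            # additive attention scores per cell
    return softmax(scores).reshape(H, W)     # normalize over all H*W cells

# Toy usage with random features: the result is a proper distribution
# over the 4x4 grid, so the weights sum to 1.
rng = np.random.default_rng(0)
H, W, D, Dq, K = 4, 4, 8, 6, 5
att = cell_attention(rng.normal(size=(H, W, D)), rng.normal(size=Dq),
                     rng.normal(size=(D, K)), rng.normal(size=(Dq, K)),
                     rng.normal(size=K))
print(att.shape)          # (4, 4)
print(round(att.sum(), 6))  # 1.0
```

At inference time, the answer would be read off from the text token located in the highest-weight cell; training such a model reduces to supervising these weights with the known answer location.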
