Paper Title

RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering

Authors

Zan-Xia Jin, Heran Wu, Chun Yang, Fang Zhou, Jingyan Qin, Lei Xiao, Xu-Cheng Yin

Abstract

Text-based visual question answering (VQA) requires reading and understanding text in an image to correctly answer a given question. However, most current methods simply feed optical character recognition (OCR) tokens extracted from the image into the VQA model, without considering the contextual information of the OCR tokens or mining the relationships between OCR tokens and scene objects. In this paper, we propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA. Taking an image and a question as input, RUArt first reads the image and obtains text and scene objects. Then, it understands the question, the OCRed text and the objects in the context of the scene, and further mines the relationships among them. Finally, it answers the given question with the related text through text semantic matching and reasoning. We evaluate RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies to explore the reasons behind RUArt's effectiveness. Experimental results demonstrate that our method can effectively exploit the contextual information of the text and mine stable relationships between the text and objects.
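The three-stage pipeline described in the abstract (read the image for OCR tokens and scene objects, mine text-object relationships, then answer by matching the question against the related text) can be sketched as toy code. This is a minimal illustrative mock, not the paper's actual model: all function names are hypothetical, and the word-overlap "semantic matching" stands in for RUArt's learned matching and reasoning.

```python
# Hypothetical sketch of the RUArt-style pipeline from the abstract.
# The overlap-based scoring below is an illustrative assumption, not
# the paper's method.

def read_image(image):
    """Stage 1: stand-in for OCR and object detection on the image."""
    return image["ocr_tokens"], image["objects"]

def mine_relations(ocr_tokens, objects):
    """Stage 2 (toy): associate each OCR token with a scene object."""
    return {tok: objects[i % len(objects)] for i, tok in enumerate(ocr_tokens)}

def answer(question, ocr_tokens, relations):
    """Stage 3 (toy): score each OCR token by word overlap between the
    question and the token plus its related object; return the best token."""
    q_words = set(question.lower().split())
    def score(tok):
        context = set(tok.lower().split()) | set(relations[tok].lower().split())
        return len(q_words & context)
    return max(ocr_tokens, key=score)

# Toy usage: the question mentions "bottle", so the OCR token related to
# the bottle object is selected as the answer.
image = {"ocr_tokens": ["coca cola", "exit"], "objects": ["bottle", "door"]}
tokens, objects = read_image(image)
relations = mine_relations(tokens, objects)
print(answer("what brand is on the bottle", tokens, relations))  # -> coca cola
```

The point of the sketch is the text-centered flow: the answer is chosen from the OCRed text itself, with scene objects serving only as context for matching, which mirrors how the abstract frames RUArt.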
