Paper Title
Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding
Paper Authors
Paper Abstract
Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard benchmarks for VrDU, the way they model and exploit the interactions between vision and language on documents has hindered them from achieving better generalization and higher accuracy. In this work, we investigate the problem of vision-language joint representation learning for VrDU, mainly from the perspective of supervisory signals. Specifically, a pre-training paradigm called Bi-VLDoc is proposed, in which a bidirectional vision-language supervision strategy and a vision-language hybrid-attention mechanism are devised to fully explore and utilize the interactions between these two modalities, so as to learn stronger cross-modal document representations with richer semantics. Benefiting from the learned informative cross-modal document representations, Bi-VLDoc significantly advances the state-of-the-art performance on three widely-used document understanding benchmarks, including Form Understanding (from 85.14% to 93.44%), Receipt Information Extraction (from 96.01% to 97.84%), and Document Classification (from 96.08% to 97.12%). On Document Visual QA, Bi-VLDoc achieves state-of-the-art performance compared to previous single-model methods.
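The abstract only names the vision-language hybrid-attention mechanism without specifying its architecture. The following is a minimal, hypothetical sketch of what such a cross-modal attention block could look like; the module names, dimensions, fusion rule, and the use of symmetric text-to-vision and vision-to-text attention are assumptions for illustration, not Bi-VLDoc's actual design.

```python
# Hypothetical sketch of a vision-language hybrid-attention block.
# All module/parameter names and the fusion rule are assumptions;
# the paper's abstract does not specify these details.
import torch
import torch.nn as nn


class HybridAttentionBlock(nn.Module):
    """Fuses text and visual token features with cross-modal attention.

    Each modality attends to the other (text -> vision and vision -> text),
    and the attended context is merged back via a residual connection, so
    supervision applied to either modality can shape both representations.
    """

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.text_to_vision = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vision_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_text = nn.LayerNorm(dim)
        self.norm_vision = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, vision_feats: torch.Tensor):
        # Text queries attend over visual keys/values, and vice versa.
        text_attended, _ = self.text_to_vision(text_feats, vision_feats, vision_feats)
        vision_attended, _ = self.vision_to_text(vision_feats, text_feats, text_feats)
        # Residual fusion keeps each modality's own signal alongside the
        # cross-modal context gathered from the other stream.
        fused_text = self.norm_text(text_feats + text_attended)
        fused_vision = self.norm_vision(vision_feats + vision_attended)
        return fused_text, fused_vision


if __name__ == "__main__":
    block = HybridAttentionBlock()
    text = torch.randn(2, 128, 768)    # (batch, text tokens, hidden dim)
    vision = torch.randn(2, 196, 768)  # (batch, visual patches, hidden dim)
    fused_text, fused_vision = block(text, vision)
    print(fused_text.shape, fused_vision.shape)
```

In a bidirectional supervision setup as described in the abstract, pre-training losses would presumably be attached to both fused streams, so that the visual branch is supervised by language signals and the language branch by visual signals; the exact objectives are not given in this excerpt.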