Paper Title

Vision-Language Pre-Training for Boosting Scene Text Detectors

Authors

Sibo Song, Jianqiang Wan, Zhibo Yang, Jun Tang, Wenqing Cheng, Xiang Bai, Cong Yao

Abstract

Recently, vision-language joint representation learning has proven to be highly effective in various scenarios. In this paper, we specifically adapt vision-language joint learning for scene text detection, a task that intrinsically involves cross-modal interaction between the two modalities, vision and language, since text is the written form of language. Concretely, we propose to learn contextualized, joint representations through vision-language pre-training, for the sake of enhancing the performance of scene text detectors. Towards this end, we devise a pre-training architecture with an image encoder, a text encoder, and a cross-modal encoder, as well as three pretext tasks: image-text contrastive learning (ITC), masked language modeling (MLM), and word-in-image prediction (WIP). The pre-trained model is able to produce more informative representations with richer semantics, which could readily benefit existing scene text detectors (such as EAST and PSENet) in the downstream text detection task. Extensive experiments on standard benchmarks demonstrate that the proposed paradigm can significantly improve the performance of various representative text detectors, outperforming previous pre-training approaches. The code and pre-trained models will be publicly released.
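For readers who want a concrete picture of the recipe sketched in the abstract, below is a minimal PyTorch sketch of how an image encoder, a text encoder, and a cross-modal encoder could be combined with the three pretext losses (ITC, MLM, WIP). Every concrete choice here is an assumption made for illustration, not the paper's actual implementation: the class name VisionLanguagePretrainer, the tiny CNN and Transformer backbones, the 0.07 ITC temperature, and the first-token pooling are all placeholders.

```python
# Illustrative sketch only: backbones, dimensions, and helper names are
# assumptions, not the architecture released with the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionLanguagePretrainer(nn.Module):
    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        # Image encoder: a tiny CNN standing in for the paper's backbone.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Text encoder: token embeddings + a shallow Transformer encoder.
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Cross-modal encoder: text tokens cross-attend to the image feature.
        self.cross_encoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mlm_head = nn.Linear(dim, vocab_size)  # MLM: recover masked tokens
        self.wip_head = nn.Linear(dim, 2)           # WIP: is the word in the image?

    def forward(self, images, token_ids, mlm_labels, wip_labels):
        img_feat = F.normalize(self.image_encoder(images), dim=-1)  # (B, D)
        txt_tokens = self.text_encoder(self.token_emb(token_ids))   # (B, L, D)
        txt_feat = F.normalize(txt_tokens[:, 0], dim=-1)            # first-token pooling

        # ITC: contrastive alignment of paired image/text features.
        logits = img_feat @ txt_feat.t() / 0.07  # 0.07 temperature is an assumption
        targets = torch.arange(images.size(0), device=images.device)
        itc = (F.cross_entropy(logits, targets)
               + F.cross_entropy(logits.t(), targets)) / 2

        # Fuse the two modalities before the token-level pretext tasks.
        fused = self.cross_encoder(txt_tokens, img_feat.unsqueeze(1))

        # MLM: predict masked tokens from the fused representation
        # (positions labeled -100 are unmasked and ignored by the loss).
        mlm = F.cross_entropy(self.mlm_head(fused).flatten(0, 1),
                              mlm_labels.flatten(), ignore_index=-100)

        # WIP: binary prediction of whether the query word appears in the image.
        wip = F.cross_entropy(self.wip_head(fused[:, 0]), wip_labels)

        return itc + mlm + wip  # unweighted sum; the paper's loss weights are unknown


# Smoke test on random data.
model = VisionLanguagePretrainer()
loss = model(
    images=torch.randn(4, 3, 64, 64),
    token_ids=torch.randint(0, 30522, (4, 16)),
    mlm_labels=torch.randint(0, 30522, (4, 16)),  # in practice mostly -100
    wip_labels=torch.randint(0, 2, (4,)),
)
loss.backward()
```

After pre-training, only the image encoder would be kept and used to initialize the backbone of a detector such as EAST or PSENet; that transfer step is implied by the abstract but not shown in this sketch.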
