Paper Title

Vision-Language Pre-Training for Boosting Scene Text Detectors

Authors

Sibo Song, Jianqiang Wan, Zhibo Yang, Jun Tang, Wenqing Cheng, Xiang Bai, Cong Yao

Abstract

Recently, vision-language joint representation learning has proven to be highly effective in various scenarios. In this paper, we specifically adapt vision-language joint learning for scene text detection, a task that intrinsically involves cross-modal interaction between the two modalities, vision and language, since text is the written form of language. Concretely, we propose to learn contextualized, joint representations through vision-language pre-training, for the sake of enhancing the performance of scene text detectors. Towards this end, we devise a pre-training architecture with an image encoder, a text encoder, and a cross-modal encoder, as well as three pretext tasks: image-text contrastive learning (ITC), masked language modeling (MLM), and word-in-image prediction (WIP). The pre-trained model is able to produce more informative representations with richer semantics, which could readily benefit existing scene text detectors (such as EAST and PSENet) in the downstream text detection task. Extensive experiments on standard benchmarks demonstrate that the proposed paradigm can significantly improve the performance of various representative text detectors, outperforming previous pre-training approaches. The code and pre-trained models will be publicly released.
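For readers who want a concrete picture of the recipe sketched in the abstract, below is a minimal PyTorch sketch of how an image encoder, a text encoder, and a cross-modal encoder could be combined with the three pretext losses (ITC, MLM, WIP). Every concrete choice here is an assumption made for illustration, not the paper's actual implementation: the class name VisionLanguagePretrainer, the tiny CNN and Transformer backbones, the 0.07 ITC temperature, and the first-token pooling are all placeholders.

```python
# Illustrative sketch only: backbones, dimensions, and helper names are
# assumptions, not the architecture released with the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionLanguagePretrainer(nn.Module):
    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        # Image encoder: a tiny CNN standing in for the paper's backbone.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Text encoder: token embeddings + a shallow Transformer encoder.
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Cross-modal encoder: text tokens cross-attend to the image feature.
        self.cross_encoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mlm_head = nn.Linear(dim, vocab_size)  # MLM: recover masked tokens
        self.wip_head = nn.Linear(dim, 2)           # WIP: is the word in the image?

    def forward(self, images, token_ids, mlm_labels, wip_labels):
        img_feat = F.normalize(self.image_encoder(images), dim=-1)  # (B, D)
        txt_tokens = self.text_encoder(self.token_emb(token_ids))   # (B, L, D)
        txt_feat = F.normalize(txt_tokens[:, 0], dim=-1)            # first-token pooling

        # ITC: contrastive alignment of paired image/text features.
        logits = img_feat @ txt_feat.t() / 0.07  # 0.07 temperature is an assumption
        targets = torch.arange(images.size(0), device=images.device)
        itc = (F.cross_entropy(logits, targets)
               + F.cross_entropy(logits.t(), targets)) / 2

        # Fuse the two modalities before the token-level pretext tasks.
        fused = self.cross_encoder(txt_tokens, img_feat.unsqueeze(1))

        # MLM: predict masked tokens from the fused representation
        # (positions labeled -100 are unmasked and ignored by the loss).
        mlm = F.cross_entropy(self.mlm_head(fused).flatten(0, 1),
                              mlm_labels.flatten(), ignore_index=-100)

        # WIP: binary prediction of whether the query word appears in the image.
        wip = F.cross_entropy(self.wip_head(fused[:, 0]), wip_labels)

        return itc + mlm + wip  # unweighted sum; the paper's loss weights are unknown


# Smoke test on random data.
model = VisionLanguagePretrainer()
loss = model(
    images=torch.randn(4, 3, 64, 64),
    token_ids=torch.randint(0, 30522, (4, 16)),
    mlm_labels=torch.randint(0, 30522, (4, 16)),  # in practice mostly -100
    wip_labels=torch.randint(0, 2, (4,)),
)
loss.backward()
```

After pre-training, only the image encoder would be kept and used to initialize the backbone of a detector such as EAST or PSENet; that transfer step is implied by the abstract but not shown in this sketch.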
