Paper Title

ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

Paper Authors

Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti

Paper Abstract

In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them. The model is pre-trained on four tasks simultaneously: Masked Language Modeling (MLM), Masked Object Classification (MOC), Masked Region Feature Regression (MRFR), and Image Text Matching (ITM). To further enhance the pre-training quality, we have collected a Large-scale weAk-supervised Image-Text (LAIT) dataset from Web. We first pre-train the model on this dataset, then conduct a second stage pre-training on Conceptual Captions and SBU Captions. Our experiments show that multi-stage pre-training strategy outperforms single-stage pre-training. We also fine-tune and evaluate our pre-trained ImageBERT model on image retrieval and text retrieval tasks, and achieve new state-of-the-art results on both MSCOCO and Flickr30k datasets.
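For orientation, the four pre-training objectives named in the abstract (MLM, MOC, MRFR, ITM) can be viewed as separate heads on top of a shared Transformer encoder, each contributing a loss term. The snippet below is a minimal sketch of how such a multi-task loss could be combined; the equal weighting, the module name `MultiTaskPretrainingLoss`, and all argument names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiTaskPretrainingLoss(nn.Module):
    """Hypothetical sketch: combine four pre-training losses of the kind
    described in the abstract.

    MLM  - Masked Language Modeling        (cross-entropy over vocabulary)
    MOC  - Masked Object Classification    (cross-entropy over object classes)
    MRFR - Masked Region Feature Regression (regression on region features)
    ITM  - Image-Text Matching             (binary classification)

    Equal weighting of the terms is an assumption for illustration.
    """

    def __init__(self):
        super().__init__()
        # Positions that are not masked are labeled -100 and skipped.
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)
        self.bce = nn.BCEWithLogitsLoss()
        self.mse = nn.MSELoss()

    def forward(self, mlm_logits, mlm_labels,
                moc_logits, moc_labels,
                mrfr_pred, mrfr_target,
                itm_logits, itm_labels):
        # Flatten (batch, seq, classes) -> (batch*seq, classes) for token-level CE.
        loss_mlm = self.ce(mlm_logits.flatten(0, 1), mlm_labels.flatten())
        loss_moc = self.ce(moc_logits.flatten(0, 1), moc_labels.flatten())
        # Regress masked region features toward their original values.
        loss_mrfr = self.mse(mrfr_pred, mrfr_target)
        # Binary image-text matching decision per image-text pair.
        loss_itm = self.bce(itm_logits.squeeze(-1), itm_labels.float())
        return loss_mlm + loss_moc + loss_mrfr + loss_itm
```

In a setup like this, only masked token and region positions contribute to the MLM, MOC, and MRFR terms (handled above via the `ignore_index` convention and by passing only masked-region predictions), while the ITM term is computed once per image-text pair.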
