Paper Title
mc-BEiT: Multi-choice Discretization for Image BERT Pre-training
Paper Authors
Paper Abstract
Image BERT pre-training with masked image modeling (MIM) has become a popular practice for self-supervised representation learning. A seminal work, BEiT, casts MIM as a classification task over a visual vocabulary, tokenizing the continuous visual signals into discrete vision tokens using a pre-learned dVAE. Despite being a feasible solution, this improper discretization hinders further improvements of image pre-training. Since image discretization has no ground-truth answers, we argue that a masked patch should not be assigned a unique token id even if a better tokenizer can be obtained. In this work, we introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs the MIM proxy task towards eased and refined multi-choice training objectives. Specifically, the multi-choice supervision for the masked image patches is formed by the soft probability vectors over the discrete token ids, which are predicted by the off-the-shelf image tokenizer and further refined by high-level inter-patch perceptions, based on the observation that similar patches should share their choices. Extensive experiments on classification, segmentation, and detection tasks demonstrate the superiority of our method, e.g., the pre-trained ViT-B achieves 84.1% top-1 fine-tuning accuracy on ImageNet-1K classification, 49.2% AP^b and 44.0% AP^m for object detection and instance segmentation on COCO, and 50.8% mIoU on ADE20K semantic segmentation, outperforming competitive counterparts. The code will be available at https://github.com/lixiaotong97/mc-BEiT.
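To make the abstract's target-construction idea concrete, below is a minimal PyTorch sketch of how soft multi-choice MIM targets could be formed: tokenizer logits are softened into probability vectors over token ids and then refined by propagating them among similar patches. All names (multichoice_targets, mim_loss, tok_logits, patch_feats, tau, omega) and the specific mixing scheme are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def multichoice_targets(tok_logits, patch_feats, tau=1.0, omega=0.8):
    """Sketch of building soft multi-choice MIM targets (hypothetical names/shapes).

    tok_logits : (B, N, V) logits over the visual vocabulary from an
                 off-the-shelf image tokenizer (e.g., a pre-learned dVAE).
    patch_feats: (B, N, D) high-level patch features used to measure
                 inter-patch similarity.
    tau        : temperature softening the tokenizer predictions.
    omega      : weight balancing the tokenizer term and the refined term.
    """
    # Soft probability vectors over token ids instead of a hard one-hot id.
    p_tok = F.softmax(tok_logits / tau, dim=-1)                # (B, N, V)

    # Inter-patch affinities: similar patches should share their "choices".
    feats = F.normalize(patch_feats, dim=-1)
    sim = F.softmax(feats @ feats.transpose(1, 2), dim=-1)    # (B, N, N)

    # Propagate tokenizer predictions among similar patches.
    p_refined = sim @ p_tok                                    # (B, N, V)

    # Final multi-choice supervision: eased tokenizer term + refined term.
    return omega * p_tok + (1.0 - omega) * p_refined


def mim_loss(pred_logits, targets, mask):
    """Soft cross-entropy over masked patches only (mask: (B, N) bool)."""
    logp = F.log_softmax(pred_logits, dim=-1)
    loss = -(targets * logp).sum(-1)                           # (B, N)
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```

In this sketch, the MIM head predicts a distribution over the visual vocabulary for each masked patch, and the soft cross-entropy replaces the hard classification loss against a single token id.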