Paper Title
mc-BEiT: Multi-choice Discretization for Image BERT Pre-training
Paper Authors
Paper Abstract
Image BERT pre-training with masked image modeling (MIM) has become a popular practice for self-supervised representation learning. A seminal work, BEiT, casts MIM as a classification task over a visual vocabulary, tokenizing the continuous visual signals into discrete vision tokens using a pre-learned dVAE. Despite being a feasible solution, this improper discretization hinders further improvements of image pre-training. Since image discretization has no ground-truth answers, we argue that a masked patch should not be assigned a unique token id even if a better tokenizer can be obtained. In this work, we introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs the MIM proxy task towards eased and refined multi-choice training objectives. Specifically, the multi-choice supervision for the masked image patches is formed by the soft probability vectors over the discrete token ids, which are predicted by the off-the-shelf image tokenizer and further refined by high-level inter-patch perceptions, based on the observation that similar patches should share their choices. Extensive experiments on classification, segmentation, and detection tasks demonstrate the superiority of our method, e.g., the pre-trained ViT-B achieves 84.1% top-1 fine-tuning accuracy on ImageNet-1K classification, 49.2% AP^b and 44.0% AP^m for object detection and instance segmentation on COCO, and 50.8% mIoU on ADE20K semantic segmentation, outperforming competitive counterparts. The code will be available at https://github.com/lixiaotong97/mc-BEiT.
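To make the abstract's target-construction idea concrete, below is a minimal PyTorch sketch of how soft multi-choice MIM targets could be formed: tokenizer logits are softened into probability vectors over token ids and then refined by propagating them among similar patches. All names (multichoice_targets, mim_loss, tok_logits, patch_feats, tau, omega) and the specific mixing scheme are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def multichoice_targets(tok_logits, patch_feats, tau=1.0, omega=0.8):
    """Sketch of building soft multi-choice MIM targets (hypothetical names/shapes).

    tok_logits : (B, N, V) logits over the visual vocabulary from an
                 off-the-shelf image tokenizer (e.g., a pre-learned dVAE).
    patch_feats: (B, N, D) high-level patch features used to measure
                 inter-patch similarity.
    tau        : temperature softening the tokenizer predictions.
    omega      : weight balancing the tokenizer term and the refined term.
    """
    # Soft probability vectors over token ids instead of a hard one-hot id.
    p_tok = F.softmax(tok_logits / tau, dim=-1)                # (B, N, V)

    # Inter-patch affinities: similar patches should share their "choices".
    feats = F.normalize(patch_feats, dim=-1)
    sim = F.softmax(feats @ feats.transpose(1, 2), dim=-1)    # (B, N, N)

    # Propagate tokenizer predictions among similar patches.
    p_refined = sim @ p_tok                                    # (B, N, V)

    # Final multi-choice supervision: eased tokenizer term + refined term.
    return omega * p_tok + (1.0 - omega) * p_refined


def mim_loss(pred_logits, targets, mask):
    """Soft cross-entropy over masked patches only (mask: (B, N) bool)."""
    logp = F.log_softmax(pred_logits, dim=-1)
    loss = -(targets * logp).sum(-1)                           # (B, N)
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```

In this sketch, the MIM head predicts a distribution over the visual vocabulary for each masked patch, and the soft cross-entropy replaces the hard classification loss against a single token id.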