Paper Title


Object-wise Masked Autoencoders for Fast Pre-training

Authors

Jiantao Wu, Shentong Mo

Abstract


Self-supervised pre-training on unlabeled images has recently achieved promising performance in image classification. The success of transformer-based methods, ViT and MAE, has drawn the community's attention to the design of backbone architectures and self-supervised tasks. In this work, we show that current masked image encoding models learn the underlying relationship between all objects in the whole scene, rather than a single object representation. As a result, these methods incur substantial compute time during self-supervised pre-training. To address this issue, we introduce a novel object selection and division strategy that drops non-object patches, learning object-wise representations via selective reconstruction with region-of-interest masks. We refer to this method as ObjMAE. Extensive experiments on four commonly-used datasets demonstrate the effectiveness of our model in reducing the compute cost by 72% while achieving competitive performance. Furthermore, we investigate the inter-object and intra-object relationships and find that the latter is crucial for self-supervised pre-training.
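To make the patch-dropping idea concrete, here is a minimal sketch of how a region-of-interest mask might be used to keep only object patches before MAE-style random masking. This is an illustrative reconstruction, not the authors' implementation: the function names (`select_object_patches`, `random_mask`), the coverage threshold, and the toy object mask are all assumptions for demonstration.

```python
import numpy as np

def select_object_patches(obj_mask, patch_size, keep_thresh=0.0):
    """Return flattened indices of patches overlapping the object region.

    obj_mask: (H, W) binary array marking the region of interest.
    A patch is kept if its fraction of object pixels exceeds keep_thresh.
    """
    H, W = obj_mask.shape
    gh, gw = H // patch_size, W // patch_size
    # Reshape into a (gh, gw, patch_size, patch_size) grid of patches.
    grid = obj_mask[:gh * patch_size, :gw * patch_size].reshape(
        gh, patch_size, gw, patch_size).transpose(0, 2, 1, 3)
    frac = grid.mean(axis=(2, 3))              # object coverage per patch
    return np.flatnonzero(frac > keep_thresh)  # flattened patch indices

def random_mask(indices, mask_ratio, rng):
    """Split object patches into visible and masked subsets (MAE-style)."""
    perm = rng.permutation(len(indices))
    n_masked = int(len(indices) * mask_ratio)
    return indices[perm[n_masked:]], indices[perm[:n_masked]]

# Toy example: a 64x64 image, 8x8 patches, object in the top-left quadrant.
mask = np.zeros((64, 64), dtype=np.uint8)
mask[:32, :32] = 1
obj_idx = select_object_patches(mask, patch_size=8)
visible, masked = random_mask(obj_idx, mask_ratio=0.75,
                              rng=np.random.default_rng(0))
print(len(obj_idx), len(visible), len(masked))  # 16 object patches of 64
```

In this toy setting the encoder sees only the object's 16 patches instead of all 64, and reconstruction targets are drawn from those object patches alone, which is the source of the compute savings the abstract reports.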
