Paper Title
Mugs: A Multi-Granular Self-Supervised Learning Framework
Paper Authors
Paper Abstract
In self-supervised learning, multi-granular features are highly desirable yet rarely investigated, because different downstream tasks (e.g., general and fine-grained classification) often require features at different or multiple granularities, e.g., fine-grained, coarse-grained, or a mixture of both. In this work, for the first time, we propose an effective MUlti-Granular Self-supervised learning (Mugs) framework to explicitly learn multi-granular visual features. Mugs has three complementary granular supervisions: 1) an instance discrimination supervision (IDS), 2) a novel local-group discrimination supervision (LGDS), and 3) a group discrimination supervision (GDS). IDS distinguishes different instances to learn instance-level fine-grained features. LGDS aggregates the features of an image and its neighbors into a local-group feature, pulls together the local-group features from different crops of the same image, and pushes them away from the local-group features of other images. It provides instance supervision complementary to IDS through an extra alignment on local neighbors, and it scatters different local groups apart to increase discriminability. Accordingly, it helps learn high-level fine-grained features at the local-group level. Finally, to prevent similar local groups from being scattered randomly or far apart, GDS brings similar samples close and thus pulls similar local groups together, capturing coarse-grained features at the (semantic) group level. Consequently, Mugs captures features at three granularities, which often generalize better to diverse downstream tasks than single-granular features, e.g., the instance-level fine-grained features of contrastive learning. With pretraining on ImageNet-1K only, Mugs sets a new SoTA linear probing accuracy of 82.1% on ImageNet-1K, improving on the previous SoTA by 1.1%. It also surpasses SoTAs on other tasks, e.g., transfer learning, detection, and segmentation.
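The three supervisions described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and every name in it (`info_nce`, `local_group_features`, `group_loss`, `mugs_style_loss`) is hypothetical. It assumes an InfoNCE loss for IDS and LGDS, plain averaging over the k nearest neighbors in a memory bank as a stand-in for the local-group aggregation, a soft assignment to learnable prototypes as a stand-in for GDS, and equal weighting of the three terms; the abstract specifies none of these details.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(queries, keys, temperature=0.2):
    """InfoNCE loss: the i-th query's positive is the i-th key (assumed form)."""
    logits = l2_normalize(queries) @ l2_normalize(keys).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def local_group_features(feats, bank, k=3):
    """Average each feature with its k nearest neighbours from a memory bank.

    Averaging is an illustrative assumption, not the paper's aggregation.
    """
    f = l2_normalize(feats)
    b = l2_normalize(bank)
    idx = np.argsort(-(f @ b.T), axis=1)[:, :k]  # (n, k) most similar entries
    return (f + b[idx].sum(axis=1)) / (k + 1)    # (n, d) local-group features

def group_loss(feats_a, feats_b, prototypes, temperature=0.1):
    """Cross-entropy between the soft prototype assignments of two views."""
    def assign(x):
        logits = l2_normalize(x) @ l2_normalize(prototypes).T / temperature
        logits = logits - logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)
    target, pred = assign(feats_b), assign(feats_a)
    return float(-np.mean(np.sum(target * np.log(pred + 1e-12), axis=1)))

def mugs_style_loss(view_a, view_b, bank, prototypes):
    """Sum the three granular terms with equal weights (an assumption)."""
    ids = info_nce(view_a, view_b)                       # instance level
    lgds = info_nce(local_group_features(view_a, bank),  # local-group level
                    local_group_features(view_b, bank))
    gds = group_loss(view_a, view_b, prototypes)         # semantic-group level
    return {"ids": ids, "lgds": lgds, "gds": gds, "total": ids + lgds + gds}
```

Here `view_a` and `view_b` would hold the features of two crops of the same batch of images, `bank` a set of stored features from other images, and `prototypes` a set of cluster centers; in this sketch all four are plain arrays with the same feature dimension.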