Paper Title
PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers
Paper Authors
Paper Abstract
Vision Transformers (ViTs) are built on the assumption of treating image patches as "visual tokens" and learning patch-to-patch attention. The patch-embedding-based tokenizer has a semantic gap with respect to its counterpart, the textual tokenizer. Patch-to-patch attention suffers from quadratic complexity and also makes it non-trivial to explain learned ViTs. To address these issues, this paper proposes to learn Patch-to-Cluster attention (PaCa) in ViT. Queries in our PaCa-ViT start with patches, while keys and values are directly based on clustering (with a predefined small number of clusters). The clusters are learned end-to-end, leading to better tokenizers and inducing joint clustering-for-attention and attention-for-clustering for better and more interpretable models. The quadratic complexity is relaxed to linear complexity. The proposed PaCa module is used in designing efficient and interpretable ViT backbones and semantic segmentation head networks. In experiments, the proposed methods are tested on ImageNet-1k image classification, MS-COCO object detection and instance segmentation, and MIT-ADE20k semantic segmentation. Compared with the prior art, they obtain better performance than Swin and PVT on all three benchmarks, with significant margins on ImageNet-1k and MIT-ADE20k. They are also significantly more efficient than PVT models on MS-COCO and MIT-ADE20k thanks to the linear complexity. The learned clusters are semantically meaningful. Code and model checkpoints are available at https://github.com/iVMCL/PaCaViT.
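To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of patch-to-cluster attention: queries come from the N patch tokens while keys and values come from M << N cluster tokens, so the attention cost drops from O(N^2) to O(N*M). The clustering head (`to_assign`), the module name `PaCaAttention`, and all hyperparameters here are illustrative assumptions for exposition, not the authors' exact implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class PaCaAttention(nn.Module):
    """Minimal sketch of patch-to-cluster attention.

    Queries are the N patch tokens; keys/values are M << N cluster
    tokens pooled from the patches, so attention costs O(N*M) rather
    than O(N^2). The clustering head below (linear layer + softmax)
    is a simplified stand-in for the paper's learned clustering module.
    """

    def __init__(self, dim: int, num_clusters: int = 49, num_heads: int = 8):
        super().__init__()
        # Hypothetical lightweight clustering head: soft-assigns each
        # patch to one of `num_clusters` clusters.
        self.to_assign = nn.Linear(dim, num_clusters)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor):
        # x: (B, N, dim) patch tokens.
        # Soft assignment C: (B, N, M); softmax over N so each cluster's
        # pooling weights over the patches sum to 1.
        c = self.to_assign(x).softmax(dim=1)
        # Cluster tokens Z = C^T X: (B, M, dim), a weighted pooling of patches.
        z = torch.einsum('bnm,bnd->bmd', c, x)
        # Cross-attention: every patch attends to the M cluster tokens only.
        out, attn_weights = self.attn(query=x, key=z, value=z)
        return out, attn_weights  # attn_weights: (B, N, M)


# Usage: 196 patch tokens attend to 49 cluster tokens.
x = torch.randn(2, 196, 256)
paca = PaCaAttention(dim=256, num_clusters=49)
y, w = paca(x)
print(y.shape, w.shape)  # torch.Size([2, 196, 256]) torch.Size([2, 196, 49])
```

Because the number of clusters is fixed, doubling the number of patches doubles the attention cost instead of quadrupling it, which is what makes this formulation attractive for dense prediction tasks such as detection and segmentation; the (B, N, M) attention map also directly visualizes which cluster each patch attends to, supporting the interpretability claim.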