Paper Title

FlexiViT: One Model for All Patch Sizes

Paper Authors

Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, Filip Pavetic

Paper Abstract

Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision.
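
The recipe the abstract describes is simple: keep one set of underlying patch-embedding weights and, at each training step, sample a patch size and resize the embedding kernel to match, leaving the rest of the ViT unchanged. Below is a minimal JAX sketch of that idea, not the authors' implementation: the patch-size grid and the names `base_kernel`, `resize_kernel`, and `embed_patches` are illustrative assumptions, and plain bilinear resizing stands in for the paper's pseudo-inverse ("PI") resize of the embedding weights.

```python
# Minimal sketch (assumptions, not the authors' implementation) of
# FlexiViT-style training: one underlying patch-embedding kernel,
# resized on the fly to a patch size sampled at each training step.
import jax
import jax.numpy as jnp

# Assumed grid of patch sizes; each divides an assumed 240x240 image.
PATCH_SIZES = (8, 12, 16, 20, 24, 30, 40, 48)
BASE = 32  # assumed native resolution of the stored kernel

# Learnable patch-embedding kernel, stored once at the base resolution:
# shape (BASE, BASE, in_channels, embed_dim).
base_kernel = jnp.zeros((BASE, BASE, 3, 768))

def resize_kernel(kernel, p):
    """Resize the embedding kernel to patch size p.

    Bilinear resizing is shown for simplicity; the paper resizes with a
    pseudo-inverse ("PI-resize") so token embeddings stay consistent
    across patch sizes.
    """
    return jax.image.resize(
        kernel, (p, p, kernel.shape[2], kernel.shape[3]), method="bilinear")

def embed_patches(images, kernel, p):
    """Slice NHWC images into non-overlapping p x p patches and project
    each patch to an embedding vector (a strided conv does both at once)."""
    tokens = jax.lax.conv_general_dilated(
        images, kernel,
        window_strides=(p, p), padding="VALID",
        dimension_numbers=("NHWC", "HWIO", "NHWC"))
    b, h, w, d = tokens.shape
    return tokens.reshape(b, h * w, d)

def flexible_tokens(rng, images):
    # Randomize the patch size every step; the ViT encoder that consumes
    # the tokens is unchanged, only the sequence length varies.
    p = PATCH_SIZES[int(jax.random.randint(rng, (), 0, len(PATCH_SIZES)))]
    kernel = resize_kernel(base_kernel, p)
    return embed_patches(images, kernel, p)
```

Because the token sequence length changes with the patch size, each patch size corresponds to its own compiled shape; at deployment time one simply fixes p to whatever the available compute budget allows and runs the same weights.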
