Paper Title
Auto-scaling Vision Transformers without Training
Paper Authors
Paper Abstract
This work targets automated design and scaling of Vision Transformers (ViTs). The motivation comes from two pain points: 1) the lack of efficient and principled methods for designing and scaling ViTs; 2) the tremendous computational cost of training ViTs, which is much heavier than that of their convolutional counterparts. To tackle these issues, we propose As-ViT, an auto-scaling framework for ViTs without training, which automatically discovers and scales up ViTs in an efficient and principled manner. Specifically, we first design a "seed" ViT topology by leveraging a training-free search process. This extremely fast search is enabled by a comprehensive study of ViT's network complexity, yielding a strong Kendall-tau correlation with ground-truth accuracies. Second, starting from the "seed" topology, we automate the scaling rule for ViTs by growing the widths/depths of different ViT layers. This results in a series of architectures with different numbers of parameters in a single run. Finally, based on the observation that ViTs can tolerate coarse tokenization in early training stages, we propose a progressive tokenization strategy to train ViTs faster and cheaper. As a unified framework, As-ViT achieves strong performance on classification (83.5% top-1 on ImageNet-1K) and detection (52.7% mAP on COCO) without any manual crafting or scaling of ViT architectures: the end-to-end model design and scaling process costs only 12 hours on one V100 GPU. Our code is available at https://github.com/VITA-Group/AsViT.
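To make the abstract's two key mechanisms concrete, below is a minimal Python/PyTorch sketch of training-free ranking of candidate topologies validated with Kendall-tau. The Jacobian-norm proxy, the toy candidate models, and the placeholder accuracies are all illustrative assumptions here; the paper's actual complexity measure is defined in the As-ViT repository.

```python
# A minimal sketch of training-free architecture ranking, assuming a
# Jacobian-norm complexity proxy at initialization (an assumption, not
# necessarily the paper's exact measure); candidates are toy stand-ins
# for ViT topologies.
import torch
import torch.nn as nn
from scipy.stats import kendalltau

def complexity_score(model: nn.Module, x: torch.Tensor) -> float:
    """Score a randomly initialized network by the norm of its
    input-output Jacobian on a random batch; no training involved."""
    x = x.clone().requires_grad_(True)
    y = model(x)
    jac = torch.autograd.grad(y.sum(), x)[0]
    return jac.norm().item()

# Toy candidate pool; a real search would enumerate ViT topologies.
candidates = [
    nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, w), nn.ReLU(), nn.Linear(w, 10))
    for w in (64, 128, 256, 512)
]
x = torch.randn(8, 3, 32, 32)
scores = [complexity_score(m, x) for m in candidates]

# Validate the proxy: Kendall-tau rank correlation against (placeholder)
# ground-truth accuracies of the same candidates after full training.
ground_truth_acc = [0.69, 0.71, 0.74, 0.76]  # placeholder values
tau, _ = kendalltau(scores, ground_truth_acc)
print(f"Kendall-tau between proxy and accuracy: {tau:.3f}")
```

Similarly, the progressive tokenization idea can be sketched as a schedule that coarsens the patch-embedding stride early in training and restores the native value later; the schedule values below are illustrative assumptions, not the paper's.

```python
# A minimal sketch of progressive tokenization: early epochs sample tokens
# with a larger stride (coarser, far fewer tokens, cheaper steps), then the
# stride shrinks back to the model's native value for the final phase.
def tokenization_stride(epoch: int, total_epochs: int, native_stride: int = 4) -> int:
    if epoch < total_epochs // 3:
        return native_stride * 4  # coarse: ~16x fewer tokens
    if epoch < 2 * total_epochs // 3:
        return native_stride * 2  # intermediate resolution
    return native_stride          # full-resolution tokens

# Example usage: the patch-embedding conv's stride would be updated per epoch,
# e.g. patch_embed.proj.stride = (s, s) in a PyTorch ViT implementation.
for epoch in range(9):
    print(epoch, tokenization_stride(epoch, total_epochs=9))
```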