Paper Title
Auto-scaling Vision Transformers without Training
Paper Authors
Paper Abstract
This work targets automated design and scaling of Vision Transformers (ViTs). The motivation comes from two pain points: 1) the lack of efficient and principled methods for designing and scaling ViTs; 2) the tremendous computational cost of training ViTs, which is much heavier than that of their convolutional counterparts. To tackle these issues, we propose As-ViT, an auto-scaling framework for ViTs without training, which automatically discovers and scales up ViTs in an efficient and principled manner. Specifically, we first design a "seed" ViT topology by leveraging a training-free search process. This extremely fast search is enabled by a comprehensive study of ViT's network complexity, yielding a strong Kendall-tau correlation with ground-truth accuracies. Second, starting from the "seed" topology, we automate the scaling rule for ViTs by growing the widths/depths of different ViT layers. This results in a series of architectures with different numbers of parameters in a single run. Finally, based on the observation that ViTs can tolerate coarse tokenization in early training stages, we propose a progressive tokenization strategy to train ViTs faster and cheaper. As a unified framework, As-ViT achieves strong performance on classification (83.5% top-1 on ImageNet-1K) and detection (52.7% mAP on COCO) without any manual crafting or scaling of ViT architectures: the end-to-end model design and scaling process costs only 12 hours on one V100 GPU. Our code is available at https://github.com/VITA-Group/AsViT.
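To make the abstract's two key mechanisms concrete, below is a minimal Python/PyTorch sketch of training-free ranking of candidate topologies validated with Kendall-tau. The Jacobian-norm proxy, the toy candidate models, and the placeholder accuracies are all illustrative assumptions here; the paper's actual complexity measure is defined in the As-ViT repository.

```python
# A minimal sketch of training-free architecture ranking, assuming a
# Jacobian-norm complexity proxy at initialization (an assumption, not
# necessarily the paper's exact measure); candidates are toy stand-ins
# for ViT topologies.
import torch
import torch.nn as nn
from scipy.stats import kendalltau

def complexity_score(model: nn.Module, x: torch.Tensor) -> float:
    """Score a randomly initialized network by the norm of its
    input-output Jacobian on a random batch; no training involved."""
    x = x.clone().requires_grad_(True)
    y = model(x)
    jac = torch.autograd.grad(y.sum(), x)[0]
    return jac.norm().item()

# Toy candidate pool; a real search would enumerate ViT topologies.
candidates = [
    nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, w), nn.ReLU(), nn.Linear(w, 10))
    for w in (64, 128, 256, 512)
]
x = torch.randn(8, 3, 32, 32)
scores = [complexity_score(m, x) for m in candidates]

# Validate the proxy: Kendall-tau rank correlation against (placeholder)
# ground-truth accuracies of the same candidates after full training.
ground_truth_acc = [0.69, 0.71, 0.74, 0.76]  # placeholder values
tau, _ = kendalltau(scores, ground_truth_acc)
print(f"Kendall-tau between proxy and accuracy: {tau:.3f}")
```

Similarly, the progressive tokenization idea can be sketched as a schedule that coarsens the patch-embedding stride early in training and restores the native value later; the schedule values below are illustrative assumptions, not the paper's.

```python
# A minimal sketch of progressive tokenization: early epochs sample tokens
# with a larger stride (coarser, far fewer tokens, cheaper steps), then the
# stride shrinks back to the model's native value for the final phase.
def tokenization_stride(epoch: int, total_epochs: int, native_stride: int = 4) -> int:
    if epoch < total_epochs // 3:
        return native_stride * 4  # coarse: ~16x fewer tokens
    if epoch < 2 * total_epochs // 3:
        return native_stride * 2  # intermediate resolution
    return native_stride          # full-resolution tokens

# Example usage: the patch-embedding conv's stride would be updated per epoch,
# e.g. patch_embed.proj.stride = (s, s) in a PyTorch ViT implementation.
for epoch in range(9):
    print(epoch, tokenization_stride(epoch, total_epochs=9))
```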