Paper Title
Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training
Paper Authors
Paper Abstract
Vision transformers (ViTs) have recently achieved success in many applications, but their intensive computation and heavy memory usage at both training and inference time limit their generalization. Previous compression algorithms usually start from pre-trained dense models and focus only on efficient inference, so the time-consuming training is still unavoidable. In contrast, this paper points out that the million-scale training data is redundant, which is the fundamental reason for the tedious training. To address the issue, this paper aims to introduce sparsity into the data and proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT. Specifically, we leverage a hierarchical data redundancy reduction scheme by exploring sparsity at three levels: the number of training examples in the dataset, the number of patches (tokens) in each example, and the number of connections between tokens in the attention weights. With extensive experiments, we demonstrate that our proposed technique can noticeably accelerate training for various ViT architectures while maintaining accuracy. Remarkably, under certain ratios, we are able to improve ViT accuracy rather than compromise it. For example, we can achieve 15.2% speedup with 72.6% (+0.4) Top-1 accuracy on DeiT-T, and 15.7% speedup with 79.9% (+0.1) Top-1 accuracy on DeiT-S. This proves the existence of data redundancy in ViTs.
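To make the three levels concrete, below is a minimal, self-contained PyTorch sketch of what example-level, token-level, and attention-level sparsity could look like. The keep ratios, the norm-based token score, the top-k attention mask, and all function names are illustrative assumptions for exposition only; they are not the paper's actual Tri-Level E-ViT selection rules.

    # Illustrative sketch only: the scoring rules and keep ratios below are
    # hypothetical stand-ins, not the paper's Tri-Level E-ViT algorithm.
    import torch


    def select_examples(dataset_size: int, keep_ratio: float = 0.9) -> torch.Tensor:
        """Level 1: keep a random subset of training-example indices."""
        n_keep = int(dataset_size * keep_ratio)
        return torch.randperm(dataset_size)[:n_keep]


    def select_tokens(tokens: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
        """Level 2: keep the most salient patch tokens per example.

        tokens: (batch, num_tokens, dim). Saliency here is simply the token
        L2 norm, a placeholder for whatever importance score is actually used.
        """
        b, n, d = tokens.shape
        n_keep = max(1, int(n * keep_ratio))
        scores = tokens.norm(dim=-1)                      # (b, n)
        idx = scores.topk(n_keep, dim=1).indices          # (b, n_keep)
        return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))


    def sparse_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                         keep_ratio: float = 0.5) -> torch.Tensor:
        """Level 3: keep only the top-k attention connections per query token."""
        scale = q.shape[-1] ** -0.5
        attn = (q @ k.transpose(-2, -1)) * scale          # (b, n, n) logits
        n_keep = max(1, int(attn.shape[-1] * keep_ratio))
        topk = attn.topk(n_keep, dim=-1)
        masked = torch.full_like(attn, float("-inf"))
        masked.scatter_(-1, topk.indices, topk.values)    # keep strongest links
        return masked.softmax(dim=-1) @ v                 # pruned links get weight 0


    if __name__ == "__main__":
        x = torch.randn(2, 196, 64)                       # 14x14 patches, dim 64
        kept = select_tokens(x)                           # (2, 137, 64)
        out = sparse_attention(kept, kept, kept)          # (2, 137, 64)
        print(select_examples(1000).shape, kept.shape, out.shape)

In this reading, each level shrinks one axis of the training computation: fewer examples per epoch, fewer tokens per example, and fewer non-zero attention connections per token.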