Paper Title
Multi-Tailed Vision Transformer for Efficient Inference
Paper Authors
Paper Abstract
Recently, the Vision Transformer (ViT) has achieved promising performance in image recognition and gradually serves as a powerful backbone for various vision tasks. To satisfy the sequential input of the Transformer, the tail of ViT first splits each image into a sequence of visual tokens with a fixed length. The following self-attention layers then construct global relationships between tokens to produce useful representations for downstream tasks. Empirically, representing an image with more tokens leads to better performance, yet the quadratic computational complexity of self-attention with respect to the number of tokens can seriously hurt the efficiency of ViT's inference. To reduce computation, a few pruning methods progressively prune uninformative tokens inside the Transformer encoder, while leaving the number of tokens fed into the Transformer untouched. In fact, feeding fewer tokens into the Transformer encoder directly reduces the subsequent computational cost. In this spirit, we propose a Multi-Tailed Vision Transformer (MT-ViT) in this paper. MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder, and a tail predictor is introduced to decide which tail is the most efficient for producing an accurate prediction on a given image. Both modules are optimized in an end-to-end fashion with the Gumbel-Softmax trick. Experiments on ImageNet-1K demonstrate that MT-ViT achieves a significant reduction in FLOPs with no degradation in accuracy and outperforms the compared methods in both accuracy and FLOPs.
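To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of per-image tail selection with a straight-through Gumbel-Softmax. It is an illustration under assumptions, not the paper's implementation: the tail design (patch embeddings at different patch sizes), the `tail_predictor` architecture, the pooling, and all hyperparameters are hypothetical, and for simplicity the sketch runs every tail and masks the outputs, whereas an efficient inference path would run only the selected tail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTailedViT(nn.Module):
    """Minimal sketch of MT-ViT-style tail selection (assumed details).

    Each "tail" is a patch-embedding layer with a different patch size,
    so each produces a token sequence of a different length from the same
    image. A lightweight tail predictor scores the tails per image, and
    Gumbel-Softmax makes the discrete choice differentiable in training.
    """

    def __init__(self, patch_sizes=(32, 16), embed_dim=192, tau=1.0):
        super().__init__()
        self.tau = tau
        # One patch-embedding "tail" per candidate sequence length.
        self.tails = nn.ModuleList(
            nn.Conv2d(3, embed_dim, kernel_size=p, stride=p) for p in patch_sizes
        )
        # Hypothetical lightweight predictor: pooled image -> logits over tails.
        self.tail_predictor = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),
            nn.Flatten(),
            nn.Linear(3 * 8 * 8, len(patch_sizes)),
        )
        # Stand-in for the shared Transformer encoder and classifier head.
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=3, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, 1000)

    def forward(self, x):
        logits = self.tail_predictor(x)  # (B, num_tails)
        if self.training:
            # Straight-through Gumbel-Softmax: hard one-hot forward,
            # soft gradients backward, so both modules train end-to-end.
            choice = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        else:
            choice = F.one_hot(logits.argmax(-1), logits.size(-1)).float()
        # Run each tail and weight its prediction by the (one-hot) choice.
        # (An efficient deployment would run only the selected tail.)
        outs = []
        for tail in self.tails:
            tokens = tail(x).flatten(2).transpose(1, 2)  # (B, N_i, D)
            feats = self.encoder(tokens).mean(dim=1)     # mean-pool tokens
            outs.append(self.head(feats))
        outs = torch.stack(outs, dim=1)                  # (B, num_tails, 1000)
        return (choice.unsqueeze(-1) * outs).sum(dim=1)  # (B, 1000)

# Usage: shorter sequences (larger patches) cost fewer FLOPs, so the
# predictor can route "easy" images to the cheap tail.
model = MultiTailedViT().eval()
pred = model(torch.randn(2, 3, 224, 224))
print(pred.shape)  # torch.Size([2, 1000])
```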