TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers

Paper title

TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers

Authors

Jihao Liu, Boxiao Liu, Hang Zhou, Hongsheng Li, Yu Liu

Abstract

CutMix is a popular augmentation technique commonly used for training modern convolutional and transformer vision networks. It was originally designed to encourage Convolutional Neural Networks (CNNs) to focus more on an image's global context instead of local information, which greatly improves the performance of CNNs. However, we found it to have limited benefits for transformer-based architectures, which naturally have a global receptive field. In this paper, we propose a novel data augmentation technique, TokenMix, to improve the performance of vision transformers. TokenMix mixes two images at the token level by partitioning the mixing region into multiple separated parts. Besides, we show that the mixed learning target in CutMix, a linear combination of a pair of ground-truth labels, might be inaccurate and sometimes counter-intuitive. To obtain a more suitable target, we propose to assign the target score according to the content-based neural activation maps of the two images from a pre-trained teacher model, which does not need to have high performance. With extensive experiments on various vision transformer architectures, we show that our proposed TokenMix helps vision transformers focus on the foreground area to infer the classes and enhances their robustness to occlusion, with consistent performance gains. Notably, we improve DeiT-T/S/B by +1% ImageNet top-1 accuracy. Besides, TokenMix benefits from longer training, achieving 81.2% top-1 accuracy on ImageNet with DeiT-S trained for 400 epochs. Code is available at https://github.com/Sense-X/TokenMix.
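
The two ideas in the abstract (mixing at token level with a mask made of several separated parts, and deriving the target from activation maps rather than from the mixed area alone) can be illustrated with a small NumPy sketch. This is not the authors' implementation; their code is at the URL above. The function names `sample_block_mask`, `token_mix`, and `activation_target`, the block-sampling strategy, the `max_block` parameter, and the choice not to renormalize the target are all assumptions made for illustration, and the activation maps are taken as given arrays rather than computed by a teacher model.

```python
import numpy as np


def sample_block_mask(grid, ratio, rng, max_block=3):
    """Mark roughly `ratio` of a grid x grid token map using several small blocks.

    Assumed sampling scheme: keep dropping random rectangles of at most
    max_block x max_block tokens until the requested share is covered.
    """
    mask = np.zeros((grid, grid), dtype=bool)
    target = int(round(ratio * grid * grid))
    while mask.sum() < target:
        h = rng.integers(1, max_block + 1)
        w = rng.integers(1, max_block + 1)
        top = rng.integers(0, grid - h + 1)
        left = rng.integers(0, grid - w + 1)
        mask[top:top + h, left:left + w] = True
    return mask


def token_mix(img_a, img_b, mask, patch):
    """Take img_b's patches where the token mask is True, img_a's elsewhere.

    Images are H x W x C arrays with H = W = grid * patch.
    """
    # Expand the token-level mask to pixel resolution.
    pixel_mask = mask.repeat(patch, axis=0).repeat(patch, axis=1)
    return np.where(pixel_mask[..., None], img_b, img_a)


def activation_target(y_a, y_b, act_a, act_b, mask):
    """Weight each label by the share of its image's activation that survives.

    act_a and act_b are non-negative token-level activation maps (assumed to
    come from a pre-trained teacher). Each is normalized to sum to 1, so the
    weights measure how much of each image's salient content is still visible.
    The result is not renormalized (an assumption of this sketch).
    """
    act_a = act_a / act_a.sum()
    act_b = act_b / act_b.sum()
    w_a = act_a[~mask].sum()  # tokens kept from image A
    w_b = act_b[mask].sum()   # tokens pasted from image B
    return w_a * y_a + w_b * y_b


# Example usage with random stand-in data.
rng = np.random.default_rng(0)
mask = sample_block_mask(grid=14, ratio=0.5, rng=rng)
img_a = rng.random((224, 224, 3))
img_b = rng.random((224, 224, 3))
mixed = token_mix(img_a, img_b, mask, patch=16)
y_a, y_b = np.eye(10)[3], np.eye(10)[7]  # one-hot labels
target = activation_target(y_a, y_b, rng.random((14, 14)), rng.random((14, 14)), mask)
```

With uniform activation maps the target in this sketch reduces to the area-based linear combination used by CutMix, which makes the difference easy to see: the weights move away from the area ratio only when the visible tokens carry more or less of an image's activation than their share of the area.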
