Paper Title
EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm
Paper Authors
Paper Abstract
Motivated by biological evolution, this paper explains the rationality of Vision Transformer by analogy with the proven practical evolutionary algorithm (EA) and derives that both have consistent mathematical formulations. Then, inspired by effective EA variants, we propose a novel pyramid EATFormer backbone that only contains the proposed EA-based Transformer (EAT) block, which consists of three residual parts, i.e., Multi-Scale Region Aggregation, Global and Local Interaction, and Feed-Forward Network modules, to model multi-scale, interactive, and individual information separately. Moreover, we design a task-related head docked with the transformer backbone to complete final information fusion more flexibly, and we improve a Modulated Deformable MSA to dynamically model irregular locations. Massive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory experiments demonstrate the effectiveness and superiority of our approach over state-of-the-art methods. For example, our Mobile (1.8 M), Tiny (6.1 M), Small (24.3 M), and Base (49.0 M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; EATFormer-Tiny/Small/Base armed with Mask R-CNN obtain 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing the contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP respectively with fewer FLOPs; our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K with UperNet, exceeding Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer.
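The abstract states that the EAT block composes three residual sub-modules in sequence (Multi-Scale Region Aggregation, Global and Local Interaction, and a Feed-Forward Network). A minimal sketch of that residual composition is shown below; the module internals here are toy placeholders, not the paper's actual implementation, and all function names and shapes are assumptions for illustration only:

```python
import numpy as np

def msra(x):
    # Placeholder for Multi-Scale Region Aggregation (real module aggregates
    # features over multiple spatial scales); here just a linear scaling.
    return 0.5 * x

def gli(x):
    # Placeholder for Global and Local Interaction (real module mixes global
    # attention with local operations); here just a linear scaling.
    return 0.5 * x

def ffn(x):
    # Placeholder Feed-Forward Network: a simple ReLU-style nonlinearity.
    return np.maximum(x, 0.0) * 0.5

def eat_block(x):
    # Each sub-module is wrapped in a residual connection, mirroring the
    # "three residual parts" structure described in the abstract.
    x = x + msra(x)
    x = x + gli(x)
    x = x + ffn(x)
    return x

tokens = np.ones((4, 8))   # toy (num_tokens, dim) input
out = eat_block(tokens)
print(out.shape)           # shape is preserved through the block: (4, 8)
```

The point of the sketch is only the dataflow: each stage adds its output back onto its input, so the block preserves token shape and each module refines rather than replaces the representation.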