Paper Title

Demystify Transformers & Convolutions in Modern Image Deep Networks

Paper Authors

Xiaowei Hu, Min Shi, Weiyun Wang, Sitong Wu, Linjie Xing, Wenhai Wang, Xizhou Zhu, Lewei Lu, Jie Zhou, Xiaogang Wang, Yu Qiao, Jifeng Dai

Paper Abstract

Vision transformers have gained popularity recently, leading to the development of new vision backbones with improved features and consistent performance gains. However, these advancements are not solely attributable to novel feature transformation designs; certain benefits also arise from advanced network-level and block-level architectures. This paper aims to identify the real gains of popular convolution and attention operators through a detailed study. We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach, known as the "spatial token mixer" (STM). To facilitate an impartial comparison, we introduce a unified architecture to neutralize the impact of divergent network-level and block-level designs. Subsequently, various STMs are integrated into this unified framework for comprehensive comparative analysis. Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs, but performance differences persist among different STMs. Our detailed analysis also reveals various findings about different STMs, including effective receptive fields, invariance, and adversarial robustness tests.
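To make the abstract's "unified architecture" idea concrete, below is a minimal PyTorch sketch of the comparison setup it describes: the block-level design (pre-norm residual structure plus a channel MLP) is held fixed, and only the spatial token mixer (STM) is swapped between a convolution-style and an attention-style module. All class names (`UnifiedBlock`, `DepthwiseConvSTM`, `SelfAttentionSTM`) and hyperparameters here are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the paper's setup: only the STM varies,
# while the surrounding block design stays identical.

class DepthwiseConvSTM(nn.Module):
    """Convolution-style STM: aggregates spatial features with a depthwise conv."""
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)

    def forward(self, x):  # x: (B, C, H, W)
        return self.dw(x)

class SelfAttentionSTM(nn.Module):
    """Attention-style STM: aggregates spatial features with global self-attention."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)  # global token mixing
        return out.transpose(1, 2).reshape(b, c, h, w)

class UnifiedBlock(nn.Module):
    """Fixed block-level design: pre-norm STM + residual, then channel MLP + residual."""
    def __init__(self, dim, stm: nn.Module, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)  # LayerNorm over channels for (B, C, H, W)
        self.stm = stm
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x):
        x = x + self.stm(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

# Swapping STMs under an identical block design isolates the STM's contribution.
x = torch.randn(1, 64, 14, 14)
conv_block = UnifiedBlock(64, DepthwiseConvSTM(64))
attn_block = UnifiedBlock(64, SelfAttentionSTM(64))
print(conv_block(x).shape, attn_block(x).shape)  # both: torch.Size([1, 64, 14, 14])
```

Under this kind of setup, any performance gap between the two blocks can be attributed to the STM itself rather than to divergent network-level or block-level design choices, which is the controlled comparison the paper sets out to make.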
