Paper Title

Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition

Paper Authors

Qibin Hou, Cheng-Ze Lu, Ming-Ming Cheng, Jiashi Feng

Paper Abstract

This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features. By comparing the design principles of recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation. We show that such a simple approach can better take advantage of the large kernels (>= 7x7) nested in convolutional layers. We build a family of hierarchical ConvNets using the proposed convolutional modulation, termed Conv2Former. Our network is simple and easy to follow. Experiments show that our Conv2Former outperforms existing popular ConvNets and Vision Transformers, such as Swin Transformer and ConvNeXt, on ImageNet classification, COCO object detection, and ADE20K semantic segmentation.
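The core idea the abstract describes, replacing the quadratic self-attention matrix with a large-kernel depthwise convolution whose output modulates (elementwise gates) a value branch, can be illustrated with a minimal sketch. The toy below is single-channel pure Python under assumptions of ours: the real Conv2Former applies learned linear projections before and after, and runs multi-channel depthwise convolutions; `conv_modulation` and its identity value branch are simplifications for illustration only.

```python
# Hypothetical minimal sketch of convolutional modulation:
# modulation weights A come from a large-kernel depthwise conv,
# and the output is A ⊙ V (elementwise product with a value branch).
# Single-channel toy; the real model adds linear projections.

def depthwise_conv2d(x, kernel):
    """'Same'-padded 2D convolution of one channel (list of lists)."""
    h, w = len(x), len(x[0])
    k = len(kernel)          # kernel is k x k, k odd (e.g. 7, 11)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - pad, j + dj - pad
                    if 0 <= ii < h and 0 <= jj < w:  # zero padding
                        s += x[ii][jj] * kernel[di][dj]
            out[i][j] = s
    return out

def conv_modulation(x, kernel):
    """A = DWConv(x); V = x (projection omitted); output = A ⊙ V."""
    a = depthwise_conv2d(x, kernel)
    return [[a[i][j] * x[i][j] for j in range(len(x[0]))]
            for i in range(len(x))]
```

Unlike self-attention, whose cost grows quadratically with the number of pixels, the modulation map here is produced by a convolution whose cost is linear in spatial size, which is why larger kernels (>= 7x7) become affordable.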
