Paper Title

Vision Transformer: ViT and its Derivatives

Paper Author

Fu, Zujun

Paper Abstract

Transformer, an attention-based encoder-decoder architecture, has not only revolutionized the field of natural language processing (NLP) but has also done pioneering work in computer vision (CV). Compared with convolutional neural networks (CNNs), the Vision Transformer (ViT) relies on its strong modeling capacity to achieve very good performance on several benchmarks such as ImageNet, COCO, and ADE20K. ViT is inspired by the self-attention mechanism in natural language processing, where word embeddings are replaced with patch embeddings. This paper reviews the derivatives of ViT and the cross-applications of ViT with other fields.
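To make the abstract's point that "word embeddings are replaced with patch embeddings" concrete, the following is a minimal PyTorch sketch of a patch-embedding layer, assuming the standard ViT-Base configuration (224x224 RGB input, 16x16 patches, 768-dimensional embeddings). The class name PatchEmbedding and the concrete sizes are illustrative assumptions, not code taken from the paper.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly project each one,
    so patches play the role that word embeddings play in NLP."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel size = stride = patch size is equivalent to
        # flattening each non-overlapping patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)                  # (batch, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)
        return x

# Example: a 224x224 RGB image becomes a sequence of 196 patch tokens.
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])

The resulting token sequence is what the Transformer encoder consumes, exactly as it would consume a sequence of word embeddings in NLP.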
