Paper Title

Beyond Fixation: Dynamic Window Visual Transformer

Paper Authors

Pengzhen Ren, Changlin Li, Guangrun Wang, Yun Xiao, Qing Du, Xiaodan Liang, Xiaojun Chang

Paper Abstract

Recently, a surge of interest in visual transformers has focused on reducing the computational cost by limiting the calculation of self-attention to a local window. Most current work uses a fixed single-scale window for modeling by default, ignoring the impact of window size on model performance. However, this may limit the modeling potential of these window-based models for multi-scale information. In this paper, we propose a novel method named Dynamic Window Vision Transformer (DW-ViT). The dynamic window strategy proposed by DW-ViT goes beyond models that employ a fixed single-window setting. To the best of our knowledge, we are the first to use dynamic multi-scale windows to explore the upper limit of the effect of window settings on model performance. In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window multi-head self-attention. The information is then dynamically fused by assigning different weights to the multi-scale window branches. We conducted a detailed performance evaluation on three datasets: ImageNet-1K, ADE20K, and COCO. Compared with related state-of-the-art (SoTA) methods, DW-ViT obtains the best performance. Specifically, compared with the current SoTA Swin Transformer \cite{liu2021swin}, DW-ViT achieves consistent and substantial improvements on all three datasets with similar parameters and computational cost. In addition, DW-ViT exhibits good scalability and can be easily inserted into any window-based visual transformer.
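The two mechanisms named in the abstract, head groups attending within windows of different sizes and dynamic fusion of the branch outputs, can be illustrated compactly. The following is a minimal PyTorch sketch, not the authors' code: the module name `DynamicWindowAttention`, the linear gating layer, and the scale-then-concatenate fusion are assumptions made for illustration, and components such as relative position bias, shifted windows, and DW-ViT's exact fusion design are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicWindowAttention(nn.Module):
    """Hypothetical sketch of multi-scale window attention with dynamic
    fusion: heads are split into groups, each group attends within
    non-overlapping windows of a different size, and a small gate
    produces one fusion weight per branch from the pooled input."""

    def __init__(self, dim, num_heads=8, window_sizes=(7, 14)):
        super().__init__()
        assert num_heads % len(window_sizes) == 0
        assert dim % num_heads == 0
        self.window_sizes = window_sizes
        self.num_heads = num_heads
        self.heads_per_branch = num_heads // len(window_sizes)
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Gate: globally pooled features -> one logit per window branch.
        self.gate = nn.Linear(dim, len(window_sizes))

    def _window_attn(self, q, k, v, ws):
        # q, k, v: (B, H, W, h, d). Attend inside ws x ws windows.
        B, H, W, h, d = q.shape

        def part(t):  # partition into windows -> (B * num_windows, h, ws*ws, d)
            t = t.reshape(B, H // ws, ws, W // ws, ws, h, d)
            return t.permute(0, 1, 3, 5, 2, 4, 6).reshape(-1, h, ws * ws, d)

        out = F.scaled_dot_product_attention(part(q), part(k), part(v))
        # Undo the window partition back to (B, H, W, h * d).
        out = out.view(B, H // ws, W // ws, h, ws, ws, d)
        out = out.permute(0, 1, 4, 2, 5, 3, 6).reshape(B, H, W, h * d)
        return out

    def forward(self, x):
        # x: (B, H, W, C); H and W must be divisible by every window size.
        B, H, W, C = x.shape
        qkv = self.qkv(x).view(B, H, W, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=3)  # each: (B, H, W, num_heads, head_dim)

        branches = []
        for i, ws in enumerate(self.window_sizes):
            # Each branch gets its own contiguous group of heads.
            s = slice(i * self.heads_per_branch, (i + 1) * self.heads_per_branch)
            branches.append(
                self._window_attn(q[..., s, :], k[..., s, :], v[..., s, :], ws))

        # Dynamic fusion: one weight per branch, then merge along channels.
        weights = self.gate(x.mean(dim=(1, 2))).softmax(dim=-1)  # (B, n_branches)
        fused = torch.cat([w.view(B, 1, 1, 1) * o
                           for w, o in zip(weights.unbind(-1), branches)], dim=-1)
        return self.proj(fused)


if __name__ == "__main__":
    # Smoke test: a 28x28 map is divisible by both window sizes (7 and 14).
    attn = DynamicWindowAttention(dim=96, num_heads=8, window_sizes=(7, 14))
    y = attn(torch.randn(2, 28, 28, 96))
    print(y.shape)  # torch.Size([2, 28, 28, 96])
```

Note that the sketch keeps the total head count and channel width fixed while varying only the window size per head group, which is consistent with the abstract's claim of similar parameters and computational cost relative to a single-window baseline.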
