Paper Title

Self-Attentive Pooling for Efficient Deep Learning

Paper Authors

Fang Chen, Gourav Datta, Souvik Kundu, Peter Beerel

Paper Abstract

Efficient custom pooling techniques that can aggressively trim the dimensions of a feature map and thereby reduce inference compute and memory footprint for resource-constrained computer vision applications have recently gained significant traction. However, prior pooling works extract only the local context of the activation maps, limiting their effectiveness. In contrast, we propose a novel non-local self-attentive pooling method that can be used as a drop-in replacement to the standard pooling layers, such as max/average pooling or strided convolution. The proposed self-attention module uses patch embedding, multi-head self-attention, and spatial-channel restoration, followed by sigmoid activation and exponential soft-max. This self-attention mechanism efficiently aggregates dependencies between non-local activation patches during down-sampling. Extensive experiments on standard object classification and detection tasks with various convolutional neural network (CNN) architectures demonstrate the superiority of our proposed mechanism over the state-of-the-art (SOTA) pooling techniques. In particular, we surpass the test accuracy of existing pooling techniques on different variants of MobileNet-V2 on ImageNet by an average of 1.2%. With the aggressive down-sampling of the activation maps in the initial layers (providing up to 22x reduction in memory consumption), our approach achieves 1.43% higher test accuracy compared to SOTA techniques with iso-memory footprints. This enables the deployment of our models in memory-constrained devices, such as micro-controllers (without losing significant accuracy), because the initial activation maps consume a significant amount of on-chip memory for high-resolution images required for complex vision tasks. Our proposed pooling method also leverages the idea of channel pruning to further reduce memory footprints.
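The abstract describes the module pipeline as patch embedding, multi-head self-attention, spatial-channel restoration, and a sigmoid followed by an exponential soft-max used to re-weight activations during down-sampling. The snippet below is a minimal PyTorch sketch of how such a non-local self-attentive pooling layer could be wired together as a drop-in replacement for max/average pooling; the layer sizes (patch_size, embed_dim, num_heads) and the exact re-weighting and aggregation details are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class SelfAttentivePooling(nn.Module):
    """Sketch of a non-local self-attentive pooling layer.

    Pipeline (per the abstract): patch embedding -> multi-head
    self-attention -> spatial-channel restoration -> sigmoid ->
    exponential soft-max weighting -> strided aggregation.
    Hyper-parameters and the weighting scheme are assumptions.
    """

    def __init__(self, channels, patch_size=2, embed_dim=64, num_heads=4, stride=2):
        super().__init__()
        # Patch embedding: fold each patch_size x patch_size patch into one token.
        self.embed = nn.Conv2d(channels, embed_dim,
                               kernel_size=patch_size, stride=patch_size)
        # Multi-head self-attention mixes information across all (non-local) patches.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Spatial-channel restoration: project tokens back to input resolution/channels.
        self.restore = nn.ConvTranspose2d(embed_dim, channels,
                                          kernel_size=patch_size, stride=patch_size)
        self.pool = nn.AvgPool2d(stride)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = self.embed(x)                        # (B, D, H/p, W/p)
        hp, wp = tokens.shape[2], tokens.shape[3]
        seq = tokens.flatten(2).transpose(1, 2)       # (B, N, D) patch tokens
        attn_out, _ = self.attn(seq, seq, seq)        # non-local patch dependencies
        attn_map = attn_out.transpose(1, 2).reshape(b, -1, hp, wp)
        restored = self.restore(attn_map)             # back to (B, C, H, W)
        # Sigmoid + exponential soft-max style attention weights over spatial positions.
        weights = torch.exp(torch.sigmoid(restored))
        weights = weights / weights.sum(dim=(2, 3), keepdim=True).clamp(min=1e-6)
        # Attention-weighted, strided aggregation replaces plain max/average pooling.
        return self.pool(x * weights * (h * w))


if __name__ == "__main__":
    layer = SelfAttentivePooling(channels=32)
    out = layer(torch.randn(1, 32, 64, 64))
    print(out.shape)  # torch.Size([1, 32, 32, 32])
```

In this sketch the attention-derived weights rescale the input feature map before a strided average pooling, so the layer keeps the same input/output shapes as a standard stride-2 pooling layer and can be swapped in without changing the surrounding CNN architecture.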
