Title

Differentiable Soft-Masked Attention

Authors

Ali Athar, Jonathon Luiten, Alexander Hermans, Deva Ramanan, Bastian Leibe

Abstract


Transformers have become prevalent in computer vision due to their performance and flexibility in modelling complex operations. Of particular significance is the 'cross-attention' operation, which allows a vector representation (e.g. of an object in an image) to be learned by attending to an arbitrarily sized set of input features. Recently, "Masked Attention" was proposed in which a given object representation only attends to those image pixel features for which the segmentation mask of that object is active. This specialization of attention proved beneficial for various image and video segmentation tasks. In this paper, we propose another specialization of attention which enables attending over 'soft-masks' (those with continuous mask probabilities instead of binary values), and is also differentiable through these mask probabilities, thus allowing the mask used for attention to be learned within the network without requiring direct loss supervision. This can be useful for several applications. Specifically, we employ our "Differentiable Soft-Masked Attention" for the task of Weakly-Supervised Video Object Segmentation (VOS), where we develop a transformer-based network for VOS which only requires a single annotated image frame for training, but can also benefit from cycle consistency training on a video with just one annotated frame. Although there is no loss for masks in unlabeled frames, the network is still able to segment objects in those frames due to our novel attention formulation. Code: https://github.com/Ali2500/HODOR/blob/main/hodor/modelling/encoder/soft_masked_attention.py
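A minimal sketch of the idea described above, in NumPy: each object query cross-attends to pixel features, and a continuous mask probability per pixel enters the attention logits as an additive log-probability bias, so the operation stays differentiable with respect to the mask (a hard 0/1 mask is the limiting case). This is an illustrative simplification, not the authors' implementation; the function name, shapes, and `eps` stabilizer are assumptions here, and the linked repository contains the actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_masked_attention(queries, keys, values, soft_mask, eps=1e-8):
    """Cross-attention biased by continuous mask probabilities.

    queries:   (Q, d) object descriptors
    keys:      (N, d) pixel features
    values:    (N, d) pixel features
    soft_mask: (Q, N) mask probabilities in [0, 1], NOT binarized
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)        # (Q, N) attention logits
    # Adding log-probabilities keeps the result differentiable w.r.t. the
    # soft mask: pixels with low mask probability are softly suppressed
    # rather than hard-excluded, so gradients can still flow to the mask.
    logits = logits + np.log(soft_mask + eps)
    attn = softmax(logits, axis=-1)               # rows sum to 1
    return attn @ values                          # (Q, d) updated descriptors
```

With a mask of all ones the bias vanishes and this reduces to ordinary cross-attention; as the mask probabilities approach 0/1 it approaches hard masked attention.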
