General Multi-label Image Classification with Transformers

Paper title

General Multi-label Image Classification with Transformers

Authors

Jack Lanchantin, Tianlu Wang, Vicente Ordonez, Yanjun Qi

Abstract

Multi-label image classification is the task of predicting a set of labels corresponding to objects, attributes or other entities present in an image. In this work we propose the Classification Transformer (C-Tran), a general framework for multi-label image classification that leverages Transformers to exploit the complex dependencies among visual features and labels. Our approach consists of a Transformer encoder trained to predict a set of target labels given an input set of masked labels, and visual features from a convolutional neural network. A key ingredient of our method is a label mask training objective that uses a ternary encoding scheme to represent the state of the labels as positive, negative, or unknown during training. Our model shows state-of-the-art performance on challenging datasets such as COCO and Visual Genome. Moreover, because our model explicitly represents the uncertainty of labels during training, it is more general by allowing us to produce improved results for images with partial or extra label annotations during inference. We demonstrate this additional capability in the COCO, Visual Genome, News500, and CUB image datasets.
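
The ternary label-state idea in the abstract can be illustrated with a minimal sketch. It shows one way to hide a random subset of labels during training and to build the state vector at inference when some labels are already annotated. The integer codes, the function names `mask_labels` and `inference_states`, and the `num_unknown` parameter are all hypothetical choices made for illustration; the abstract does not specify how labels are sampled or encoded, and this is not the authors' implementation.

```python
import random

# Illustrative state codes (an assumption; the abstract only names the three states).
UNKNOWN, NEGATIVE, POSITIVE = -1, 0, 1


def mask_labels(labels, num_unknown, rng):
    """Hide `num_unknown` randomly chosen labels for one training example.

    labels: list of 0/1 ground-truth values, one per label.
    Returns (states, hidden_indices): the ternary state per label, and the
    indices a model would be asked to predict.
    """
    hidden = set(rng.sample(range(len(labels)), num_unknown))
    states = [
        UNKNOWN if i in hidden else (POSITIVE if y == 1 else NEGATIVE)
        for i, y in enumerate(labels)
    ]
    return states, sorted(hidden)


def inference_states(num_labels, known=None):
    """Build the state vector at inference time.

    Every label starts as UNKNOWN; any partial annotations passed in `known`
    (a dict of label index -> 0/1) are filled in as NEGATIVE or POSITIVE.
    """
    states = [UNKNOWN] * num_labels
    for i, y in (known or {}).items():
        states[i] = POSITIVE if y == 1 else NEGATIVE
    return states


rng = random.Random(0)
labels = [1, 0, 0, 1, 0]
states, hidden = mask_labels(labels, num_unknown=2, rng=rng)
print("training states:", states, "predict at:", hidden)
print("no annotations:", inference_states(5))
print("partial annotations:", inference_states(5, known={0: 1, 2: 0}))
```

In this sketch, the same state vector format serves both regimes: with no annotations every entry is unknown, and with partial annotations the known entries carry their observed value, which is the property the abstract credits for handling partial or extra labels at inference.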
