Weakly Supervised Annotations for Multi-modal Greeting Cards Dataset

Paper Title

Weakly Supervised Annotations for Multi-modal Greeting Cards Dataset

Authors

Hanif, Sidra; Latecki, Longin Jan

Abstract

In recent years, a growing number of pre-trained models have been trained on large corpora of data and have yielded good performance on various tasks, such as classifying multi-modal datasets. These models perform well on natural images, but they have not been fully explored for the scarce abstract concepts in images. In this work, we introduce an image/text-based dataset called the Greeting Cards Dataset (GCD), which contains abstract visual concepts. We propose to aggregate features from pre-trained image and text embeddings to learn abstract visual concepts from GCD. This allows us to learn text-modified image features, which combine complementary and redundant information from the multi-modal data streams into a single, meaningful feature. Secondly, the captions for GCD are computed with a pre-trained CLIP-based image captioning model. Finally, we demonstrate that the proposed dataset is also useful for generating greeting card images using a pre-trained text-to-image generation model.
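
The abstract does not say how the image and text features are aggregated. As a rough illustration only, the sketch below assumes the simplest possible fusion: each embedding is scaled to unit length and the two are concatenated into a single feature vector. The function names `l2_normalize` and `aggregate_features`, and the toy vectors standing in for pre-trained encoder outputs, are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def aggregate_features(image_emb: Sequence[float],
                       text_emb: Sequence[float]) -> List[float]:
    """Fuse one image embedding and one text embedding into one feature.

    Assumption: plain concatenation of unit-normalized vectors. The
    aggregation actually used in the paper is not given in the abstract.
    """
    return l2_normalize(image_emb) + l2_normalize(text_emb)


# Toy stand-ins for embeddings produced by pre-trained encoders.
image_emb = [3.0, 4.0]
text_emb = [0.0, 2.0, 0.0]

fused = aggregate_features(image_emb, text_emb)
print(len(fused))  # length is the sum of the two embedding sizes
```

Normalizing before concatenation keeps either modality from dominating the fused vector purely because of its scale; a learned fusion layer would be a natural alternative, but nothing in the abstract confirms which approach the authors took.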
