Paper Title
Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval
Paper Authors
Paper Abstract
This paper investigates an open research problem of generating text-image pairs to improve the training of the fine-grained image-to-text cross-modal retrieval task, and proposes a novel framework for paired data augmentation by uncovering the hidden semantic information of the StyleGAN2 model. Specifically, we first train a StyleGAN2 model on the given dataset. We then project the real images back into the latent space of StyleGAN2 to obtain their latent codes. To make the generated images manipulable, we further introduce a latent space alignment module that learns the alignment between StyleGAN2 latent codes and the corresponding textual caption features. During online paired data augmentation, we first generate augmented text through random token replacement, then pass the augmented text into the latent space alignment module to output latent codes, which are finally fed to StyleGAN2 to generate the augmented images. We evaluate the efficacy of our augmented-data approach on two public cross-modal retrieval datasets, where promising experimental results demonstrate that the augmented text-image pairs can be used for training alongside the original data to boost image-to-text cross-modal retrieval performance.
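Below is a minimal, self-contained sketch of the online paired augmentation step described in the abstract: a caption is perturbed by random token replacement, mapped to a StyleGAN2-style latent code by an alignment module, and decoded into an augmented image. All component names, dimensions, and the toy text encoder/generator stand-ins are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of online paired data augmentation (assumed components throughout).
import random
import torch
import torch.nn as nn


class LatentAlignmentModule(nn.Module):
    """Maps caption features to StyleGAN2-style latent codes (assumed to be an MLP)."""

    def __init__(self, text_dim: int = 512, w_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, 1024), nn.ReLU(),
            nn.Linear(1024, w_dim),
        )

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        # (B, text_dim) caption features -> (B, w_dim) latent codes for the generator
        return self.mlp(text_feat)


def augment_caption(tokens: list[str], vocab: list[str], p: float = 0.15) -> list[str]:
    """Random token replacement: each token is swapped for a random vocabulary
    word with probability p (a simple stand-in for the paper's text augmentation)."""
    return [random.choice(vocab) if random.random() < p else t for t in tokens]


@torch.no_grad()
def paired_augmentation(tokens, vocab, text_encoder, align, generator):
    """Caption -> augmented caption -> aligned latent code -> augmented image."""
    aug_tokens = augment_caption(tokens, vocab)
    text_feat = text_encoder(aug_tokens)   # (1, text_dim) caption feature
    w_code = align(text_feat)              # (1, w_dim) latent code in generator space
    aug_image = generator(w_code)          # (1, 3, H, W) synthesized augmented image
    return aug_tokens, aug_image


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real setup would plug in a
    # trained text encoder and a StyleGAN2 generator in place of these lambdas.
    vocab = ["red", "blue", "bird", "small", "beak", "wings"]
    tokens = ["a", "small", "bird", "with", "a", "red", "beak"]
    text_encoder = lambda toks: torch.randn(1, 512)
    generator = lambda w: torch.tanh(w.view(1, 1, 16, 32).repeat(1, 3, 1, 1))
    align = LatentAlignmentModule()
    aug_tokens, aug_image = paired_augmentation(tokens, vocab, text_encoder, align, generator)
    print(aug_tokens, aug_image.shape)
```

In this sketch the augmented text-image pair (aug_tokens, aug_image) would simply be appended to the training batch alongside the original pairs, matching the abstract's claim that augmented and original data are trained together.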