Paper Title

Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks

Paper Authors

Qingrong Cheng, Keyu Wen, Xiaodong Gu

Paper Abstract

Text-to-image synthesis aims to generate a photo-realistic and semantically consistent image from a given text description. Compared with the corresponding real image and text description, the images synthesized by off-the-shelf models usually contain limited components, which degrades both image quality and textual-visual consistency. To address this issue, we propose a novel Vision-Language Matching strategy for text-to-image synthesis, named VLMGAN*, which introduces a dual vision-language matching mechanism to strengthen image quality and semantic consistency. The dual mechanism considers both textual-visual matching between the generated image and the corresponding text description, and a visual-visual consistency constraint between the synthesized image and the real image. Given a text description, VLMGAN* first encodes it into textual features and then feeds them into a dual vision-language matching-based generative model to synthesize a photo-realistic and textually consistent image. Moreover, the popular evaluation metrics for text-to-image synthesis are borrowed from plain image generation and mainly evaluate the realism and diversity of the synthesized images. We therefore introduce a metric named Vision-Language Matching Score (VLMS), which considers both image quality and the semantic consistency between the synthesized image and its description. The proposed dual multi-level vision-language matching strategy can be applied to other text-to-image synthesis methods; we implement it on two popular baselines, denoted ${\text{VLMGAN}_{+\text{AttnGAN}}}$ and ${\text{VLMGAN}_{+\text{DFGAN}}}$. Experimental results on two widely used datasets show that the model achieves significant improvements over other state-of-the-art methods.
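The abstract describes the dual matching objective and the VLMS metric only at a high level. The PyTorch-style sketch below is an illustration of what such a dual objective could look like, not the authors' implementation: it assumes hypothetical pretrained encoders that map images and text into a shared embedding space, uses cosine similarity for both matching terms, and the function names and weights `alpha`/`beta` are invented for this example.

```python
import torch
import torch.nn.functional as F


def dual_matching_loss(fake_img_emb: torch.Tensor,
                       real_img_emb: torch.Tensor,
                       text_emb: torch.Tensor,
                       alpha: float = 1.0,
                       beta: float = 1.0) -> torch.Tensor:
    """Sketch of a dual vision-language matching objective.

    All inputs are (batch, dim) embeddings from assumed pretrained
    image/text encoders; the cosine-similarity form and the weights
    alpha/beta are illustrative, not taken from the paper.
    """
    fake = F.normalize(fake_img_emb, dim=-1)
    real = F.normalize(real_img_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)

    # Textual-visual term: the generated image should align with its caption.
    tv = 1.0 - (fake * text).sum(dim=-1).mean()
    # Visual-visual term: the generated image should align with the real image.
    vv = 1.0 - (fake * real).sum(dim=-1).mean()
    return alpha * tv + beta * vv


@torch.no_grad()
def vlm_score(img_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Illustrative stand-in for the VLMS metric: mean cosine similarity
    between paired image and caption embeddings (higher is better)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    return (img * txt).sum(dim=-1).mean().item()
```

In a full pipeline the embeddings would come from the multi-level text/image encoders the baselines build on (e.g., DAMSM-style encoders in AttnGAN), and the loss would be added to the generator's adversarial objective; those details are not recoverable from the abstract alone.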
