Paper Title

Understanding Guided Image Captioning Performance across Domains

Paper Authors

Ng, Edwin G., Pang, Bo, Sharma, Piyush, Soricut, Radu

Paper Abstract

Image captioning models generally lack the capability to take into account user interest, and usually default to global descriptions that try to balance readability, informativeness, and information overload. On the other hand, VQA models generally lack the ability to provide long descriptive answers, while expecting the textual question to be quite precise. We present a method to control the concepts that an image caption should focus on, using an additional input called the guiding text that refers to either groundable or ungroundable concepts in the image. Our model consists of a Transformer-based multimodal encoder that uses the guiding text together with global and object-level image features to derive early-fusion representations used to generate the guided caption. While models trained on Visual Genome data have an in-domain advantage of fitting well when guided with automatic object labels, we find that guided captioning models trained on Conceptual Captions generalize better on out-of-domain images and guiding texts. Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets, and that increased style diversity (even without increasing the number of unique tokens) is a key factor for improved performance.
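The abstract describes an early-fusion design: guiding-text tokens are embedded and concatenated with projected global and object-level image features before being encoded jointly by a Transformer. The following is a minimal PyTorch sketch of that idea only; the dimensions, vocabulary size, and feature extractors are hypothetical assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class GuidedCaptionEncoder(nn.Module):
    """Sketch of an early-fusion multimodal encoder (assumed dimensions).

    Guiding-text token embeddings, a projected global image feature, and
    projected object-level features are concatenated into one sequence and
    encoded jointly by a Transformer encoder. Positional encodings for the
    text tokens are omitted for brevity.
    """

    def __init__(self, vocab_size=30000, d_model=512, n_heads=8,
                 n_layers=6, img_feat_dim=2048, obj_feat_dim=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.global_proj = nn.Linear(img_feat_dim, d_model)  # global image feature
        self.object_proj = nn.Linear(obj_feat_dim, d_model)  # per-object features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, guide_ids, global_feat, obj_feats):
        # guide_ids: (B, T) guiding-text token ids
        # global_feat: (B, img_feat_dim); obj_feats: (B, N, obj_feat_dim)
        text = self.token_emb(guide_ids)                   # (B, T, d_model)
        glob = self.global_proj(global_feat).unsqueeze(1)  # (B, 1, d_model)
        objs = self.object_proj(obj_feats)                 # (B, N, d_model)
        fused = torch.cat([text, glob, objs], dim=1)       # early fusion
        return self.encoder(fused)                         # (B, T+1+N, d_model)

# Example with dummy inputs: 2 images, 7 guiding tokens, 36 object regions.
enc = GuidedCaptionEncoder()
out = enc(torch.randint(0, 30000, (2, 7)),
          torch.randn(2, 2048), torch.randn(2, 36, 2048))
```

A standard autoregressive Transformer decoder would then attend over this fused sequence to generate the guided caption.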
