Paper Title

Cross-Modal Similarity-Based Curriculum Learning for Image Captioning

Paper Authors

Hongkuan Zhang, Saku Sugawara, Akiko Aizawa, Lei Zhou, Ryohei Sasano, Koichi Takeda

Paper Abstract

Image captioning models require high-level generalization ability to describe the contents of various images in words. Most existing approaches treat image-caption pairs equally during training, without considering the differences in their learning difficulties. Several image captioning approaches introduce curriculum learning methods that present training data with increasing levels of difficulty. However, their difficulty measurements are based either on domain-specific features or on prior model training. In this paper, we propose a simple yet efficient difficulty measurement for image captioning using cross-modal similarity calculated by a pretrained vision-language model. Experiments on the COCO and Flickr30k datasets show that our proposed approach achieves performance superior to baselines and competitive convergence speed, without requiring heuristics or incurring additional training costs. Moreover, higher model performance on difficult examples and unseen data further demonstrates the generalization ability.
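
To make the core idea concrete, below is a minimal illustrative sketch (not the authors' code) of how a cross-modal similarity score could be computed for each image-caption pair and used to order training data. It assumes a CLIP-style model loaded via the Hugging Face transformers library and treats higher similarity as "easier"; the paper does not prescribe this specific model, and its exact scoring and scheduling may differ.

# Sketch: cross-modal similarity as a curriculum difficulty score.
# Assumptions (not from the paper): CLIP as the pretrained
# vision-language model, and "higher similarity = easier pair".
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def cross_modal_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the image and caption embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float(img @ txt.T)

def curriculum_order(pairs):
    """Sort (image, caption) pairs from easy (high similarity) to hard."""
    scored = [(cross_modal_similarity(img, cap), i)
              for i, (img, cap) in enumerate(pairs)]
    scored.sort(reverse=True)  # most similar (easiest) pairs first
    return [pairs[i] for _, i in scored]

A full curriculum schedule would typically grow the training subset gradually from the easy end of this ordering rather than sorting once; the scoring step above corresponds to the difficulty measurement the abstract describes. Because the vision-language model is pretrained and used only for scoring, this step adds no extra model training.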
