Paper Title
Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning
Paper Authors
Paper Abstract
Image captioning is one of the straightforward tasks that can take advantage of large-scale web-crawled data, which provides rich knowledge about the visual world for a captioning model. However, since web-crawled data contains image-text pairs aligned at different levels, the inherent noise (e.g., misaligned pairs) makes it difficult to learn a precise captioning model. While a filtering strategy can effectively remove noisy data, it reduces the amount of learnable knowledge and sometimes brings about a new problem of data deficiency. To get the best of both worlds, we propose a Noise-aware Captioning (NoC) framework, which learns rich knowledge from the whole web-crawled dataset while being less affected by the noise. This is achieved by the proposed alignment-level-controllable captioner, which is trained using the alignment levels of image-text pairs as a control signal. The alignment-level-conditioned training allows the model to generate high-quality captions by simply setting the control signal to the desired alignment level at inference time. An in-depth analysis shows the effectiveness of our framework in handling noise. With two tasks, zero-shot captioning and text-to-image retrieval using generated captions (i.e., self-retrieval), we also demonstrate that our model can produce high-quality captions in terms of descriptiveness and distinctiveness. The code is available at \url{https://github.com/kakaobrain/noc}.
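To make the conditioning idea concrete, the following is a minimal sketch (not the authors' implementation; see the linked repository for the actual code) of how alignment-level control can be wired into training data. It assumes each image-text pair comes with an alignment score in [0, 1] (e.g., a CLIP-style image-text similarity), discretizes that score into a bucket, and prepends the bucket as a control token to the caption; at inference time one would condition on the highest bucket to request a well-aligned caption. All names (`alignment_bucket`, `make_training_sequence`, `<align_k>`) are illustrative.

```python
# Sketch of alignment-level-conditioned caption sequences (assumed scheme,
# not the official NoC code): a per-pair alignment score is bucketed into
# a discrete control token that is prepended to the caption tokens.

def alignment_bucket(score: float, num_buckets: int = 4) -> int:
    """Map an alignment score in [0, 1] to a discrete bucket index."""
    score = min(max(score, 0.0), 1.0)          # clamp defensively
    return min(int(score * num_buckets), num_buckets - 1)

def make_training_sequence(caption_tokens, score, num_buckets=4):
    """Prepend the alignment control token to the caption token sequence."""
    control = f"<align_{alignment_bucket(score, num_buckets)}>"
    return [control] + list(caption_tokens)

# Training: a noisy pair keeps its (low) control signal instead of being
# filtered out, so its visual knowledge is still seen by the model.
noisy_seq = make_training_sequence(["a", "dog"], score=0.2)

# Inference: condition on the top bucket to ask for a well-aligned caption.
inference_prefix = f"<align_{alignment_bucket(1.0)}>"
```

The design point this illustrates is that noisy pairs are not discarded: they contribute to training under a low-alignment control token, and the desired caption quality is selected at inference by the choice of token.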