Paper Title
Towards Zero-shot Cross-lingual Image Retrieval
Paper Authors
Paper Abstract
There has been a recent spike in interest in multi-modal Language and Vision problems. On the language side, most of these models primarily focus on English, since most multi-modal datasets are monolingual. We try to bridge this gap with a zero-shot approach for learning multi-modal representations using cross-lingual pre-training on the text side. We present a simple yet practical approach for building a cross-lingual image retrieval model that is trained on a monolingual training dataset but can be used in a zero-shot cross-lingual fashion during inference. We also introduce a new objective function that tightens the text embedding clusters by pushing dissimilar texts away from each other. Finally, we introduce a new 1K multi-lingual MSCOCO2014 caption test dataset (XTD10) in 7 languages, which we collected using a crowdsourcing platform. We use it as the test set for evaluating zero-shot model performance across languages. The XTD10 dataset is publicly available here: https://github.com/adobe-research/Cross-lingual-Test-Dataset-XTD10
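The abstract's exact objective function is not given here, but the idea of pushing dissimilar text embeddings away from each other can be illustrated with a minimal hinge-style repulsion term. This is a sketch only, not the paper's loss: the `margin` hyperparameter, the use of cosine similarity, and the `repulsion_loss` helper are all illustrative assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def repulsion_loss(text_embs, dissimilar_pairs, margin=0.2):
    """Hinge-style penalty on dissimilar text pairs: each pair contributes
    a loss when its cosine similarity exceeds the margin, so minimizing it
    pushes dissimilar texts apart and tightens embedding clusters.
    (Illustrative sketch; not the paper's exact objective.)"""
    loss = 0.0
    for i, j in dissimilar_pairs:
        loss += max(0.0, cosine_sim(text_embs[i], text_embs[j]) - margin)
    return loss / max(len(dissimilar_pairs), 1)

# Toy example: embeddings 0 and 1 are identical but marked dissimilar,
# so only that pair is penalized (1.0 - 0.2 = 0.8, averaged over 2 pairs).
embs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(repulsion_loss(embs, [(0, 1), (0, 2)], margin=0.2))  # → 0.4
```

In practice such a repulsion term would be combined with an attraction term on matching image–text pairs, but that pairing is outside the scope of this sketch.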