Paper Title
Towards Zero-shot Cross-lingual Image Retrieval
Paper Authors
Paper Abstract
There has been a recent spike in interest in multi-modal Language and Vision problems. On the language side, most of these models primarily focus on English, since most multi-modal datasets are monolingual. We try to bridge this gap with a zero-shot approach for learning multi-modal representations using cross-lingual pre-training on the text side. We present a simple yet practical approach for building a cross-lingual image retrieval model that is trained on a monolingual training dataset but can be used in a zero-shot cross-lingual fashion during inference. We also introduce a new objective function that tightens the text embedding clusters by pushing dissimilar texts away from each other. Finally, we introduce a new 1K multi-lingual MSCOCO2014 caption test dataset (XTD10) in 7 languages, which we collected using a crowdsourcing platform. We use it as the test set for evaluating zero-shot model performance across languages. The XTD10 dataset is publicly available here: https://github.com/adobe-research/Cross-lingual-Test-Dataset-XTD10
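The abstract's exact objective function is not given here, but the idea of pushing dissimilar text embeddings away from each other can be illustrated with a minimal hinge-style repulsion term. This is a sketch only, not the paper's loss: the `margin` hyperparameter, the use of cosine similarity, and the `repulsion_loss` helper are all illustrative assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def repulsion_loss(text_embs, dissimilar_pairs, margin=0.2):
    """Hinge-style penalty on dissimilar text pairs: each pair contributes
    a loss when its cosine similarity exceeds the margin, so minimizing it
    pushes dissimilar texts apart and tightens embedding clusters.
    (Illustrative sketch; not the paper's exact objective.)"""
    loss = 0.0
    for i, j in dissimilar_pairs:
        loss += max(0.0, cosine_sim(text_embs[i], text_embs[j]) - margin)
    return loss / max(len(dissimilar_pairs), 1)

# Toy example: embeddings 0 and 1 are identical but marked dissimilar,
# so only that pair is penalized (1.0 - 0.2 = 0.8, averaged over 2 pairs).
embs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(repulsion_loss(embs, [(0, 1), (0, 2)], margin=0.2))  # → 0.4
```

In practice such a repulsion term would be combined with an attraction term on matching image–text pairs, but that pairing is outside the scope of this sketch.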