Paper Title
Using Multiple Instance Learning to Build Multimodal Representations
Paper Authors
Paper Abstract
Image-text multimodal representation learning aligns data across modalities and enables important medical applications, e.g., image classification, visual grounding, and cross-modal retrieval. In this work, we establish a connection between multimodal representation learning and multiple instance learning. Based on this connection, we propose a generic framework for constructing permutation-invariant score functions with many existing multimodal representation learning approaches as special cases. Furthermore, we use the framework to derive a novel contrastive learning approach and demonstrate that our method achieves state-of-the-art results in several downstream tasks.
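To make the abstract's central notion concrete: a permutation-invariant score function treats an image as a bag of region embeddings and a report as a bag of token embeddings, and returns a similarity that does not depend on the order of instances within either bag. The sketch below is purely illustrative and is not the paper's formulation; the function name `mil_score`, the log-sum-exp pooling choice, and the temperature `tau` are assumptions for demonstration.

```python
import numpy as np

def mil_score(image_embs: np.ndarray, text_embs: np.ndarray, tau: float = 0.1) -> float:
    """Illustrative permutation-invariant score between two bags of instances.

    image_embs: (n_regions, d) bag of image-region embeddings.
    text_embs:  (n_tokens, d)  bag of text-token embeddings.

    Pairwise cosine similarities are aggregated with a smooth maximum
    (log-sum-exp) over regions and a mean over tokens; both pooling
    operations are symmetric in their inputs, so reordering instances
    in either bag leaves the score unchanged.
    """
    a = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    b = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sim = a @ b.T                      # (n_regions, n_tokens) cosine similarities
    pooled = tau * np.log(np.exp(sim / tau).sum(axis=0))  # smooth max over regions
    return float(pooled.mean())        # average over tokens
```

Swapping log-sum-exp for a hard max or a mean recovers other pooling choices; any such symmetric aggregator keeps the score permutation-invariant, which is the property the framework in the abstract is built around.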