Paper Title
Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
Paper Authors
Paper Abstract
Despite recent progress towards scaling up multimodal vision-language models, these models are still known to struggle on compositional generalization benchmarks such as Winoground. We find that a critical component lacking from current vision-language models is relation-level alignment: the ability to match directional semantic relations in text (e.g., "mug in grass") with spatial relationships in the image (e.g., the position of the mug relative to the grass). To tackle this problem, we show that relation alignment can be enforced by encouraging the directed language attention from 'mug' to 'grass' (capturing the semantic relation 'in') to match the directed visual attention from the mug to the grass. Tokens and their corresponding objects are softly identified using the cross-modal attention. We prove that this notion of soft relation alignment is equivalent to enforcing congruence between vision and language attention matrices under a 'change of basis' provided by the cross-modal attention matrix. Intuitively, our approach projects visual attention into the language attention space to calculate its divergence from the actual language attention, and vice versa. We apply our Cross-modal Attention Congruence Regularization (CACR) loss to UNITER and improve on the state-of-the-art approach to Winoground.
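The core idea in the abstract — projecting visual attention into the language attention space via the cross-modal attention matrix and penalizing its divergence from the actual language attention — can be illustrated with a minimal sketch. This is not the paper's exact CACR loss; the matrix names, the row-wise renormalization, and the choice of KL divergence here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cacr_loss_sketch(A_lang, A_vis, C):
    """Illustrative congruence loss (assumed form, not the paper's exact loss).

    A_lang: (T, T) language self-attention, rows sum to 1
    A_vis:  (V, V) visual self-attention, rows sum to 1
    C:      (T, V) cross-modal attention from text tokens to image regions
    """
    # 'Change of basis': project visual attention into the language space.
    projected = C @ A_vis @ C.T                              # (T, T)
    # Renormalize rows so the projection is a valid attention distribution.
    projected = projected / projected.sum(axis=1, keepdims=True)
    # KL divergence between actual language attention and the projection.
    eps = 1e-9
    kl = np.sum(A_lang * (np.log(A_lang + eps) - np.log(projected + eps)), axis=1)
    return kl.mean()

rng = np.random.default_rng(0)
A_lang = softmax(rng.normal(size=(4, 4)))   # 4 text tokens
A_vis = softmax(rng.normal(size=(6, 6)))    # 6 image regions
C = softmax(rng.normal(size=(4, 6)))        # cross-modal attention
loss = cacr_loss_sketch(A_lang, A_vis, C)
```

In training, the symmetric term (projecting language attention into the visual space, "and vice versa" in the abstract) would be added analogously, and the loss would be computed on the model's actual attention tensors rather than random matrices.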