Paper Title
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering
Paper Authors
Paper Abstract
Benefiting from large-scale pretrained vision language models (VLMs), the performance of visual question answering (VQA) has approached human oracles. However, finetuning such models on limited data often suffers from overfitting and poor generalization issues, leading to a lack of model robustness. In this paper, we aim to improve input robustness from an information bottleneck perspective when adapting pretrained VLMs to the downstream VQA task. Input robustness refers to the ability of models to defend against visual and linguistic input variations, as well as shortcut learning involved in inputs. Generally, the representations obtained by pretrained VLMs inevitably contain irrelevant and redundant information for a specific downstream task, resulting in statistically spurious correlations and insensitivity to input variations. To encourage representations to converge to a minimal sufficient statistic in multimodal learning, we propose Correlation Information Bottleneck (CIB), which seeks a tradeoff between compression and redundancy in representations by minimizing the mutual information (MI) between inputs and representations while maximizing the MI between outputs and representations. Moreover, we derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations, incorporating different internal correlations that guide models to learn more robust representations and facilitate modality alignment. Extensive experiments consistently demonstrate the effectiveness and superiority of the proposed CIB in terms of input robustness and accuracy.
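For reference, the compression-relevance tradeoff described in the abstract corresponds to the classical information bottleneck Lagrangian sketched below, where X denotes the (multimodal) input, Z the learned representation, Y the target answer, and beta a tradeoff coefficient. This is the generic IB objective that CIB builds on, not the paper's exact correlation-based formulation or its derived upper bound.

% Generic information bottleneck Lagrangian (illustrative sketch, not the exact CIB objective).
% The encoder distribution p(z|x) is optimized; beta trades compression I(X;Z) against relevance I(Z;Y).
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}} \;=\; I(X; Z) \;-\; \beta \, I(Z; Y)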