Paper Title
Uniform Masking Prevails in Vision-Language Pretraining
Paper Authors
Paper Abstract
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining. To implement MLM, the researcher must make two design choices: the masking strategy, which determines which tokens to mask, and the masking rate, which determines how many tokens to mask. Previous work has focused primarily on the masking strategy while setting the masking rate at a default of 15\%. In this paper, we show that increasing this masking rate improves downstream performance while simultaneously reducing the performance gap among different masking strategies, rendering the uniform masking strategy competitive with other, more complex ones. Surprisingly, we also discover that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks, suggesting that the role of MLM goes beyond language modeling in VL pretraining.
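The two design choices named in the abstract can be sketched together: the uniform strategy masks each token independently, and the masking rate is the per-token probability. A minimal illustration (the function name, the `[MASK]` placeholder string, and the 80/10/10 BERT-style corruption split are assumptions for illustration, not details from the paper):

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol; a real tokenizer would use its mask token ID


def uniform_mask(tokens, rate=0.15, vocab=None, rng=random):
    """Uniform masking strategy: mask each token independently with probability `rate`.

    Follows the common BERT-style corruption: of the selected tokens,
    80% become [MASK], 10% become a random token, 10% are kept unchanged.
    Returns the corrupted sequence and per-position labels (None = not supervised).
    """
    vocab = vocab or tokens
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < rate:          # the masking RATE: how many tokens to mask
            labels.append(tok)           # model must recover the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_TOKEN)
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels


# Raising `rate` from the default 0.15 toward higher values masks more
# tokens per caption, which is the intervention the paper studies.
tokens = "a dog plays with a red ball in the park".split()
corrupted, labels = uniform_mask(tokens, rate=0.5, rng=random.Random(0))
```

The strategy is "uniform" because every position is equally likely to be selected; more complex strategies would instead bias selection toward, say, content words or image-grounded spans.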