Paper Title
Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
Paper Authors
Paper Abstract
With the emergence of large pre-trained vision-language models like CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning tries to probe the beneficial information for downstream tasks from the general knowledge stored in the pre-trained model. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors as the text prompt from the language side. However, tuning the text prompt alone can only adjust the synthesized "classifier", while the visual features computed by the image encoder remain unaffected, leading to sub-optimal solutions. In this paper, we propose a novel Dual-modality Prompt Tuning (DPT) paradigm that learns text and visual prompts simultaneously. To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning (CAVPT) scheme is further proposed in our DPT, where the class-aware visual prompt is generated dynamically by performing cross attention between text prompt features and image patch token embeddings, encoding both downstream task-related information and visual instance information. Extensive experimental results on 11 datasets demonstrate the effectiveness and generalization ability of the proposed method. Our code is available at https://github.com/fanrena/DPT.
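The cross-attention step that generates the class-aware visual prompt can be illustrated with a minimal sketch. This is an assumption-laden simplification, not the authors' implementation: it uses single-head scaled dot-product attention without the learned query/key/value projections a real transformer layer would have, and the array shapes (number of classes, patches, and embedding width) are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def class_aware_visual_prompt(text_prompt_feats, patch_embeds):
    """Sketch of CAVPT-style cross attention (hypothetical helper).

    text_prompt_feats: (num_classes, d) -- queries from the language side.
    patch_embeds:      (num_patches, d) -- keys/values from image patch tokens.
    Returns a (num_classes, d) prompt that mixes task (class) information
    with instance (patch) information.
    """
    d = text_prompt_feats.shape[-1]
    # Attention weights: how much each class query attends to each patch.
    attn = softmax(text_prompt_feats @ patch_embeds.T / np.sqrt(d))
    # Weighted sum of patch embeddings per class query.
    return attn @ patch_embeds

# Toy shapes: 5 classes, 49 patches (7x7 grid), embedding width 8.
rng = np.random.default_rng(0)
prompt = class_aware_visual_prompt(rng.normal(size=(5, 8)),
                                   rng.normal(size=(49, 8)))
print(prompt.shape)  # (5, 8): one visual prompt vector per class
```

Because the queries come from the text prompts while the keys and values come from the current image's patches, the resulting prompt varies per image, which is what makes it "dynamic" in the abstract's description.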