Paper Title
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning
Paper Authors
Paper Abstract
Large-scale pre-trained models have shown promising open-world performance for both vision and language tasks. However, their transferred capacity on 3D point clouds is still limited and only constrained to the classification task. In this paper, we first collaborate CLIP and GPT to be a unified 3D open-world learner, named PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection. To better align 3D data with the pre-trained language knowledge, PointCLIP V2 contains two key designs. For the visual end, we prompt CLIP via a shape projection module to generate more realistic depth maps, narrowing the domain gap between projected point clouds and natural images. For the textual end, we prompt the GPT model to generate 3D-specific text as the input of CLIP's textual encoder. Without any training in 3D domains, our approach significantly surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D classification. On top of that, V2 can be extended to few-shot 3D classification, zero-shot 3D part segmentation, and 3D object detection in a simple manner, demonstrating our generalization ability for unified 3D open-world learning.
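A minimal sketch (not the authors' released code) of the zero-shot pipeline the abstract describes: project a point cloud to a depth map, encode it with CLIP's image encoder, encode 3D-specific class prompts with CLIP's text encoder, and classify by cosine similarity. The prompt strings below are placeholders standing in for GPT-generated 3D-specific text, and the orthographic projection is a simplified stand-in for the paper's shape projection module; the class names and resolution are illustrative assumptions.

```python
import numpy as np
import torch
import clip                      # OpenAI CLIP package
from PIL import Image

def project_depth_map(points, resolution=224):
    """Simplified orthographic projection of an (N, 3) point cloud onto the
    xy-plane, keeping the nearest depth per pixel (a stand-in for the paper's
    realistic shape projection module)."""
    pts = points - points.min(axis=0)
    pts = pts / (pts.max() + 1e-6)                       # normalize to [0, 1]
    u = np.clip((pts[:, 0] * (resolution - 1)).astype(int), 0, resolution - 1)
    v = np.clip((pts[:, 1] * (resolution - 1)).astype(int), 0, resolution - 1)
    depth = np.ones((resolution, resolution), dtype=np.float32)
    for x, y, z in zip(u, v, pts[:, 2]):
        depth[y, x] = min(depth[y, x], z)                # keep closest point
    depth = 1.0 - depth                                  # nearer -> brighter
    img = (depth * 255).astype(np.uint8)
    return Image.fromarray(img).convert("RGB")           # CLIP expects 3 channels

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Placeholder descriptions standing in for GPT-generated 3D-specific prompts.
class_prompts = {
    "chair":    "A grayscale depth map of a chair with four legs and a flat backrest.",
    "airplane": "A grayscale depth map of an airplane with two wings and a tail.",
}
text_tokens = clip.tokenize(list(class_prompts.values())).to(device)

point_cloud = np.random.rand(1024, 3).astype(np.float32)    # dummy input point cloud
image = preprocess(project_depth_map(point_cloud)).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text_tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    logits = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

pred = list(class_prompts.keys())[logits.argmax().item()]
print("Predicted class:", pred)
```

In the paper's setting, multiple views would be projected and their logits aggregated, and the prompts would come from querying GPT for shape-level descriptions of each category; this single-view, hand-written-prompt version only illustrates the matching mechanism.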