Paper Title
Efficient Image Captioning for Edge Devices
Paper Authors
Paper Abstract
Recent years have witnessed the rapid progress of image captioning. However, the demands for large memory storage and heavy computational burden prevent these captioning models from being deployed on mobile devices. The main obstacles lie in the heavyweight visual feature extractors (i.e., object detectors) and complicated cross-modal fusion networks. To this end, we propose LightCap, a lightweight image captioner for resource-limited devices. The core design is built on the recent CLIP model for efficient image captioning. To be specific, on the one hand, we leverage the CLIP model to extract compact grid features without relying on time-consuming object detectors. On the other hand, we transfer the image-text retrieval design of CLIP to image captioning scenarios by devising a novel visual concept extractor and a cross-modal modulator. We further optimize the cross-modal fusion model and parallel prediction heads via sequential and ensemble distillations. With this carefully designed architecture, our model contains merely 40M parameters, reducing the model size by more than 75% and the FLOPs by more than 98% compared with current state-of-the-art methods. Despite its low capacity, our model still exhibits state-of-the-art performance on prevalent datasets, e.g., 136.6 CIDEr on the COCO Karpathy test split. Tested on a smartphone with only a single CPU, the proposed LightCap achieves a fast inference speed of 188 ms per image, making it ready for practical applications.
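To illustrate the detector-free feature extraction mentioned in the abstract, the sketch below shows how compact grid features could be taken from a frozen CLIP image encoder instead of an object detector. The checkpoint name, the ViT backbone, and the Hugging Face Transformers API are assumptions made for illustration only; they are not the paper's exact implementation.

```python
# Minimal sketch: grid features from a frozen CLIP image encoder,
# replacing a heavyweight object detector. The ViT-B/32 checkpoint is an
# illustrative assumption; the paper's CLIP variant may differ.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
encoder.eval()

image = Image.new("RGB", (224, 224))  # placeholder image for the demo
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = encoder(pixel_values=pixel_values)

# last_hidden_state: (batch, 1 + num_patches, hidden_dim).
# Dropping the [CLS] token leaves the spatial grid features that a
# cross-modal fusion module could consume.
grid_features = outputs.last_hidden_state[:, 1:, :]  # e.g. (1, 49, 768)
print(grid_features.shape)
```

In this sketch the grid is only 7x7 tokens, which conveys why such features are far cheaper than the region proposals produced by an object detector.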