Structured Knowledge Distillation Towards Efficient and Compact Multi-View 3D Detection

Paper Title

Structured Knowledge Distillation Towards Efficient and Compact Multi-View 3D Detection

Authors

Linfeng Zhang, Yukang Shi, Hung-Shuo Tai, Zhipeng Zhang, Yuan He, Ke Wang, Kaisheng Ma

Abstract

Detecting 3D objects from multi-view images is a fundamental problem in 3D computer vision. Recently, significant breakthroughs have been made in multi-view 3D detection tasks. However, the unprecedented detection performance of these vision BEV (bird's-eye-view) detection models comes with an enormous number of parameters and heavy computation, which makes them unaffordable on edge devices. To address this problem, we propose a structured knowledge distillation framework that aims to improve the efficiency of modern vision-only BEV detection models. The proposed framework mainly includes: (a) spatial-temporal distillation, which distills the teacher's knowledge of information fusion across different timestamps and views; (b) BEV response distillation, which distills the teacher's response to different pillars; and (c) weight-inheriting, which solves the problem of inconsistent inputs between student and teacher in modern transformer architectures. Experimental results show that our method yields an average improvement of 2.16 mAP and 2.27 NDS on the nuScenes benchmark, outperforming multiple baselines by a large margin.
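
The abstract names the components but does not give their loss functions, so the following is only an illustrative sketch of two of the ideas under stated assumptions, not the paper's formulation. It assumes that a "BEV response" can be summarized as one scalar per pillar on a 2D grid and that the student is trained to match the teacher with a plain mean squared error; it also assumes that weight-inheriting can be approximated by copying teacher parameters into the student wherever name and size agree. The names `bev_response_loss` and `inherit_weights` are hypothetical.

```python
from typing import Dict, List

Grid = List[List[float]]  # assumed: one scalar response per BEV pillar (row, column)
Params = Dict[str, List[float]]  # assumed: flat parameter vectors keyed by name


def bev_response_loss(teacher: Grid, student: Grid) -> float:
    """Mean squared error between teacher and student per-pillar responses.

    This is a generic response-matching loss, used here as a stand-in;
    the abstract does not specify the actual loss.
    """
    if len(teacher) != len(student) or any(
        len(t_row) != len(s_row) for t_row, s_row in zip(teacher, student)
    ):
        raise ValueError("teacher and student BEV grids must have the same shape")
    squared_errors = [
        (t - s) ** 2
        for t_row, s_row in zip(teacher, student)
        for t, s in zip(t_row, s_row)
    ]
    if not squared_errors:
        raise ValueError("BEV grids must contain at least one pillar")
    return sum(squared_errors) / len(squared_errors)


def inherit_weights(teacher: Params, student: Params) -> Params:
    """Return student parameters, taking the teacher's values where they fit.

    A parameter is inherited only when the teacher has an entry with the
    same name and the same length; otherwise the student's value is kept.
    """
    inherited: Params = {}
    for name, values in student.items():
        source = teacher.get(name)
        if source is not None and len(source) == len(values):
            inherited[name] = list(source)  # copy, so the teacher is not aliased
        else:
            inherited[name] = list(values)
    return inherited


if __name__ == "__main__":
    teacher_bev = [[1.0, 0.0], [0.5, 0.5]]
    student_bev = [[0.0, 0.0], [0.5, 1.5]]
    print(bev_response_loss(teacher_bev, student_bev))

    teacher_params = {"backbone.w": [1.0, 2.0], "head.w": [3.0]}
    student_params = {"backbone.w": [0.0, 0.0], "head.w": [0.0, 0.0]}
    print(inherit_weights(teacher_params, student_params))
```

In a real detector the grids would be tensors and the parameters would come from a framework's model state; plain lists are used here only to keep the sketch self-contained.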
