Paper Title
D$^3$ETR: Decoder Distillation for Detection Transformer
Paper Authors
Paper Abstract
While various knowledge distillation (KD) methods in CNN-based detectors show their effectiveness in improving small students, the baselines and recipes for DETR-based detectors are yet to be built. In this paper, we focus on the transformer decoder of DETR-based detectors and explore KD methods for them. The outputs of the transformer decoder lie in random order, which gives no direct correspondence between the predictions of the teacher and the student, thus posing a challenge for knowledge distillation. To this end, we propose MixMatcher to align the decoder outputs of DETR-based teachers and students, which mixes two teacher-student matching strategies, i.e., Adaptive Matching and Fixed Matching. Specifically, Adaptive Matching applies bipartite matching to adaptively match the outputs of the teacher and the student in each decoder layer, while Fixed Matching fixes the correspondence between the outputs of the teacher and the student with the same object queries, with the teacher's fixed object queries fed to the decoder of the student as an auxiliary group. Based on MixMatcher, we build \textbf{D}ecoder \textbf{D}istillation for \textbf{DE}tection \textbf{TR}ansformer (D$^3$ETR), which distills knowledge in decoder predictions and attention maps from the teacher to the student. D$^3$ETR shows superior performance on various DETR-based detectors with different backbones. For example, D$^3$ETR improves Conditional DETR-R50-C5 by $\textbf{7.8}/\textbf{2.4}$ mAP under $12/50$ epochs training settings with Conditional DETR-R101-C5 as the teacher.
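The two matching strategies described above can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation: Adaptive Matching is rendered as Hungarian matching between student and teacher decoder-layer outputs, and Fixed Matching as an identity correspondence for the auxiliary student group decoded from the teacher's object queries. The function names, the cost built from class probabilities plus L1 box distance, and the tensor shapes are assumptions for illustration.

```python
# Minimal sketch of the MixMatcher idea (illustrative, not the paper's code).
import torch
from scipy.optimize import linear_sum_assignment


def adaptive_match(student_logits, student_boxes, teacher_logits, teacher_boxes):
    """Bipartite (Hungarian) matching between student and teacher predictions
    of one decoder layer. Returns matched (student_idx, teacher_idx) pairs.
    The cost terms below are assumed for illustration."""
    # Agreement between class distributions (higher agreement -> lower cost).
    prob_cost = -(student_logits.softmax(-1) @ teacher_logits.softmax(-1).T)
    # L1 distance between predicted boxes.
    box_cost = torch.cdist(student_boxes, teacher_boxes, p=1)
    cost = (prob_cost + box_cost).detach().cpu().numpy()
    s_idx, t_idx = linear_sum_assignment(cost)
    return torch.as_tensor(s_idx), torch.as_tensor(t_idx)


def fixed_match(num_queries):
    """Fixed one-to-one correspondence for the auxiliary group of student
    outputs decoded from the teacher's own object queries."""
    idx = torch.arange(num_queries)
    return idx, idx


# Usage with random tensors standing in for one decoder layer's outputs
# (300 queries, 91 classes, 4 box coordinates):
N, C = 300, 91
s_logits, t_logits = torch.randn(N, C), torch.randn(N, C)
s_boxes, t_boxes = torch.rand(N, 4), torch.rand(N, 4)
adaptive_pairs = adaptive_match(s_logits, s_boxes, t_logits, t_boxes)
fixed_pairs = fixed_match(N)
```

Under this reading, the distillation losses on decoder predictions and attention maps would be computed over both sets of matched pairs, which is how the mixed strategy supplies stable (fixed) and flexible (adaptive) teacher targets at once.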