Paper Title
Recurrent Vision Transformers for Object Detection with Event Cameras
Paper Authors
Paper Abstract
We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras. Event cameras provide visual information with sub-millisecond latency at a high dynamic range and with strong robustness against motion blur. These unique properties offer great potential for low-latency object detection and tracking in time-critical scenarios. Prior work in event-based vision has achieved outstanding detection performance but at the cost of substantial inference time, typically beyond 40 milliseconds. By revisiting the high-level design of recurrent vision backbones, we reduce inference time by a factor of 6 while retaining similar performance. To achieve this, we explore a multi-stage design that utilizes three key concepts in each stage: first, a convolutional prior that can be regarded as a conditional positional embedding; second, local and dilated global self-attention for spatial feature interaction; third, recurrent temporal feature aggregation to minimize latency while retaining temporal information. RVTs can be trained from scratch to reach state-of-the-art performance on event-based object detection, achieving an mAP of 47.2% on the Gen1 automotive dataset. At the same time, RVTs offer fast inference (<12 ms on a T4 GPU) and favorable parameter efficiency (5 times fewer parameters than prior art). Our study brings new insights into effective design choices that can be fruitful for research beyond event-based vision.
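The three per-stage concepts named in the abstract map naturally onto a single stage module. Below is a minimal PyTorch sketch of one such stage, assuming MaxViT-style window/grid attention with window size 8, a 4x4 strided-conv embedding, and a ConvLSTM cell; all names and sizes (RVTStage, ConvLSTMCell, channel counts) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of one RVT stage as described in the abstract.
# Assumptions (not the paper's code): MaxViT-style partitioning, window
# size 8, a ConvLSTM cell, and illustrative layer sizes.
import torch
import torch.nn as nn


def window_partition(x, ws):
    """(B, H, W, C) -> (B*num_windows, ws*ws, C): local ws x ws neighborhoods."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


def window_merge(x, ws, H, W, B):
    """Inverse of window_partition."""
    x = x.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


def grid_partition(x, gs):
    """(B, H, W, C) -> (B*num_grids, gs*gs, C): each group holds tokens strided
    H//gs apart, so attention inside a group is dilated and global."""
    B, H, W, C = x.shape
    x = x.view(B, gs, H // gs, gs, W // gs, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, gs * gs, C)


def grid_merge(x, gs, H, W, B):
    """Inverse of grid_partition."""
    x = x.view(B, H // gs, W // gs, gs, gs, -1)
    return x.permute(0, 3, 1, 4, 2, 5).reshape(B, H, W, -1)


class ConvLSTMCell(nn.Module):
    """Recurrent temporal feature aggregation: one conv computes all four gates."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(2 * dim, 4 * dim, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)


class RVTStage(nn.Module):
    """One stage: strided conv (acts as a conditional positional embedding),
    local window attention, dilated global grid attention, then the LSTM."""
    def __init__(self, in_dim, dim, heads=4, ws=8):
        super().__init__()
        self.ws = ws
        self.embed = nn.Conv2d(in_dim, dim, kernel_size=4, stride=4)  # conv prior
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lstm = ConvLSTMCell(dim)

    def forward(self, x, state):
        x = self.embed(x)                              # (B, C, H/4, W/4)
        B, C, H, W = x.shape
        x = x.permute(0, 2, 3, 1)                      # channels-last for LayerNorm

        w = window_partition(self.norm1(x), self.ws)   # local self-attention
        w, _ = self.local_attn(w, w, w, need_weights=False)
        x = x + window_merge(w, self.ws, H, W, B)

        g = grid_partition(self.norm2(x), self.ws)     # dilated global attention
        g, _ = self.global_attn(g, g, g, need_weights=False)
        x = x + grid_merge(g, self.ws, H, W, B)

        # `state` = (h, c) persists across time steps, so each new event tensor
        # is processed in a single cheap forward pass (low per-step latency).
        return self.lstm(x.permute(0, 3, 1, 2), state)


# Usage with hypothetical sizes: a 20-channel event tensor (e.g. stacked
# polarity histograms) processed one time step at a time.
stage = RVTStage(in_dim=20, dim=64)
x = torch.randn(1, 20, 128, 128)
h = c = torch.zeros(1, 64, 32, 32)   # spatial size after the 4x4 stride
feat, (h, c) = stage(x, (h, c))      # feat: (1, 64, 32, 32)
```

In the full backbone, several such stages would be stacked so that detection heads can read multi-scale features, with each stage carrying its own recurrent state between time steps; the exact stage count and widths above are placeholders, not the paper's configuration.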