Paper Title

P2M-DeTrack: Processing-in-Pixel-in-Memory for Energy-efficient and Real-Time Multi-Object Detection and Tracking

Authors

Gourav Datta, Souvik Kundu, Zihan Yin, Joe Mathai, Zeyu Liu, Zixu Wang, Mulin Tian, Shunlin Lu, Ravi T. Lakkireddy, Andrew Schmidt, Wael Abd-Almageed, Ajey P. Jacob, Akhilesh R. Jaiswal, Peter A. Beerel

Abstract

Today's high-resolution, high-frame-rate cameras in autonomous vehicles generate a large volume of data that must be transferred to and processed by a downstream processor or machine learning (ML) accelerator to enable intelligent computing tasks such as multi-object detection and tracking. This massive data transfer creates significant energy, latency, and bandwidth bottlenecks that hinder real-time processing. To mitigate this problem, we propose an algorithm-hardware co-design framework called Processing-in-Pixel-in-Memory-based object Detection and Tracking (P2M-DeTrack). P2M-DeTrack is based on a custom Faster R-CNN-based model that is distributed partly inside the pixel array (front-end) and partly in a separate FPGA/ASIC (back-end). The proposed front-end in-pixel processing significantly down-samples the input feature maps with judiciously optimized strided convolution and pooling. Compared to a conventional baseline design that transfers frames of RGB pixels to the back-end, the resulting P2M-DeTrack designs reduce the data bandwidth between the sensor and the back-end by up to 24x. The designs also reduce the sensor and total energy per frame (obtained from in-house circuit simulations at the GlobalFoundries 22nm technology node) by 5.7x and 1.14x, respectively. Lastly, they reduce the sensing and total frame latency by an estimated 1.7x and 3x, respectively. We evaluate our approach on the multi-object detection (tracking) task of the large-scale BDD100K dataset and observe only a 0.5% reduction in the mean average precision (0.8% reduction in the identification F1 score) compared to the state-of-the-art.
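
The bandwidth argument in the abstract can be illustrated with a little arithmetic. The sketch below compares the bits per frame sent by a raw-RGB baseline with the bits sent after an in-pixel strided convolution followed by pooling. Every numeric setting in it (kernel size, stride, pooling window, channel count, bit depths) is a hypothetical placeholder chosen for easy arithmetic and is not taken from the paper; only the 1280x720 frame size matches BDD100K. The resulting ratio is therefore illustrative and should not be read as the paper's 24x figure.

```python
def out_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def bandwidth_reduction(h, w, in_ch, in_bits,
                        kernel, stride, pool, out_ch, out_bits):
    """Ratio of raw-frame bits to down-sampled feature-map bits per frame."""
    # Baseline: every raw pixel value is sent to the back-end.
    baseline_bits = h * w * in_ch * in_bits

    # Front-end: strided convolution, then non-overlapping pooling.
    conv_h = out_size(h, kernel, stride)
    conv_w = out_size(w, kernel, stride)
    pool_h = out_size(conv_h, pool, pool)
    pool_w = out_size(conv_w, pool, pool)

    # Only the pooled feature map leaves the sensor.
    feature_bits = pool_h * pool_w * out_ch * out_bits
    return baseline_bits / feature_bits


# Hypothetical settings (assumptions, not the paper's configuration).
ratio = bandwidth_reduction(
    h=720, w=1280, in_ch=3, in_bits=12,   # 12-bit RGB frame (assumed depth)
    kernel=4, stride=4, pool=2,           # assumed in-pixel layer geometry
    out_ch=8, out_bits=8,                 # assumed output channels and precision
)
print(f"bandwidth reduction: {ratio:.1f}x")
```

The ratio depends on the product of the spatial down-sampling factor, the channel ratio, and the bit-depth ratio, which is why a more aggressive stride can be traded against more output channels or higher output precision.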
