Paper Title
TorchSparse: Efficient Point Cloud Inference Engine
Paper Authors
Abstract
Deep learning on point clouds has received increased attention thanks to its wide applications in AR/VR and autonomous driving. These applications require low latency and high accuracy to provide real-time user experience and ensure user safety. Unlike conventional dense workloads, the sparse and irregular nature of point clouds poses severe challenges to running sparse CNNs efficiently on general-purpose hardware. Furthermore, existing sparse acceleration techniques for 2D images do not translate to 3D point clouds. In this paper, we introduce TorchSparse, a high-performance point cloud inference engine that accelerates the sparse convolution computation on GPUs. TorchSparse directly optimizes the two bottlenecks of sparse convolution: irregular computation and data movement. It applies adaptive matrix multiplication grouping to trade computation for better regularity, achieving 1.4-1.5x speedup for matrix multiplication. It also optimizes the data movement by adopting vectorized, quantized, and fused locality-aware memory access, reducing the memory movement cost by 2.7x. Evaluated on seven representative models across three benchmark datasets, TorchSparse achieves 1.6x and 1.5x measured end-to-end speedup over the state-of-the-art MinkowskiEngine and SpConv, respectively.
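The two ideas the abstract names — sparse convolution as gather-GEMM-scatter, and grouping small matrix multiplications into one regular batched GEMM at the cost of some redundant computation — can be illustrated with a minimal NumPy sketch. This is not TorchSparse's actual CUDA implementation; the function names, the per-offset `in_maps`/`out_maps` index lists, and the padding-based grouping strategy are simplified assumptions for illustration.

```python
import numpy as np

def sparse_conv_gather_gemm_scatter(feats, weights, in_maps, out_maps, n_out):
    """Sketch of sparse convolution as gather-GEMM-scatter (illustrative).

    feats:    (N_in, C_in) features of the active (non-empty) input points
    weights:  (K, C_in, C_out) one weight matrix per kernel offset
    in_maps:  per-offset lists of input indices that contribute at that offset
    out_maps: per-offset lists of matching output indices
    """
    out = np.zeros((n_out, weights.shape[-1]), dtype=feats.dtype)
    for k in range(weights.shape[0]):
        if len(in_maps[k]) == 0:
            continue
        gathered = feats[in_maps[k]]          # gather: irregular memory access
        partial = gathered @ weights[k]       # GEMM: dense, regular compute
        np.add.at(out, out_maps[k], partial)  # scatter-accumulate into outputs
    return out

def grouped_gemm(gathered_list, weights):
    """Illustrates the grouping trade-off: pad each per-offset gather to a
    common row count so all K small matmuls become one regular batched GEMM.
    The padding rows are wasted FLOPs, traded for better regularity."""
    max_rows = max(g.shape[0] for g in gathered_list)
    padded = np.stack([
        np.pad(g, ((0, max_rows - g.shape[0]), (0, 0))) for g in gathered_list
    ])                          # (K, max_rows, C_in)
    batched = padded @ weights  # single batched matmul over all offsets
    # Slice the padding back off each per-offset result.
    return [batched[k, :g.shape[0]] for k, g in enumerate(gathered_list)]
```

In the loop version, each kernel offset launches its own small, differently-sized matmul; `grouped_gemm` shows why padding to a shared size can win on GPUs, where one large regular kernel launch often beats many irregular small ones even though some of the computed rows are discarded.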