Paper Title
Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU
Paper Authors
Paper Abstract
Sparse compilers are a promising solution for sparse tensor algebra optimization. In compiler implementation, reduction in sparse-dense hybrid algebra plays a key role in performance. Although GPUs provide various reduction semantics that can better utilize parallel computing and memory bandwidth capacity, the central question is: how to elevate these flexible reduction semantics into a sparse compilation theory that assumes serial execution. Specifically, we have to tackle two main challenges: (1) parallelism is wasted when a static synchronization granularity is adopted, and (2) a static reduction strategy limits exploration of the optimization space. We propose Sgap, consisting of segment group and atomic parallelism, to solve these problems. Atomic parallelism captures the flexible reduction semantics to systematically analyze the optimization space of sparse-dense hybrid algebra on GPU. It is a new optimization technique that goes beyond current compiler-based and open-source runtime libraries. Segment group elevates the flexible reduction semantics to suitable levels of abstraction in the sparse compilation theory. It adopts a changeable group size and a user-defined reduction strategy to solve challenges (1) and (2), respectively. Finally, we use GPU sparse matrix-matrix multiplication (SpMM) on the TACO compiler as a use case to demonstrate the effectiveness of segment group in reduction semantics elevation. We achieve up to 1.2x speedup over the original TACO's SpMM kernels. We also apply new optimization techniques found through atomic parallelism to dgSPARSE, an open-source state-of-the-art SpMM library, and achieve a 1.6x-2.3x speedup on the algorithm tuned with atomic parallelism.
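To make the notion of flexible GPU reduction semantics concrete, the following is a minimal illustrative sketch, not the paper's Sgap or dgSPARSE implementation: a COO-format SpMM kernel (C = A * B, with A sparse and B, C dense) in which each thread computes the contribution of one nonzero to one output column and accumulates it with atomicAdd instead of a serial per-row accumulation. The kernel and parameter names are assumptions introduced here for illustration.

```cuda
#include <cuda_runtime.h>

// Illustrative COO SpMM: C[row_idx[e], j] += vals[e] * B[col_idx[e], j]
// Multiple nonzeros that share a row of A may update the same element of C
// concurrently, so the reduction is expressed with an atomic update rather
// than an assumed serial loop over the row.
__global__ void spmm_atomic_sketch(int nnz, int n_cols_B,
                                   const int* __restrict__ row_idx,  // row of each nonzero of A
                                   const int* __restrict__ col_idx,  // column of each nonzero of A
                                   const float* __restrict__ vals,   // nonzero values of A
                                   const float* __restrict__ B,      // dense B, row-major (k x n_cols_B)
                                   float* C) {                       // dense C, row-major (m x n_cols_B), zero-initialized
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nnz * n_cols_B) return;
    int e = t / n_cols_B;   // which nonzero of A this thread handles
    int j = t % n_cols_B;   // which column of B / C this thread handles
    float contrib = vals[e] * B[col_idx[e] * n_cols_B + j];
    atomicAdd(&C[row_idx[e] * n_cols_B + j], contrib);  // flexible (atomic) reduction semantics
}
```

A serial-execution model would instead fix one thread per output row and loop over that row's nonzeros; the atomic variant above trades synchronization cost for finer-grained parallelism, which is the kind of trade-off the abstract's optimization space refers to.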