Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver

Paper Title

Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver

Authors

Linghao Song, Licheng Guo, Suhail Basalama, Yuze Chi, Robert F. Lucas, Jason Cong

Abstract

The continued growth in the processing power of FPGAs, coupled with high bandwidth memories (HBM), makes systems like the Xilinx U280 credible platforms for linear solvers, which often dominate the run time of scientific and engineering applications. In this paper, we present Callipepla, an accelerator for a preconditioned conjugate gradient linear solver (CG). FPGA acceleration of CG faces three challenges: (1) how to support an arbitrary problem and terminate acceleration processing on the fly, (2) how to coordinate long-vector data flow among processing modules, and (3) how to save off-chip memory bandwidth and maintain double (FP64) precision accuracy. To tackle the three challenges, we present (1) a stream-centric instruction set for efficient streaming processing and control, (2) vector streaming reuse (VSR) and decentralized vector flow scheduling to coordinate vector data flow among modules and further reduce off-chip memory accesses with a double memory channel design, and (3) a mixed precision scheme to save bandwidth yet still achieve effective double precision quality solutions. To the best of our knowledge, this is the first work to introduce the concept of VSR for data reuse between on-chip modules to reduce unnecessary off-chip accesses for FPGA accelerators. We prototype the accelerator on a Xilinx U280 HBM FPGA. Our evaluation shows that compared to the Xilinx HPC product, the XcgSolver, Callipepla achieves a speedup of 3.94x, 3.36x higher throughput, and 2.94x better energy efficiency. Compared to an NVIDIA A100 GPU, which has 4x the memory bandwidth of Callipepla, we still achieve 77% of its throughput with 3.34x higher energy efficiency. The code is available at https://github.com/UCLA-VAST/Callipepla.
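
As a rough software illustration of the ideas named in the abstract (a preconditioned CG loop that can terminate on the fly, plus a mixed precision scheme), the following sketch may help. It is not the accelerator's design. The Jacobi (diagonal) preconditioner, the choice to round matrix values to FP32 while keeping vectors in FP64, the squared-residual stopping test, and all function names (`to_fp32`, `spmv`, `jacobi_pcg`) are assumptions made for illustration and are not stated in the abstract.

```python
import struct


def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def spmv(rows, x):
    """Sparse matrix-vector product; rows[i] is a list of (col, value) pairs."""
    return [sum(v * x[j] for j, v in row) for row in rows]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def jacobi_pcg(rows, b, tol=1e-12, max_iter=1000):
    """Jacobi-preconditioned CG for a symmetric positive definite matrix.

    Assumption: matrix values are rounded to FP32 (less data to move),
    while all vectors and scalars stay in FP64.
    Returns (solution, iterations used).
    """
    n = len(b)
    rows32 = [[(j, to_fp32(v)) for j, v in row] for row in rows]
    # Diagonal entries of the matrix form the Jacobi preconditioner.
    diag = [next(v for j, v in rows32[i] if j == i) for i in range(n)]

    x = [0.0] * n
    r = list(b)                                  # r = b - A*0
    z = [ri / di for ri, di in zip(r, diag)]     # z = M^-1 r
    p = list(z)
    rz = dot(r, z)

    for it in range(max_iter):
        # Terminate as soon as the squared residual norm is small enough.
        if dot(r, r) <= tol:
            return x, it
        ap = spmv(rows32, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    # Small SPD system: [[4, 1], [1, 3]] x = [1, 2]
    A = [[(0, 4.0), (1, 1.0)], [(0, 1.0), (1, 3.0)]]
    x, iters = jacobi_pcg(A, [1.0, 2.0])
    print(x, iters)
```

In this sketch the iteration count is not known in advance: the loop checks the residual at every step and stops once the tolerance is met, which is the software analogue of terminating processing on the fly.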
