Paper Title
Multi-node Acceleration for Large-scale GCNs
Paper Authors
Paper Abstract
Limited by memory capacity and compute power, single-node graph convolutional neural network (GCN) accelerators cannot complete the execution of GCNs within a reasonable amount of time, due to the explosive size of today's graphs. Large-scale GCNs therefore call for a multi-node acceleration system (MultiAccSys), analogous to TPU-Pod for large-scale neural networks. In this work, we aim to scale up single-node GCN accelerators to accelerate GCNs on large-scale graphs. We first identify the communication pattern and challenges of multi-node acceleration for GCNs on large-scale graphs. We observe that (1) coarse-grained communication patterns exist in the execution of GCNs in a MultiAccSys, which introduce a massive amount of redundant network transmissions and off-chip memory accesses; and (2) overall, the acceleration of GCNs in a MultiAccSys is bandwidth-bound and latency-tolerant. Guided by these two observations, we propose MultiGCN, the first MultiAccSys for large-scale GCNs, which trades network latency for network bandwidth. Specifically, leveraging the tolerance to network latency, we first propose a topology-aware multicast mechanism with a one-put-per-multicast message-passing model to reduce transmissions and alleviate network bandwidth requirements. Second, we introduce a scatter-based round execution mechanism that cooperates with the multicast mechanism and reduces redundant off-chip memory accesses. Compared to the baseline MultiAccSys, MultiGCN achieves a 4~12x speedup using only 28%~68% of the energy, while reducing network transmissions by 32% and off-chip memory accesses by 73% on average. It not only achieves a 2.5~8x speedup over the state-of-the-art multi-GPU solution, but also scales to large-scale graphs, unlike single-node GCN accelerators.
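
To make the bandwidth argument behind the topology-aware multicast concrete, the following minimal Python sketch compares link traversals when one accelerator node sends a vertex feature to many destinations on a 2D mesh. It is an illustration of the general "one put per multicast" idea, not MultiGCN's actual implementation; the mesh topology, the dimension-ordered routing, and the helper names (xy_route, unicast_hops, multicast_hops) are all assumptions made for this example.

# Illustrative sketch: why forking one multicast message inside the
# network uses fewer link traversals than per-destination unicasts.
from itertools import product

def xy_route(src, dst):
    """X-then-Y dimension-ordered route on a 2D mesh; returns traversed links."""
    (sx, sy), (dx, dy) = src, dst
    links, x, y = [], sx, sy
    while x != dx:                       # route along X first
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:                       # then along Y
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def unicast_hops(src, dsts):
    """Each destination gets its own copy: link traversals simply add up."""
    return sum(len(xy_route(src, d)) for d in dsts)

def multicast_hops(src, dsts):
    """One put per multicast: a shared link is traversed once, no matter
    how many destination routes pass over it (set union of links)."""
    shared = set()
    for d in dsts:
        shared.update(xy_route(src, d))
    return len(shared)

if __name__ == "__main__":
    src = (0, 0)
    dsts = [p for p in product(range(4), range(4)) if p != src]  # 4x4 mesh
    print("unicast link traversals  :", unicast_hops(src, dsts))   # 48
    print("multicast link traversals:", multicast_hops(src, dsts)) # 15

On this toy 4x4 mesh, broadcasting from a corner takes 48 link traversals with unicasts but only 15 with the multicast union, which is the kind of transmission reduction the abstract attributes to the topology-aware multicast mechanism.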
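The scatter-based round execution can be read as partitioning destination vertices into rounds so that each source feature fetched from off-chip memory is reused across all in-round neighbors instead of being re-read per destination. The sketch below encodes that reading for a simple mean-neighbor aggregation; the round partitioning scheme and the function scatter_round_aggregate are hypothetical, chosen only to illustrate the reuse pattern, not to reproduce MultiGCN's dataflow.

# Illustrative sketch (assumed behavior): scatter-based round execution.
import numpy as np

def scatter_round_aggregate(features, edges, num_rounds):
    """Mean-neighbor aggregation executed round by round.

    features : (V, F) array of vertex features (modeling off-chip memory)
    edges    : list of (src, dst) pairs
    """
    V, F = features.shape
    out = np.zeros((V, F))
    deg = np.zeros(V)
    for r in range(num_rounds):
        # destinations owned by this round (simple modulo block partition)
        round_edges = [(s, d) for s, d in edges if d % num_rounds == r]
        # group edges by source so each source feature is loaded once per round
        by_src = {}
        for s, d in round_edges:
            by_src.setdefault(s, []).append(d)
        for s, dsts in by_src.items():
            feat = features[s]            # one off-chip read per round
            for d in dsts:                # scatter to all in-round neighbors
                out[d] += feat
                deg[d] += 1
    deg[deg == 0] = 1
    return out / deg[:, None]

# Example: aggregate 8 vertices with 4-dim features over a small edge list.
# feats = np.random.rand(8, 4)
# edges = [(0, 1), (0, 2), (3, 1), (3, 2), (5, 6)]
# print(scatter_round_aggregate(feats, edges, num_rounds=2))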