Paper Title
Multi-node Acceleration for Large-scale GCNs
Paper Authors
Paper Abstract
Limited by memory capacity and compute power, single-node graph convolutional neural network (GCN) accelerators cannot complete the execution of GCNs within a reasonable amount of time, due to the explosive size of today's graphs. Large-scale GCNs therefore call for a multi-node acceleration system (MultiAccSys), analogous to TPU-Pod for large-scale neural networks. In this work, we aim to scale up single-node GCN accelerators to accelerate GCNs on large-scale graphs. We first identify the communication pattern and challenges of multi-node acceleration for GCNs on large-scale graphs. We observe that (1) coarse-grained communication patterns exist in the execution of GCNs in a MultiAccSys, which introduce a massive amount of redundant network transmissions and off-chip memory accesses; and (2) overall, the acceleration of GCNs in a MultiAccSys is bandwidth-bound and latency-tolerant. Guided by these two observations, we propose MultiGCN, the first MultiAccSys for large-scale GCNs, which trades network latency for network bandwidth. Specifically, leveraging the tolerance to network latency, we first propose a topology-aware multicast mechanism with a one-put-per-multicast message-passing model to reduce transmissions and alleviate network bandwidth requirements. Second, we introduce a scatter-based round execution mechanism that cooperates with the multicast mechanism and reduces redundant off-chip memory accesses. Compared to the baseline MultiAccSys, MultiGCN achieves a 4~12x speedup using only 28%~68% of the energy, while reducing network transmissions by 32% and off-chip memory accesses by 73% on average. It not only achieves a 2.5~8x speedup over the state-of-the-art multi-GPU solution, but also scales to large-scale graphs, unlike single-node GCN accelerators.
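
To make the bandwidth argument behind the topology-aware multicast concrete, the following minimal Python sketch compares link traversals when one accelerator node sends a vertex feature to many destinations on a 2D mesh. It is an illustration of the general "one put per multicast" idea, not MultiGCN's actual implementation; the mesh topology, the dimension-ordered routing, and the helper names (xy_route, unicast_hops, multicast_hops) are all assumptions made for this example.

# Illustrative sketch: why forking one multicast message inside the
# network uses fewer link traversals than per-destination unicasts.
from itertools import product

def xy_route(src, dst):
    """X-then-Y dimension-ordered route on a 2D mesh; returns traversed links."""
    (sx, sy), (dx, dy) = src, dst
    links, x, y = [], sx, sy
    while x != dx:                       # route along X first
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:                       # then along Y
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def unicast_hops(src, dsts):
    """Each destination gets its own copy: link traversals simply add up."""
    return sum(len(xy_route(src, d)) for d in dsts)

def multicast_hops(src, dsts):
    """One put per multicast: a shared link is traversed once, no matter
    how many destination routes pass over it (set union of links)."""
    shared = set()
    for d in dsts:
        shared.update(xy_route(src, d))
    return len(shared)

if __name__ == "__main__":
    src = (0, 0)
    dsts = [p for p in product(range(4), range(4)) if p != src]  # 4x4 mesh
    print("unicast link traversals  :", unicast_hops(src, dsts))   # 48
    print("multicast link traversals:", multicast_hops(src, dsts)) # 15

On this toy 4x4 mesh, broadcasting from a corner takes 48 link traversals with unicasts but only 15 with the multicast union, which is the kind of transmission reduction the abstract attributes to the topology-aware multicast mechanism.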
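The scatter-based round execution can be read as partitioning destination vertices into rounds so that each source feature fetched from off-chip memory is reused across all in-round neighbors instead of being re-read per destination. The sketch below encodes that reading for a simple mean-neighbor aggregation; the round partitioning scheme and the function scatter_round_aggregate are hypothetical, chosen only to illustrate the reuse pattern, not to reproduce MultiGCN's dataflow.

# Illustrative sketch (assumed behavior): scatter-based round execution.
import numpy as np

def scatter_round_aggregate(features, edges, num_rounds):
    """Mean-neighbor aggregation executed round by round.

    features : (V, F) array of vertex features (modeling off-chip memory)
    edges    : list of (src, dst) pairs
    """
    V, F = features.shape
    out = np.zeros((V, F))
    deg = np.zeros(V)
    for r in range(num_rounds):
        # destinations owned by this round (simple modulo block partition)
        round_edges = [(s, d) for s, d in edges if d % num_rounds == r]
        # group edges by source so each source feature is loaded once per round
        by_src = {}
        for s, d in round_edges:
            by_src.setdefault(s, []).append(d)
        for s, dsts in by_src.items():
            feat = features[s]            # one off-chip read per round
            for d in dsts:                # scatter to all in-round neighbors
                out[d] += feat
                deg[d] += 1
    deg[deg == 0] = 1
    return out / deg[:, None]

# Example: aggregate 8 vertices with 4-dim features over a small edge list.
# feats = np.random.rand(8, 4)
# edges = [(0, 1), (0, 2), (3, 1), (3, 2), (5, 6)]
# print(scatter_round_aggregate(feats, edges, num_rounds=2))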