Paper Title

DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling

Authors

Linyan Mei, Koen Goetschalckx, Arne Symons, Marian Verhelst

Abstract

DNN workloads can be scheduled onto DNN accelerators in many different ways: from layer-by-layer scheduling to cross-layer depth-first scheduling (a.k.a. layer fusion, or cascaded execution). This results in a very broad scheduling space, with each schedule leading to varying hardware (HW) costs in terms of energy and latency. To rapidly explore this vast space for a wide variety of hardware architectures, analytical cost models are crucial to estimate scheduling effects on the HW level. However, state-of-the-art cost models are lacking support for exploring the complete depth-first scheduling space, for instance focusing only on activations while ignoring weights, or modeling only DRAM accesses while overlooking on-chip data movements. These limitations prevent researchers from systematically and accurately understanding the depth-first scheduling space. After formalizing this design space, this work proposes a unified modeling framework, DeFiNES, for layer-by-layer and depth-first scheduling to fill in the gaps. DeFiNES enables analytically estimating the hardware cost for possible schedules in terms of both energy and latency, while considering data access at every memory level. This is done for each schedule and HW architecture under study by optimally choosing the active part of the memory hierarchy per unique combination of operand, layer, and feature map tile. The hardware costs are estimated, taking into account both data computation and data copy phases. The analytical cost model is validated against measured data from a taped-out depth-first DNN accelerator, DepFiN, showing good modeling accuracy at the end-to-end neural network level. A comparison with generalized state-of-the-art demonstrates up to 10X better solutions found with DeFiNES.
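To make the trade-off concrete, here is a minimal toy sketch (not the DeFiNES cost model itself) contrasting the activation DRAM traffic of layer-by-layer scheduling with depth-first tiled scheduling for a stack of 1-D convolutions. The function names, the 3-tap kernel, and the fully-recomputed-halo assumption are illustrative choices; real depth-first schedulers can also cache halos, and weight traffic is ignored here.

```python
def input_tile_size(out_tile, n_layers, kernel=3, stride=1):
    # Receptive-field growth: each conv layer widens the required
    # input tile by (kernel - 1) halo elements (stride 1).
    size = out_tile
    for _ in range(n_layers):
        size = (size - 1) * stride + kernel
    return size

def dram_traffic(fmap_len, n_layers, tile=None, kernel=3):
    """Toy count of activation elements moved to/from DRAM.

    tile=None -> layer-by-layer: every intermediate feature map is
                 written to DRAM and read back by the next layer.
    tile=T    -> depth-first with fully recomputed halos: intermediates
                 stay on-chip, but overlapping input halos are
                 re-fetched once per tile.
    """
    if tile is None:
        inter = (n_layers - 1) * fmap_len
        # network input + (write + read) per intermediate map + output
        return fmap_len + 2 * inter + fmap_len
    n_tiles = -(-fmap_len // tile)  # ceil division
    in_per_tile = input_tile_size(tile, n_layers, kernel)
    # re-fetched (overlapping) input tiles + final output write
    return n_tiles * in_per_tile + fmap_len

print(dram_traffic(1024, 4))           # layer-by-layer: 8192 elements
print(dram_traffic(1024, 4, tile=64))  # depth-first:    2176 elements
```

Even this crude count shows why the scheduling space is worth exploring: the depth-first variant cuts activation DRAM traffic roughly 4x here, at the price of halo re-computation and larger on-chip buffers, which is exactly the kind of trade-off an analytical model must capture per memory level.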
