Paper Title
Building a Performance Model for Deep Learning Recommendation Model Training on GPUs
Paper Authors
Paper Abstract
We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for operators that dominate the device active time, and (2) categorizing operator overheads into five types to determine quantitatively their contribution to the device active time. Combining these two parts, we propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) in all kernel performance modeling, and geomean errors of 4.61% for GPU active time and 7.96% for overall E2E per-batch training time prediction with overheads from individual workloads. A slight increase of 2.19% in E2E prediction error with overheads shared across workloads suggests the feasibility of using shared overheads in large-scale prediction. We show that our general performance model not only achieves low prediction error on DLRM, which has highly customized configurations and is dominated by multiple factors, but also yields comparable accuracy on other compute-bound ML models targeted by most previous methods. Using this performance model and graph-level data and task dependency analysis, we show our system can provide more general model-system co-design than previous methods.
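The critical-path idea described above can be illustrated with a minimal sketch: given a DAG of operators, a per-operator kernel-time estimate, and a per-operator overhead, the predicted per-batch time is the longest (cost-weighted) path through the graph. The operator names, timing values, and the `predict_batch_time` helper below are illustrative assumptions, not the paper's actual implementation, which models kernel times and overheads far more carefully.

```python
from collections import defaultdict

def predict_batch_time(nodes, edges, kernel_time, overhead):
    """Predict per-batch time as the critical (longest) path of a DAG.

    nodes: iterable of operator names
    edges: list of (src, dst) dependency pairs
    kernel_time: op -> predicted device active time (ms)
    overhead: op -> host-side overhead charged to the op (ms)
    """
    succs = defaultdict(list)
    indeg = {n: 0 for n in nodes}
    for s, d in edges:
        succs[s].append(d)
        indeg[d] += 1
    # Kahn-style topological traversal; finish[n] accumulates the
    # earliest-start time of n, then its finish time once processed.
    ready = [n for n in nodes if indeg[n] == 0]
    finish = {n: 0.0 for n in nodes}
    while ready:
        n = ready.pop()
        finish[n] += kernel_time[n] + overhead.get(n, 0.0)
        for d in succs[n]:
            finish[d] = max(finish[d], finish[n])  # d starts after n ends
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)
    return max(finish.values())

# Toy DLRM-like graph: embedding and bottom MLP run in parallel,
# then feature interaction, then the top MLP (values are made up).
ops = ["embedding", "mlp_bottom", "interaction", "mlp_top"]
deps = [("embedding", "interaction"), ("mlp_bottom", "interaction"),
        ("interaction", "mlp_top")]
kt = {"embedding": 2.0, "mlp_bottom": 1.5, "interaction": 0.5, "mlp_top": 1.0}
oh = {"embedding": 0.25}
print(predict_batch_time(ops, deps, kt, oh))  # → 3.75 (2.0 + 0.25 + 0.5 + 1.0)
```

Note that the shorter branch (`mlp_bottom`, 1.5 ms) is hidden behind the longer embedding branch (2.25 ms), which is exactly why a critical-path traversal, rather than a plain sum of kernel times, is needed when device streams overlap.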