Paper Title
Precise Energy Consumption Measurements of Heterogeneous Artificial Intelligence Workloads
Paper Authors
Paper Abstract
With the rise of AI in recent years and the increasing complexity of the models, the growing demand for computational resources is starting to pose a significant challenge. The need for higher compute power is being met with increasingly potent accelerators and the growing use of large compute clusters. However, the gain in prediction accuracy from large models trained on distributed and accelerated systems comes at the price of a substantial increase in energy demand, and researchers have started questioning the environmental friendliness of such AI methods at scale. Consequently, energy efficiency plays an important role for AI model developers and infrastructure operators alike. The energy consumption of AI workloads depends on the model implementation and the hardware used. Therefore, accurate measurements of the power draw of AI workflows on different types of compute nodes are key to algorithmic improvements and the design of future compute clusters and hardware. To this end, we present measurements of the energy consumption of two typical applications of deep learning models on different types of compute nodes. Our results indicate that (1) deriving energy consumption directly from runtime is not accurate; instead, the consumption of the compute node needs to be considered with respect to its composition; (2) neglecting accelerator hardware on mixed nodes results in disproportionately high energy consumption; (3) the energy consumption of model training and inference should be considered separately: while training on GPUs outperforms all other node types in both runtime and energy consumption, inference on CPU nodes can be comparably efficient. One advantage of our approach is that the information on energy consumption is available to all users of the supercomputer, enabling an easy transfer to other workloads and raising user awareness of energy consumption.
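The first finding, that runtime alone is a poor proxy for energy, can be illustrated with a minimal sketch. The node types and power figures below are illustrative assumptions, not measurements or node configurations from the paper; the point is only that energy follows from node-level power draw integrated over runtime, so a faster node can still consume more or less energy depending on its composition.

```python
# Illustrative sketch: energy from node-level power draw, not from runtime alone.
# All node names and power values are assumptions for demonstration purposes.

# Assumed average power draw per node type while running the workload (watts).
NODE_POWER_W = {
    "cpu_only": 350.0,        # hypothetical dual-socket CPU node
    "cpu_plus_4gpu": 1500.0,  # hypothetical mixed node with four accelerators
}

def energy_kwh(node_type: str, runtime_hours: float) -> float:
    """Energy consumed by one node: average power times runtime, in kWh."""
    return NODE_POWER_W[node_type] * runtime_hours / 1000.0

# A GPU node may finish the same job much faster yet draw far more power,
# so neither runtime nor power in isolation determines the energy cost.
print(energy_kwh("cpu_only", runtime_hours=10.0))      # 3.5 kWh
print(energy_kwh("cpu_plus_4gpu", runtime_hours=1.0))  # 1.5 kWh
```

Under these assumed numbers the accelerated node is both faster and cheaper in energy, but shifting either the runtime ratio or the per-node power changes the conclusion, which is why per-node measurements matter.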