Workload-Aware DRAM Error Prediction using Machine Learning

Paper title

Workload-Aware DRAM Error Prediction using Machine Learning

Authors

Lev Mukhanov, Konstantinos Tovletoglou, Hans Vandierendonck, Dimitrios S. Nikolopoulos, Georgios Karakonstantis

Abstract

The aggressive scaling of technology may have helped to meet the growing demand for higher memory capacity and density, but it has also made DRAM cells more prone to errors. This has triggered a lot of interest in modeling DRAM behavior, either to predict errors in advance or to adjust DRAM circuit parameters for a better trade-off between energy efficiency and reliability. Existing modeling efforts have studied the impact of a few operating parameters and temperature on DRAM reliability using custom FPGA setups; however, they neglected the combined effect of workload-specific features, which can be systematically investigated only on a real system. In this paper, we present the results of our study on workload-dependent DRAM error behavior within a real server, considering various operating parameters such as the refresh rate, voltage and temperature. We show that the rate of single- and multi-bit errors may vary across workloads by 8x, indicating that program-inherent features can affect DRAM reliability significantly. Based on this observation, we extract 249 features, such as the memory access rate, the rate of cache misses, the memory reuse time and data entropy, from various compute-intensive, caching and analytics benchmarks. We apply several supervised learning methods to construct the DRAM error behavior model for 72 server-grade DRAM chips using the memory operating parameters and the extracted program-inherent features. Our results show that, with an appropriate choice of program features and supervised learning method, the rate of single- and multi-bit errors can be predicted for a specific DRAM module with an average error of less than 10.5%, as opposed to the 2.9x estimation error obtained for a conventional workload-unaware error model.
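Data entropy is one of the program features the abstract names. The abstract does not say how the authors compute it, so the following is only a minimal sketch under an assumed definition: Shannon entropy over the byte values of a memory buffer, in bits per byte. The function name `byte_entropy` is hypothetical and is not taken from the paper.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte buffer, in bits per byte (range 0.0 to 8.0).

    Assumption: this byte-level definition is illustrative only; the paper's
    actual data-entropy feature may be defined differently.
    """
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    # H = sum over byte values of p * log2(1 / p), with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    print(byte_entropy(b"\x00" * 64))        # constant data: lowest entropy
    print(byte_entropy(bytes(range(256))))   # every byte value once: highest entropy
```

Under this assumed definition, a buffer filled with one repeated value scores 0 and a buffer with uniformly distributed byte values scores 8, so the feature gives a model a single number describing how varied a workload's stored data is.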
