Paper title
Hard-ODT: Hardware-Friendly Online Decision Tree Learning Algorithm and System
Authors
Abstract
Decision trees are machine learning models commonly used in a wide range of application scenarios. In the era of big data, traditional decision tree induction algorithms are poorly suited to large-scale datasets because of their stringent data storage requirements. Online decision tree learning algorithms have been devised to tackle this problem by training on incoming samples and providing inference results concurrently. However, even the most up-to-date online tree learning algorithms still suffer from either high memory usage or high computational intensity with data dependencies and long latency, which makes them challenging to implement in hardware. To overcome these difficulties, we introduce a new quantile-based algorithm to improve the induction of the Hoeffding tree, one of the state-of-the-art online learning models. The proposed algorithm is lightweight in terms of both memory and computational demand, while still maintaining high generalization ability. A series of optimization techniques dedicated to the proposed algorithm have been investigated from the hardware perspective, including coarse-grained and fine-grained parallelism, dynamic and memory-based resource sharing, and pipelining with data forwarding. Following this, we present Hard-ODT, a high-performance, hardware-efficient and scalable online decision tree learning system on a field-programmable gate array (FPGA) with system-level optimization techniques. Performance and resource utilization are modeled for the complete learning system to enable early and fast analysis of the trade-offs between various design metrics. Finally, we propose a design flow in which the proposed learning system is applied to FPGA run-time power monitoring as a case study.
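For readers unfamiliar with the Hoeffding tree mentioned in the abstract, the sketch below illustrates the standard Hoeffding-bound split test that such trees are generally built on. It is not the quantile-based algorithm proposed in the paper, whose details the abstract does not give. The function names `hoeffding_bound` and `should_split`, and the default values for `value_range` and `delta`, are illustrative assumptions.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with the given range lies within this epsilon of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float, n: int,
                 value_range: float = 1.0, delta: float = 1e-7) -> bool:
    """Split a leaf once the observed gap between the best and second-best
    split criteria exceeds the Hoeffding bound for the n samples seen so far.
    The defaults are assumed values for illustration only."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Same observed gap of 0.05, evaluated after different sample counts.
    for n in (1_000, 10_000):
        print(n, round(hoeffding_bound(1.0, 1e-7, n), 4),
              should_split(0.30, 0.25, n))
```

With these assumed settings, a gap of 0.05 between the two best candidates is not enough to split after 1,000 samples but is enough after 10,000, because the bound shrinks in proportion to the inverse square root of the sample count. This is what lets the tree grow from a data stream without storing past samples.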