论文标题

使用图形数据和算法的特征基于机器学习的图形分区策略选择

Machine Learning-based Selection of Graph Partitioning Strategy Using the Characteristics of Graph Data and Algorithm

论文作者

Park, YoungJoon, Lee, DongKyu, Bui, Tien-Cuong

论文摘要

分析大图数据是许多现代应用程序(例如社交网络)的重要组成部分。由于其较大的计算复杂性,经常采用分布式处理。这需要将图形数据划分为节点,并且分区策略的选择对任务的执行时间有很大的影响。但是,在任意图形数据和算法上,没有一定大小的分区策略。策略的性能取决于图数据和算法的特征。此外,由于图数据和算法的复杂性,手动识别最佳分区策略也是不可行的。在这项工作中,我们提出了一种基于机器学习的方法,以选择给定图和处理算法的最合适的分区策略。我们的方法列举了可行的分区策略,可以预测每种目标算法的执行时间,并以最快的估计执行时间选择分区策略。我们的机器学习模型经过从图形数据和算法伪代码中提取的功能进行培训。我们还提出了一种方法,该方法可以增加图形任务的真实执行日志以创建一个大型合成数据集。评估结果表明,与分区策略的平均执行时间相比,我们方法选择的策略平均得出1.46倍的执行时间,而与最佳分配策略相比,性能约为0.95倍。

Analyzing large graph data is an essential part of many modern applications, such as social networks. Due to its large computational complexity, distributed processing is frequently employed. This requires graph data to be divided across nodes, and the choice of partitioning strategy has a great impact on the execution time of the task. Yet, there is no one-size-fits-all partitioning strategy that performs well on arbitrary graph data and algorithms. The performance of a strategy depends on the characteristics of the graph data and algorithms. Moreover, due to the complexity of graph data and algorithms, manually identifying the best partitioning strategy is also infeasible. In this work, we propose a machine learning-based approach to select the most appropriate partitioning strategy for a given graph and processing algorithm. Our approach enumerates viable partitioning strategies, predicts the execution time of the target algorithm for each, and selects the partitioning strategy with the fastest estimated execution time. Our machine learning model is trained on features extracted from graph data and algorithm pseudo-code. We also propose a method that augments real execution logs of graph tasks to create a large synthetic dataset. Evaluation results show that the strategies selected by our approach lead to 1.46X faster execution time on average compared with the mean execution time of the partitioning strategies and about 0.95X the performance compared to the best partitioning strategy.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源