Paper Title

Tooling for Time- and Space-efficient git Repository Mining

Authors

Fabian Heseding, Willy Scheibel, Jürgen Döllner

Abstract

Software projects under version control grow with each commit, accumulating up to hundreds of thousands of commits per repository. Especially for such large projects, the traversal of a repository and the data extraction for static source code analysis pose a trade-off between granularity and speed. We showcase the command-line tool pyrepositoryminer, which combines a set of optimization approaches for efficient traversal and data extraction from git repositories while remaining adaptable to third-party and custom software metrics and data extractions. The tool is written in Python and combines bare repository access, in-memory storage, parallelization, caching, change-based analysis, and optimized communication between the traversal and custom data extraction components. It supports both metrics written in Python and external programs for data extraction. A single-threaded performance evaluation based on a basic mining use case shows a mean speedup of 15.6x over other freely available tools across four mid-sized open source projects. Multi-threaded execution distributes the load among cores and thus yields a mean speedup of up to 86.9x using 12 threads.
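To illustrate how caching and change-based analysis can reduce work during traversal, the following is a minimal sketch in plain Python. It is not pyrepositoryminer's implementation or API: the toy repository model (commits as path-to-blob-id mappings, a linear history) and the names `changed_blobs`, `mine`, and `line_count` are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy model: a commit maps file paths to blob ids, and blob
# contents live in a separate content-addressed store (as in git).
Tree = Dict[str, str]


def changed_blobs(parent: Tree, child: Tree) -> List[str]:
    """Return ids of blobs that are new or modified relative to the parent."""
    return [blob for path, blob in child.items() if parent.get(path) != blob]


def mine(
    commits: List[Tree],
    blobs: Dict[str, str],
    metric: Callable[[str], int],
) -> Tuple[List[Dict[str, int]], int]:
    """Compute a per-file metric for every commit of a linear history.

    Change-based analysis: only blobs that differ from the parent commit
    are considered. Caching: each blob id is evaluated at most once.
    Returns the per-commit results and the number of metric evaluations.
    """
    cache: Dict[str, int] = {}
    evaluations = 0
    results: List[Dict[str, int]] = []
    previous: Tree = {}
    for tree in commits:
        for blob_id in changed_blobs(previous, tree):
            if blob_id not in cache:
                cache[blob_id] = metric(blobs[blob_id])
                evaluations += 1
        # Unchanged files reuse values cached while visiting earlier commits.
        results.append({path: cache[blob_id] for path, blob_id in tree.items()})
        previous = tree
    return results, evaluations


def line_count(text: str) -> int:
    """A stand-in metric: number of lines in a file."""
    return len(text.splitlines())


# Illustrative data: four commits touching two files; the last commit
# reverts main.py to content that was already analyzed.
blobs = {"b1": "a\nb\n", "b2": "x\n", "b3": "a\nb\nc\n"}
commits = [
    {"main.py": "b1"},
    {"main.py": "b1", "util.py": "b2"},
    {"main.py": "b3", "util.py": "b2"},
    {"main.py": "b1", "util.py": "b2"},
]
results, evaluations = mine(commits, blobs, line_count)
```

In this toy history, analyzing every file at every commit would call the metric seven times, whereas the sketch calls it three times, once per distinct blob. Real repositories with merges and renames need more care than this linear model provides.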
