Paper Title

When Less is More: On the Value of "Co-training" for Semi-Supervised Software Defect Predictors

Authors

Suvodeep Majumder, Joymallya Chakraborty, Tim Menzies

Abstract

Labeling a module defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects and, even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised "co-training methods" work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, these methods make predictions that are competitive with those using 100% of the data. That said, co-training needs to be used cautiously since the specific co-training method needs to be carefully selected based on a user's specific goals. Also, we warn that a commonly used co-training method ("multi-view", where different learners get different sets of columns) does not improve predictions (while adding too much to the run-time cost: 11 hours vs. 1.8 hours). It is an open question, worthy of future work, to test if these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the code used is available at https://github.com/ai-se/Semi-Supervised.
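
The "multi-view" co-training idea named in the abstract (different learners see different sets of columns and label data for each other) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in their repository. The nearest-centroid learner, the synthetic two-class data, and the names `centroids`, `predict`, and `co_train` are all assumptions chosen to keep the example dependency-free. Only the 2.5% labeling budget is taken from the abstract.

```python
import random

def centroids(rows, labels, cols):
    """Per-class mean vector, restricted to one view (a subset of columns)."""
    sums, counts = {}, {}
    for row, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(cols))
        for i, c in enumerate(cols):
            acc[i] += row[c]
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(model, row, cols):
    """Return (label, confidence); confidence is the distance margin
    between the nearest and second-nearest class centroid."""
    dists = sorted(
        (sum((row[c] - m[i]) ** 2 for i, c in enumerate(cols)) ** 0.5, y)
        for y, m in model.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin

def co_train(rows, labels, views, rounds=10, per_round=20):
    """labels[i] is a class for labeled rows and None for unlabeled rows.
    Each view's learner in turn pseudo-labels the rows it is most
    confident about, growing the labeled pool shared by all views."""
    labels = list(labels)
    for _ in range(rounds):
        for cols in views:
            known = [i for i, y in enumerate(labels) if y is not None]
            unknown = [i for i, y in enumerate(labels) if y is None]
            if not unknown:
                return labels
            model = centroids([rows[i] for i in known],
                              [labels[i] for i in known], cols)
            scored = [(predict(model, rows[i], cols), i) for i in unknown]
            scored.sort(key=lambda s: s[0][1], reverse=True)  # most confident first
            for (y, _), i in scored[:per_round]:
                labels[i] = y
    return labels

# Synthetic data (an assumption): 400 rows, 4 numeric columns, 2 classes.
rng = random.Random(0)
truth = [i % 2 for i in range(400)]
rows = [[rng.gauss(3.0 * y, 1.0) for _ in range(4)] for y in truth]

# Reveal labels for only 10 of 400 rows (2.5%), five per class.
seed_labels = [truth[i] if i < 10 else None for i in range(400)]

# Two views: columns 0-1 for one learner, columns 2-3 for the other.
final = co_train(rows, seed_labels, views=[[0, 1], [2, 3]])
accuracy = sum(p == t for p, t in zip(final, truth)) / len(truth)
print(f"labeled fraction: {10 / 400:.1%}, agreement with truth: {accuracy:.1%}")
```

On clean synthetic data such as this, pseudo-labels spread easily, so the sketch says nothing about how co-training behaves on real defect data. The abstract itself reports that the multi-view variant did not improve predictions in the study and was far slower.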
