Paper title
reval: a Python package to determine best clustering solutions with stability-based relative clustering validation
Paper abstract
Determining the best partition for a dataset can be challenging because of (1) the lack of a priori information within an unsupervised learning framework and (2) the absence of a unique clustering validation approach to evaluate clustering solutions. Here we present reval, a Python package that leverages stability-based relative clustering validation methods to determine the best clustering solutions as the ones that best generalize to unseen data. Statistical software, in both R and Python, usually relies on internal validation metrics, such as silhouette, to select the number of clusters that best fits the data. Meanwhile, open-source software solutions that easily implement relative clustering techniques are lacking. Internal validation methods exploit characteristics of the data itself to produce a result, whereas relative approaches attempt to leverage the unknown underlying distribution of data points, looking for generalizable and replicable results. The implementation of relative validation methods can further the theory of clustering by enriching the methods already available for investigating clustering results in different situations and for different data distributions. This work aims to contribute to this effort by developing a stability-based method that selects the best clustering solution as the one that replicates, via supervised learning, on unseen subsets of data. The package works with multiple clustering and classification algorithms, hence allowing both the automation of the labeling process and the assessment of the stability of different clustering mechanisms.
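The selection rule described in the abstract (keep the clustering that replicates, via supervised learning, on unseen subsets of data) can be illustrated with a minimal standard-library sketch. This is not the reval API, and the package itself may differ in how it draws splits or scores solutions. All function names here (`kmeans`, `knn_predict`, `matched_error`, `stability_score`, `make_blob`) are hypothetical, the toy data is synthetic, and the algorithm choices (k-means, a 1-nearest-neighbour classifier, interleaved folds, brute-force label matching) are assumptions made only for illustration.

```python
import itertools


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Lloyd's k-means with deterministic farthest-first initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def knn_predict(train_points, train_labels, query_points):
    """1-nearest-neighbour classifier: copy the label of the closest training point."""
    predictions = []
    for q in query_points:
        nearest = min(range(len(train_points)), key=lambda i: sq_dist(q, train_points[i]))
        predictions.append(train_labels[nearest])
    return predictions


def matched_error(cluster_labels, predicted_labels, k):
    """Misclassification rate under the best relabelling of the clusters.

    Brute force over permutations, which is only practical for small k.
    """
    best = len(cluster_labels)
    for perm in itertools.permutations(range(k)):
        mismatches = sum(perm[c] != p for c, p in zip(cluster_labels, predicted_labels))
        best = min(best, mismatches)
    return best / len(cluster_labels)


def stability_score(points, k, n_folds=2):
    """Average held-out disagreement between clustering and classification."""
    errors = []
    for fold in range(n_folds):
        train = [p for i, p in enumerate(points) if i % n_folds != fold]
        test = [p for i, p in enumerate(points) if i % n_folds == fold]
        train_labels = kmeans(train, k)        # cluster the training subset
        test_clusters = kmeans(test, k)        # cluster the unseen subset on its own
        test_predicted = knn_predict(train, train_labels, test)  # carry labels over
        errors.append(matched_error(test_clusters, test_predicted, k))
    return sum(errors) / n_folds


def make_blob(cx, cy):
    """A 5x5 grid of points around (cx, cy); synthetic toy data."""
    offsets = (-1.0, -0.5, 0.0, 0.5, 1.0)
    return [(cx + dx, cy + dy) for dx in offsets for dy in offsets]


points = make_blob(0.0, 0.0) + make_blob(10.0, 10.0)
scores = {k: stability_score(points, k) for k in (2, 3, 4)}
best_k = min(scores, key=scores.get)  # lowest error; ties go to the smallest k
print(scores, best_k)
```

In this sketch a lower score means the partition found on the training subset is reproduced more faithfully on the held-out subset. Resolving ties toward the smallest k is a design choice of the sketch, and for larger k an assignment solver would be a more scalable way to match labels than enumerating permutations.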