Paper title
Clustering with missing data: which equivalent for Rubin's rules?
Authors
Abstract
Multiple imputation (MI) is a popular method for dealing with missing values. However, how to apply clustering after MI remains unclear: how should partitions be pooled, and how should clustering instability be assessed when data are incomplete? By answering both questions, this paper proposes a complete view of clustering with missing data using MI. The problem of pooling partitions is addressed using consensus clustering, while bootstrap theory is used to explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and assessing instability are justified theoretically and studied extensively by simulation. Pooling partitions improves accuracy, while measuring instability with missing data widens the range of possible analyses: it allows assessment of the dependence of the clustering on the imputation model, and provides a convenient way of choosing the number of clusters when data are incomplete, as illustrated on a real data set.
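The idea of pooling partitions after MI can be illustrated with a minimal sketch. It is not the paper's procedure: the imputation step is a toy marginal normal draw, the clustering is a plain k-means, and the consensus is simply the input partition with the highest mean Rand index to the others. All function names (`impute_once`, `kmeans`, `pool_partitions`) are hypothetical, and the returned disagreement is only a crude indicator of between-imputation variability, not the bootstrap-based instability described in the abstract.

```python
import numpy as np

def rand_index(a, b):
    """Fraction of point pairs on which two partitions agree (label-invariant)."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    return float(np.mean(same_a[iu] == same_b[iu]))

def impute_once(X, rng):
    """Toy stochastic imputation (assumption): draw each missing cell from a
    normal distribution fitted to the observed values of its column."""
    Xi = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        obs = X[~miss, j]
        Xi[miss, j] = rng.normal(obs.mean(), obs.std(), size=miss.sum())
    return Xi

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's algorithm with randomly chosen initial centers."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def pool_partitions(partitions):
    """Return the input partition closest on average to the others, plus the
    mean pairwise disagreement (1 - Rand index) between imputed partitions."""
    M = len(partitions)
    agree = np.array([[rand_index(p, q) for q in partitions] for p in partitions])
    mean_agree = (agree.sum(axis=1) - 1.0) / (M - 1)  # exclude self-agreement
    best = int(mean_agree.argmax())
    disagreement = 1.0 - float(agree[np.triu_indices(M, k=1)].mean())
    return partitions[best], disagreement

# Synthetic two-cluster data with about 10% of cells set to missing.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 2)) + 8.0 * truth[:, None]
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.1] = np.nan

# M = 10 imputed data sets, one partition per data set, then pooling.
partitions = [kmeans(impute_once(X_miss, rng), 2, rng) for _ in range(10)]
pooled, disagreement = pool_partitions(partitions)
print("Rand index of pooled partition vs. truth:", rand_index(pooled, truth))
print("Mean disagreement between imputations:", disagreement)
```

Restricting the consensus to one of the input partitions keeps the sketch short. A consensus built from the full co-association matrix would be a natural alternative, at the cost of an extra clustering step.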