Clustering of variables for enhanced interpretability of predictive models

Paper title

Clustering of variables for enhanced interpretability of predictive models

Author

Vigneau, Evelyne

Abstract

A new strategy is proposed for building easy-to-interpret predictive models from a high-dimensional dataset with a large number of highly correlated explanatory variables. The strategy starts with a variable clustering step that uses the CLustering of Variables around Latent Variables (CLV) method. The hierarchical clustering dendrogram is then explored in order to select the explanatory variables sequentially, in a group-wise fashion. To fit the model, the dendrogram is used as the base-learner in an L2-boosting procedure. The proposed approach, named lmCLV, is illustrated on a toy simulated example in which the clusters and the predictive equation are already known, and on a real case study dealing with the authentication of orange juices based on ¹H-NMR spectroscopic analysis. In both illustrative examples, the procedure showed predictive efficiency similar to that of other methods, with additional interpretability capacity. It is available in the R package ClustVarLV.
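
The two ideas named in the abstract, grouping correlated variables around latent variables and then selecting groups one at a time with L2-boosting, can be illustrated with a minimal Python sketch. This is not the lmCLV algorithm and does not use the ClustVarLV package: it assumes a simplified greedy merge rule (most correlated cluster means) in place of the CLV criterion, a fixed number of clusters in place of exploring the dendrogram, and component-wise least-squares boosting over the cluster means. All function names, the toy data, and the parameter values are hypothetical.

```python
import numpy as np


def standardize(X):
    # Center each column and scale it to unit variance.
    return (X - X.mean(axis=0)) / X.std(axis=0)


def cluster_variables(X, n_clusters):
    # Simplified agglomerative clustering of columns (assumption, not the CLV
    # criterion): repeatedly merge the two clusters whose latent variables,
    # taken here as the mean of the standardized members, are most correlated.
    Z = standardize(X)
    clusters = [[j] for j in range(Z.shape[1])]
    while len(clusters) > n_clusters:
        latents = np.column_stack([Z[:, c].mean(axis=1) for c in clusters])
        corr = np.corrcoef(latents, rowvar=False)
        np.fill_diagonal(corr, -np.inf)
        a, b = np.unravel_index(np.argmax(corr), corr.shape)
        a, b = int(min(a, b)), int(max(a, b))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def l2_boost(latents, y, n_steps=50, nu=0.1):
    # Component-wise L2-boosting: at each step, regress the current residual
    # on every standardized latent variable, keep the best-fitting one, and
    # add a shrunken (factor nu) version of its slope to the model.
    L = standardize(latents)
    n = len(y)
    intercept = y.mean()
    resid = y - intercept
    coef = np.zeros(L.shape[1])
    for _ in range(n_steps):
        slopes = L.T @ resid / n  # least-squares slopes for unit-variance columns
        k = int(np.argmax(np.abs(slopes)))
        coef[k] += nu * slopes[k]
        resid = resid - nu * slopes[k] * L[:, k]
    return intercept, coef


# Toy data (hypothetical): 3 hidden factors, 4 noisy copies of each,
# and a response that depends on the first factor only.
rng = np.random.default_rng(0)
n = 200
factors = rng.normal(size=(n, 3))
X = np.repeat(factors, 4, axis=1) + 0.3 * rng.normal(size=(n, 12))
y = 2.0 * factors[:, 0] + 0.1 * rng.normal(size=n)

clusters = cluster_variables(X, n_clusters=3)
Z = standardize(X)
latents = np.column_stack([Z[:, c].mean(axis=1) for c in clusters])
intercept, coef = l2_boost(latents, y)

for members, weight in zip(clusters, coef):
    print(sorted(members), round(float(weight), 3))
```

Because each coefficient is attached to a whole group of variables rather than to a single column, the fitted model can be read group by group, which is the kind of interpretability the abstract refers to.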
