Efficient nonparametric statistical inference on population feature importance using Shapley values

Paper title

Efficient nonparametric statistical inference on population feature importance using Shapley values

Authors

Williamson, Brian D.; Feng, Jean

Abstract

The true population-level importance of a variable in a prediction task provides useful knowledge about the underlying data-generating mechanism and can help in deciding which measurements to collect in subsequent experiments. Valid statistical inference on this importance is a key component in understanding the population of interest. We present a computationally efficient procedure for estimating and obtaining valid statistical inference on the Shapley Population Variable Importance Measure (SPVIM). Although the computational complexity of the true SPVIM scales exponentially with the number of variables, we propose an estimator based on randomly sampling only $\Theta(n)$ feature subsets given $n$ observations. We prove that our estimator converges at an asymptotically optimal rate. Moreover, by deriving the asymptotic distribution of our estimator, we construct valid confidence intervals and hypothesis tests. Our procedure has good finite-sample performance in simulations, and for an in-hospital mortality prediction task produces similar variable importance estimates when different machine learning algorithms are applied.
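
To illustrate why exact Shapley-type importance scales exponentially in the number of variables, and how random sampling can sidestep that cost, here is a minimal Python sketch. It is not the estimator proposed in the paper: it uses the generic permutation-sampling Monte Carlo approximation of Shapley values, and the names `exact_shapley`, `sampled_shapley`, and `toy_value` (a made-up predictiveness function over four features) are all hypothetical and introduced only for this illustration.

```python
import itertools
import math
import random


def exact_shapley(value, p):
    """Exact Shapley values; enumerates all 2^(p-1) subsets per feature."""
    phi = [0.0] * p
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Shapley weight for a subset of this size that excludes feature j.
            weight = 1.0 / (p * math.comb(p - 1, size))
            for combo in itertools.combinations(others, size):
                s = frozenset(combo)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi


def sampled_shapley(value, p, n_perms, seed=0):
    """Monte Carlo approximation: average marginal gains over random orderings."""
    rng = random.Random(seed)
    phi = [0.0] * p
    for _ in range(n_perms):
        order = list(range(p))
        rng.shuffle(order)
        s = frozenset()
        prev = value(s)
        for j in order:
            s = s | {j}
            cur = value(s)
            phi[j] += (cur - prev) / n_perms
            prev = cur
    return phi


def toy_value(s):
    """Hypothetical predictiveness of a feature subset (illustration only)."""
    base = {0: 0.3, 1: 0.2, 2: 0.1, 3: 0.0}
    v = sum(base[j] for j in s)
    if 0 in s and 1 in s:
        v += 0.1  # assumed interaction between features 0 and 1
    return v
```

Calling `exact_shapley(toy_value, 4)` evaluates every subset, whereas `sampled_shapley(toy_value, 4, n_perms=2000)` evaluates only a fixed number of random orderings. In a real application, each call to the value function would involve fitting a prediction model on the chosen feature subset, which is why limiting the number of sampled subsets matters.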
