Paper Title
Evaluating Model Robustness and Stability to Dataset Shift
Authors
Abstract
As the use of machine learning in high-impact domains becomes widespread, the importance of evaluating safety has increased. An important aspect of this is evaluating how robust a model is to changes in setting or population, which typically requires applying the model to multiple independent datasets. Since the cost of collecting such datasets is often prohibitive, in this paper we propose a framework for analyzing this type of stability using the available data. We use the original evaluation data to determine distributions under which the algorithm performs poorly, and estimate the algorithm's performance on the "worst-case" distribution. We consider shifts in user-defined conditional distributions, allowing some distributions to shift while keeping other portions of the data distribution fixed. For example, in a healthcare context, this allows us to consider shifts in clinical practice while keeping the patient population fixed. To address the challenges associated with estimation in complex, high-dimensional distributions, we derive a "debiased" estimator which maintains $\sqrt{N}$-consistency even when machine learning methods with slower convergence rates are used to estimate the nuisance parameters. In experiments on a real medical risk prediction task, we show that this estimator can be used to analyze stability and that it accounts for realistic shifts that could not previously be expressed. The proposed framework allows practitioners to proactively evaluate the safety of their models without requiring additional data collection.
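As a rough illustration of what estimating performance on a "worst-case" distribution can look like, the sketch below computes a naive plug-in estimate from per-example losses on the evaluation data. It assumes, purely for illustration, that the allowed shifts are all subpopulations containing at least a fraction `alpha` of the evaluation set, so the worst case is simply the average of the largest losses. The function name and the `alpha` parameter are hypothetical. This is not the paper's debiased estimator, and it does not model shifts in user-defined conditional distributions.

```python
import numpy as np


def worst_case_subpopulation_loss(losses, alpha):
    """Naive plug-in estimate of the worst-case average loss.

    Assumption (illustrative only): the shifted distribution may be any
    subpopulation that keeps at least a fraction `alpha` of the data, so
    the worst case is the mean of the ceil(alpha * n) largest losses.
    """
    losses = np.asarray(losses, dtype=float)
    if losses.size == 0:
        raise ValueError("losses must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    k = int(np.ceil(alpha * losses.size))  # size of the worst subpopulation
    worst = np.sort(losses)[-k:]           # the k largest losses
    return float(worst.mean())


# Hypothetical 0/1 losses for ten evaluation examples.
example_losses = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
print(worst_case_subpopulation_loss(example_losses, alpha=1.0))  # no shift allowed
print(worst_case_subpopulation_loss(example_losses, alpha=0.5))  # worst half of the data
```

Smaller values of `alpha` allow more severe shifts, so the estimate can only stay the same or grow as `alpha` decreases. With `alpha=1.0` it reduces to the ordinary average loss.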