Paper title
The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression
Authors
Abstract
Methods to correct class imbalance, i.e. imbalance between the frequency of outcome events and non-events, are receiving increasing interest for developing prediction models. We examined the effect of imbalance correction on the performance of standard and penalized (ridge) logistic regression models in terms of discrimination, calibration, and classification. We examined random undersampling, random oversampling and SMOTE using Monte Carlo simulations and a case study on ovarian cancer diagnosis. The results indicated that all imbalance correction methods led to poor calibration (strong overestimation of the probability of belonging to the minority class), but not to better discrimination in terms of the area under the receiver operating characteristic curve. Imbalance correction improved classification in terms of sensitivity and specificity, but similar results were obtained by shifting the probability threshold instead. Our study shows that outcome imbalance is not a problem in itself, and that imbalance correction may even worsen model performance.
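The pattern described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' simulation design: the coefficients, sample sizes, random seed, and helper names (`fit_logistic`, `predict`, `auc`, `sens_spec`, `simulate`) are all assumptions chosen for illustration, and only random undersampling with unpenalized logistic regression is shown. It compares a model fitted on the original imbalanced data with one fitted after undersampling, looking at mean predicted risk (a crude calibration check), AUC, and sensitivity and specificity.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Unpenalized logistic regression fitted by Newton-Raphson."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        w = p * (1.0 - p)
        grad = Xd.T @ (y - p)
        hess = Xd.T @ (Xd * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def predict(beta, X):
    Xd = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xd @ beta))

def auc(y, p):
    """AUC from the Mann-Whitney rank sum (assumes no tied predictions)."""
    ranks = np.empty(len(p))
    ranks[np.argsort(p)] = np.arange(1, len(p) + 1)
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

def sens_spec(y, p, threshold):
    pred = p >= threshold
    return pred[y == 1].mean(), (~pred)[y == 0].mean()

def simulate(n, rng):
    # Illustrative data-generating model with a rare outcome (assumed values).
    X = rng.normal(size=(n, 3))
    logit = -2.8 + X @ np.array([0.8, 0.5, 0.3])
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = simulate(5000, rng)
X_test, y_test = simulate(20000, rng)

# Random undersampling: keep all events and an equal number of non-events.
events = np.flatnonzero(y_train == 1)
nonevents = np.flatnonzero(y_train == 0)
keep = np.concatenate(
    [events, rng.choice(nonevents, size=len(events), replace=False)]
)

beta_raw = fit_logistic(X_train, y_train)              # no correction
beta_rus = fit_logistic(X_train[keep], y_train[keep])  # after undersampling

p_raw = predict(beta_raw, X_test)
p_rus = predict(beta_rus, X_test)
event_rate = y_train.mean()

auc_raw, auc_rus = auc(y_test, p_raw), auc(y_test, p_rus)
raw_at_half = sens_spec(y_test, p_raw, 0.5)
raw_shifted = sens_spec(y_test, p_raw, event_rate)  # threshold moved instead
rus_at_half = sens_spec(y_test, p_rus, 0.5)

print(f"observed event rate (test): {y_test.mean():.3f}")
print(f"mean predicted risk: raw={p_raw.mean():.3f}, undersampled={p_rus.mean():.3f}")
print(f"AUC: raw={auc_raw:.3f}, undersampled={auc_rus:.3f}")
print("sens/spec, raw model at 0.5:          ", np.round(raw_at_half, 3))
print("sens/spec, raw model at event rate:   ", np.round(raw_shifted, 3))
print("sens/spec, undersampled model at 0.5: ", np.round(rus_at_half, 3))
```

Under these assumptions, the undersampled model's mean predicted risk should sit far above the observed event rate while its AUC stays close to that of the uncorrected model, and thresholding the uncorrected model at the training event rate should give sensitivity and specificity close to those of the undersampled model at 0.5. Exact numbers depend on the seed and the assumed parameters.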