Paper Title
Deselection of Base-Learners for Statistical Boosting -- with an Application to Distributional Regression
Paper Authors
Paper Abstract
We present a new procedure for enhanced variable selection for component-wise gradient boosting. Statistical boosting is a computational approach that emerged from machine learning and allows regression models to be fitted in the presence of high-dimensional data. Furthermore, the algorithm can perform data-driven variable selection. In practice, however, the final models tend to include too many variables in some situations. This occurs particularly for low-dimensional data (p < n), where boosting shows a slow overfitting behavior. As a result, more variables get included in the final model without altering the prediction accuracy. Many of these false positives are incorporated with small coefficients and therefore have little impact, but they lead to a larger model. We try to overcome this issue by giving the algorithm the chance to deselect base-learners of minor importance. We analyze the impact of the new approach on variable selection and prediction performance in comparison to alternative methods, including boosting with earlier stopping as well as twin boosting. We illustrate our approach with data from an ongoing cohort study of patients with chronic kidney disease, in which the most influential predictors of a health-related quality of life measure are selected in a distributional regression approach based on beta regression.
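To make the idea more concrete, the following is a minimal, self-contained Python sketch of component-wise L2 boosting with linear base-learners followed by a deselection step, fitted on simulated data. The simulated data, the fixed number of boosting iterations, the threshold `tau`, and the use of each base-learner's share of the total risk reduction as its importance measure are illustrative assumptions for this sketch, not the authors' exact procedure (which is applied within statistical boosting for distributional regression, e.g. beta regression).

```python
# Sketch: component-wise L2 boosting with a subsequent deselection step.
# Assumptions (not from the paper): simulated Gaussian data, linear
# base-learners, fixed m_stop, and a hypothetical threshold "tau" on each
# base-learner's relative contribution to the total risk reduction.
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)  # 2 informative variables

def boost(X, y, m_stop=500, nu=0.1):
    """Component-wise L2 boosting with univariate linear base-learners."""
    n, p = X.shape
    coef = np.zeros(p)
    offset = y.mean()
    resid = y - offset
    risk_reduction = np.zeros(p)  # per-variable contribution to the risk reduction
    for _ in range(m_stop):
        # Fit every base-learner (univariate least squares) to the current residuals.
        betas = X.T @ resid / (X ** 2).sum(axis=0)
        rss = ((resid[:, None] - X * betas) ** 2).sum(axis=0)
        j = int(np.argmin(rss))          # best-fitting base-learner in this iteration
        risk_before = (resid ** 2).mean()
        coef[j] += nu * betas[j]         # update only the selected component
        resid -= nu * betas[j] * X[:, j]
        risk_reduction[j] += risk_before - (resid ** 2).mean()
    return offset, coef, risk_reduction

offset, coef, rr = boost(X, y)
selected = np.flatnonzero(coef != 0)

# Deselection: drop base-learners whose share of the total risk reduction is
# below the (hypothetical) threshold tau, then re-run boosting on the rest.
tau = 0.01
keep = np.flatnonzero(rr / rr.sum() >= tau)
offset2, coef2, _ = boost(X[:, keep], y)

print("selected before deselection:", selected)
print("kept after deselection:     ", keep)
```

In a p < n setting like this one, plain boosting typically picks up several noise variables with small coefficients; the deselection step removes those contributing little to the overall risk reduction and refits a sparser model, which is the behavior the abstract describes.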