Paper title

Optimal trees selection for classification via out-of-bag assessment and sub-bagging

Authors

Khan, Zardad; Gul, Naz; Faiz, Nosheen; Gul, Asma; Adler, Werner; Lausen, Berthold

Abstract

The effect of training data size on machine learning methods has been well investigated over the past two decades. In general, the predictive performance of tree-based machine learning methods improves at a decreasing rate as the size of the training data increases. We investigate this in the optimal trees ensemble (OTE), where the method fails to learn from some of the training observations because they are set aside for internal validation. Modified tree selection methods are therefore proposed for OTE to compensate for the loss of training observations in internal validation. In the first method, the corresponding out-of-bag (OOB) observations are used in both the individual and the collective performance assessment of each tree. Trees are ranked by their individual performance on the OOB observations. A certain number of top-ranked trees is selected; starting from the most accurate tree, subsequent trees are added one by one, and their impact is recorded using the OOB observations left out of the bootstrap sample taken for the tree being added. A tree is selected if it improves the predictive accuracy of the ensemble. In the second method, trees are grown on random subsets of the training data taken without replacement (known as sub-bagging) instead of bootstrap samples (taken with replacement). The remaining observations from each sample are used in both the individual and the collective assessment of the corresponding tree, as in the first method. Analyses of 21 benchmark datasets and simulation studies show improved performance of the modified methods in comparison to OTE and other state-of-the-art methods.
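The selection procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`draw_sample`, `brier`, `select_trees`) are hypothetical, the problem is assumed to be binary, each tree is represented only by its predicted probabilities of class 1, and the Brier score is used as the performance measure for both ranking and ensemble assessment (the abstract itself only speaks of "predictive accuracy").

```python
import random


def draw_sample(n, rng, subbag_fraction=None):
    """Return (in_bag, left_out) index lists for n observations.

    subbag_fraction=None draws a bootstrap sample (with replacement);
    otherwise a random subset of that fraction is drawn without
    replacement (sub-bagging).
    """
    if subbag_fraction is None:
        in_bag = [rng.randrange(n) for _ in range(n)]
    else:
        in_bag = rng.sample(range(n), round(subbag_fraction * n))
    chosen = set(in_bag)
    left_out = [i for i in range(n) if i not in chosen]
    return in_bag, left_out


def brier(prob, y, idx):
    """Mean squared gap between predicted P(class 1) and 0/1 labels on idx."""
    if not idx:
        return float("inf")  # no held-out observations: treat as worst case
    return sum((prob[i] - y[i]) ** 2 for i in idx) / len(idx)


def ensemble_prob(members, probs, n):
    """Average the member trees' probabilities for every observation."""
    return [sum(probs[t][i] for t in members) / len(members) for i in range(n)]


def select_trees(probs, left_out, y, n_top):
    """Rank trees on their own left-out data, then add them greedily.

    probs[t][i]  : tree t's predicted probability of class 1 for observation i
    left_out[t]  : indices not used to grow tree t (OOB or sub-bagging remainder)
    """
    n = len(y)
    # Individual assessment: lower Brier score on own left-out data ranks first.
    ranked = sorted(range(len(probs)),
                    key=lambda t: brier(probs[t], y, left_out[t]))[:n_top]
    selected = [ranked[0]]
    for t in ranked[1:]:
        idx = left_out[t]  # assess on the candidate's own left-out data
        before = brier(ensemble_prob(selected, probs, n), y, idx)
        after = brier(ensemble_prob(selected + [t], probs, n), y, idx)
        if after < before:  # keep the tree only if the ensemble improves
            selected.append(t)
    return selected


if __name__ == "__main__":
    # Toy, hand-written inputs (illustrative only, not from the paper).
    y = [1, 0, 1, 0]
    probs = [[0.9, 0.1, 0.8, 0.2],
             [0.6, 0.4, 0.95, 0.05],
             [0.2, 0.8, 0.3, 0.7]]
    left_out = [[0, 1], [2, 3], [0, 3]]
    print(select_trees(probs, left_out, y, n_top=3))  # prints [1, 0]
```

In this sketch the two methods differ only in how `draw_sample` is called: with `subbag_fraction=None` for the OOB variant, or with a fraction such as 0.5 for the sub-bagging variant. The selection step is the same in both cases, which mirrors the abstract's statement that the second method assesses trees "similar to the first method".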
