Paper title
Overlapping oriented imbalanced ensemble learning method based on projective clustering and stagewise hybrid sampling
Authors
Abstract
The challenge of imbalanced learning lies not only in the class imbalance problem but also in the more complex class overlapping problem. However, most existing algorithms focus mainly on the former, which limits how far they can improve. To address this limitation, this paper proposes an ensemble learning algorithm based on dual clustering and stage-wise hybrid sampling (DCSHS). DCSHS has three parts. First, we design a projection clustering combination framework (PCC) guided by the Davies-Bouldin clustering effectiveness index (DBI), which is used to obtain high-quality clusters and combine them into a set of cross-complete subsets (CCS) with balanced classes and low overlap. Second, according to the characteristics of the classes in each subset, a stage-wise hybrid sampling algorithm is designed to de-overlap and balance the subsets. Finally, a projective clustering transfer mapping mechanism (CTM) is constructed for all processed subsets by means of transfer learning, thereby reducing class overlap and exploring the structural information of the samples. The major advantage of our algorithm is that it can exploit the intersectionality of the CCS to achieve soft elimination of overlapping majority samples while learning as much information from the overlapping samples as possible, thereby improving the handling of class overlap while balancing the classes. In the experimental section, more than 30 public datasets and over ten representative algorithms are chosen for verification. The experimental results show that DCSHS performs significantly better than the compared algorithms on a range of evaluation criteria.
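
The abstract says that the clustering step is guided by the Davies-Bouldin index but does not spell out how. As a rough illustration only, the sketch below computes the standard DBI for a labeled partition and picks the candidate partition with the lowest value. The helper names `davies_bouldin` and `pick_partition`, the toy points, and the "choose the minimum" rule are assumptions made for this example; they are not taken from the paper and are not the authors' PCC framework.

```python
import math
from collections import defaultdict


def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index; lower means tighter, better-separated clusters.

    Assumes at least two clusters and that no two centroids coincide.
    """
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    if len(clusters) < 2:
        raise ValueError("DBI needs at least two clusters")

    # Centroid of each cluster (coordinate-wise mean).
    centroids = {
        lab: tuple(sum(c) / len(pts) for c in zip(*pts))
        for lab, pts in clusters.items()
    }
    # Scatter: mean Euclidean distance of members to their own centroid.
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }

    # For each cluster, take its worst (largest) similarity ratio to any other
    # cluster, then average those worst cases over all clusters.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


def pick_partition(points, candidates):
    """Hypothetical helper: return the candidate labeling with the lowest DBI."""
    return min(candidates, key=lambda labels: davies_bouldin(points, labels))


# Toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # pairs each point with a far-away one
best = pick_partition(points, [good, bad])
```

On this toy data the labeling that groups nearby points scores far lower than the one that mixes the two groups, so `pick_partition` returns it. In practice the candidates would come from a clustering algorithm run with different settings rather than from hand-written label lists.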