Combined Cleaning and Resampling Algorithm for Multi-Class Imbalanced Data with Label Noise

Paper Title

Combined Cleaning and Resampling Algorithm for Multi-Class Imbalanced Data with Label Noise

Authors

Michał Koziarski, Michał Woźniak, Bartosz Krawczyk

Abstract

Imbalanced data classification is one of the most crucial tasks facing modern data analysis. Especially when combined with other difficulty factors, such as the presence of noise, overlapping class distributions, and small disjuncts, data imbalance can significantly impact classification performance. Furthermore, some of the data difficulty factors are known to affect the performance of existing oversampling strategies, in particular SMOTE and its derivatives. This effect is especially pronounced in the multi-class setting, in which the mutual imbalance relationships between the classes complicate the problem even further. Despite that, most contemporary research in the area of data imbalance focuses on binary classification problems, while their more difficult multi-class counterparts remain relatively unexplored. In this paper, we propose a novel oversampling technique, the Multi-Class Combined Cleaning and Resampling (MC-CCR) algorithm. The proposed method utilizes an energy-based approach to modeling the regions suitable for oversampling, which is less affected by small disjuncts and outliers than SMOTE. It combines this with a simultaneous cleaning operation, the aim of which is to reduce the effect of overlapping class distributions on the performance of learning algorithms. Finally, by incorporating a dedicated strategy for handling multi-class problems, MC-CCR is less affected by the loss of information about inter-class relationships than traditional multi-class decomposition strategies. Experiments on many multi-class imbalanced benchmark datasets show the high robustness of the proposed approach to noise, as well as its high quality compared to state-of-the-art methods.
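To make the idea of energy-based oversampling with simultaneous cleaning more concrete, the following is a heavily simplified, binary-case sketch in Python. It is not the authors' implementation. The abstract does not state the energy budget, the cost rule for growing a sphere, the way majority points are moved, or the sampling scheme, so all of these are assumptions made for illustration, as are the function name `ccr_sketch` and its parameters. The dedicated multi-class strategy mentioned in the abstract is omitted entirely.

```python
import numpy as np

def ccr_sketch(minority, majority, energy=1.0, n_new=10, seed=0):
    """Toy binary-case sketch (assumed details, not the published algorithm).

    Grows a sphere around each minority point using a fixed energy budget
    (assumed energy > 0), pushes majority points out of the sphere (cleaning),
    and draws synthetic minority points inside the spheres (resampling).
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    majority = np.asarray(majority, dtype=float)
    shifts = np.zeros_like(majority)   # accumulated cleaning translations
    radii = np.zeros(len(minority))

    for i, x in enumerate(minority):
        dists = np.linalg.norm(majority - x, axis=1)
        budget, radius, inside = energy, 0.0, 0
        # Assumed cost rule: growing the radius by delta costs
        # delta * (inside + 1), so enclosing majority points slows growth.
        for d in np.sort(dists):
            cost = (d - radius) * (inside + 1)
            if cost > budget:
                break
            budget -= cost
            radius = d
            inside += 1
        radius += budget / (inside + 1)   # spend the remaining energy
        radii[i] = radius

        # Cleaning: move majority points inside the sphere onto its surface.
        mask = dists < radius
        safe = np.maximum(dists[mask], 1e-12)   # guard against zero distance
        scale = (radius - dists[mask]) / safe
        shifts[mask] += scale[:, None] * (majority[mask] - x)

    cleaned = majority + shifts

    # Resampling: smaller spheres (harder regions) get more synthetic points.
    weights = 1.0 / radii
    weights /= weights.sum()
    centers = rng.choice(len(minority), size=n_new, p=weights)
    directions = rng.normal(size=(n_new, minority.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radial scale is uniform in distance, not in volume (a simplification).
    scales = radii[centers] * rng.uniform(size=n_new)
    synthetic = minority[centers] + directions * scales[:, None]
    return cleaned, synthetic, radii
```

Calling `ccr_sketch(minority, majority)` returns the cleaned majority points, the synthetic minority points, and the sphere radii. Because a sphere stops growing quickly when majority points are nearby, an isolated outlier or a point in an overlapping region produces only a small sampling region, which is one plausible reading of why such an approach would be less sensitive to outliers than interpolation between neighbors.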
