Title

Online AUC Optimization for Sparse High-Dimensional Datasets

Authors

Baojian Zhou, Yiming Ying, Steven Skiena

Abstract

The Area Under the ROC Curve (AUC) is a widely used performance measure for imbalanced classification, which arises in many application domains where high-dimensional sparse data is abundant. In such settings, each $d$-dimensional sample has only $k$ non-zero features with $k \ll d$, and data arrives sequentially in a streaming form. Current online AUC optimization algorithms have a high per-iteration cost of $\mathcal{O}(d)$ and usually produce non-sparse solutions, and hence are not suitable for handling the data challenge mentioned above. In this paper, we aim to directly optimize the AUC score for high-dimensional sparse datasets under the online learning setting, and we propose a new algorithm, \textsc{FTRL-AUC}. Our proposed algorithm can process data in an online fashion with a much cheaper per-iteration cost of $\mathcal{O}(k)$, making it amenable to high-dimensional sparse streaming data analysis. Our new algorithmic design depends critically on a novel reformulation of the U-statistics AUC objective function as an empirical saddle point problem, and on the innovative introduction of a "lazy update" rule, so that the per-iteration complexity is dramatically reduced from $\mathcal{O}(d)$ to $\mathcal{O}(k)$. Furthermore, \textsc{FTRL-AUC} can inherently capture sparsity more effectively by applying a generalized Follow-The-Regularized-Leader (FTRL) framework. Experiments on real-world datasets demonstrate that \textsc{FTRL-AUC} significantly improves both run time and model sparsity while achieving competitive AUC scores compared with state-of-the-art methods. Comparison with online learning methods based on logistic loss demonstrates that \textsc{FTRL-AUC} achieves higher AUC scores, especially when datasets are imbalanced.
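The abstract's key cost argument is that a lazy, per-coordinate FTRL update touches only the $k$ non-zero features of each sample, so one step costs $\mathcal{O}(k)$ rather than $\mathcal{O}(d)$, while the L1 term in the FTRL proximal step keeps most weights exactly zero. The sketch below illustrates that mechanism using the standard FTRL-Proximal recipe with a logistic-loss surrogate as a stand-in objective; it is not the paper's \textsc{FTRL-AUC} algorithm (which optimizes a saddle-point AUC reformulation), and the class name and hyperparameters are illustrative assumptions.

```python
import math

class SparseFTRL:
    """Minimal FTRL-Proximal sketch with lazy per-coordinate updates.

    Weights are never stored densely: each w_i is materialized on demand
    from the accumulated statistic z_i, and only coordinates present in
    the current sparse sample are touched, so one update costs O(k) for
    a sample with k non-zero features, regardless of the dimension d.
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-coordinate accumulated (adjusted) gradients
        self.n = {}  # per-coordinate accumulated squared gradients

    def weight(self, i):
        """Lazily compute w_i; soft-thresholding by l1 yields exact zeros."""
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        sign = -1.0 if z < 0.0 else 1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n.get(i, 0.0))) / self.alpha + self.l2)

    def predict(self, x):
        """x is a sparse dict {index: value}; dot product over non-zeros only."""
        return sum(v * self.weight(i) for i, v in x.items())

    def update(self, x, y):
        """One online step with a logistic-loss surrogate; label y in {0, 1}."""
        margin = max(min(self.predict(x), 35.0), -35.0)  # avoid exp overflow
        p = 1.0 / (1.0 + math.exp(-margin))
        for i, v in x.items():  # O(k): only non-zero coordinates are updated
            g = (p - y) * v
            n_old = self.n.get(i, 0.0)
            n_new = n_old + g * g
            sigma = (math.sqrt(n_new) - math.sqrt(n_old)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self.weight(i)
            self.n[i] = n_new
```

Because `z` and `n` are sparse dictionaries, both memory and per-step time scale with the number of features actually observed, and any coordinate whose accumulated `z_i` stays within the L1 threshold contributes an exact zero weight, matching the model-sparsity behavior the abstract claims for the FTRL framework.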
