Title

Online AUC Optimization for Sparse High-Dimensional Datasets

Authors

Baojian Zhou, Yiming Ying, Steven Skiena

Abstract

The Area Under the ROC Curve (AUC) is a widely used performance measure for imbalanced classification, which arises in many application domains where high-dimensional sparse data is abundant. In such settings, each $d$-dimensional sample has only $k$ non-zero features with $k \ll d$, and data arrives sequentially in a streaming form. Current online AUC optimization algorithms have a high per-iteration cost of $\mathcal{O}(d)$ and usually produce non-sparse solutions, and hence are not suitable for handling the data challenge mentioned above. In this paper, we aim to directly optimize the AUC score for high-dimensional sparse datasets under the online learning setting, and we propose a new algorithm, \textsc{FTRL-AUC}. Our proposed algorithm can process data in an online fashion with a much cheaper per-iteration cost of $\mathcal{O}(k)$, making it amenable to high-dimensional sparse streaming data analysis. Our new algorithmic design depends critically on a novel reformulation of the U-statistics AUC objective function as an empirical saddle point problem, and on the innovative introduction of a "lazy update" rule, so that the per-iteration complexity is dramatically reduced from $\mathcal{O}(d)$ to $\mathcal{O}(k)$. Furthermore, \textsc{FTRL-AUC} can inherently capture sparsity more effectively by applying a generalized Follow-The-Regularized-Leader (FTRL) framework. Experiments on real-world datasets demonstrate that \textsc{FTRL-AUC} significantly improves both run time and model sparsity while achieving competitive AUC scores compared with state-of-the-art methods. Comparison with online learning methods based on logistic loss demonstrates that \textsc{FTRL-AUC} achieves higher AUC scores, especially when datasets are imbalanced.
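The abstract's key cost argument is that a lazy, per-coordinate FTRL update touches only the $k$ non-zero features of each sample, so one step costs $\mathcal{O}(k)$ rather than $\mathcal{O}(d)$, while the L1 term in the FTRL proximal step keeps most weights exactly zero. The sketch below illustrates that mechanism using the standard FTRL-Proximal recipe with a logistic-loss surrogate as a stand-in objective; it is not the paper's \textsc{FTRL-AUC} algorithm (which optimizes a saddle-point AUC reformulation), and the class name and hyperparameters are illustrative assumptions.

```python
import math

class SparseFTRL:
    """Minimal FTRL-Proximal sketch with lazy per-coordinate updates.

    Weights are never stored densely: each w_i is materialized on demand
    from the accumulated statistic z_i, and only coordinates present in
    the current sparse sample are touched, so one update costs O(k) for
    a sample with k non-zero features, regardless of the dimension d.
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-coordinate accumulated (adjusted) gradients
        self.n = {}  # per-coordinate accumulated squared gradients

    def weight(self, i):
        """Lazily compute w_i; soft-thresholding by l1 yields exact zeros."""
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        sign = -1.0 if z < 0.0 else 1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n.get(i, 0.0))) / self.alpha + self.l2)

    def predict(self, x):
        """x is a sparse dict {index: value}; dot product over non-zeros only."""
        return sum(v * self.weight(i) for i, v in x.items())

    def update(self, x, y):
        """One online step with a logistic-loss surrogate; label y in {0, 1}."""
        margin = max(min(self.predict(x), 35.0), -35.0)  # avoid exp overflow
        p = 1.0 / (1.0 + math.exp(-margin))
        for i, v in x.items():  # O(k): only non-zero coordinates are updated
            g = (p - y) * v
            n_old = self.n.get(i, 0.0)
            n_new = n_old + g * g
            sigma = (math.sqrt(n_new) - math.sqrt(n_old)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self.weight(i)
            self.n[i] = n_new
```

Because `z` and `n` are sparse dictionaries, both memory and per-step time scale with the number of features actually observed, and any coordinate whose accumulated `z_i` stays within the L1 threshold contributes an exact zero weight, matching the model-sparsity behavior the abstract claims for the FTRL framework.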
