Paper title
Fraud Dataset Benchmark and Applications
Paper authors
Paper abstract
Standardized datasets and benchmarks have spurred innovation in computer vision, natural language processing, multi-modal, and tabular settings. Compared with other well-researched fields, fraud detection poses unique challenges: high class imbalance, diverse feature types, frequently changing fraud patterns, and the adversarial nature of the problem. As a result, modeling approaches evaluated on datasets from other research fields may not work well for fraud detection. In this paper, we introduce the Fraud Dataset Benchmark (FDB), a compilation of publicly available datasets catered to fraud detection. FDB comprises a variety of fraud-related tasks, ranging from identifying fraudulent card-not-present transactions, detecting bot attacks, classifying malicious URLs, and estimating the risk of loan default to content moderation. The Python-based library for FDB provides a consistent API for data loading with standardized training and testing splits. We demonstrate several applications of FDB that are of broad interest for fraud detection, including feature engineering, comparison of supervised learning algorithms, label noise removal, class-imbalance treatment, and semi-supervised learning. We hope that FDB provides a common playground for researchers and practitioners in the fraud detection domain to develop robust and customized machine learning techniques targeting various fraud use cases.