Paper Title

SoK: The Impact of Unlabelled Data in Cyberthreat Detection

Authors

Giovanni Apruzzese, Pavel Laskov, Aliya Tastemirova

Abstract

Machine learning (ML) has become an important paradigm for cyberthreat detection (CTD) in the recent years. A substantial research effort has been invested in the development of specialized algorithms for CTD tasks. From the operational perspective, however, the progress of ML-based CTD is hindered by the difficulty in obtaining the large sets of labelled data to train ML detectors. A potential solution to this problem are semisupervised learning (SsL) methods, which combine small labelled datasets with large amounts of unlabelled data. This paper is aimed at systematization of existing work on SsL for CTD and, in particular, on understanding the utility of unlabelled data in such systems. To this end, we analyze the cost of labelling in various CTD tasks and develop a formal cost model for SsL in this context. Building on this foundation, we formalize a set of requirements for evaluation of SsL methods, which elucidates the contribution of unlabelled data. We review the state-of-the-art and observe that no previous work meets such requirements. To address this problem, we propose a framework for assessing the benefits of unlabelled data in SsL. We showcase an application of this framework by performing the first benchmark evaluation that highlights the tradeoffs of 9 existing SsL methods on 9 public datasets. Our findings verify that, in some cases, unlabelled data provides a small, but statistically significant, performance gain. This paper highlights that SsL in CTD has a lot of room for improvement, which should stimulate future research in this field.
