Paper title
Deep-Learning-based Vulnerability Detection in Binary Executables
Paper authors
Abstract
Identifying vulnerabilities is an important part of the software development life cycle for ensuring software security. While vulnerability identification based on source code is a well-studied field, identifying vulnerabilities on the basis of a binary executable without the corresponding source code is more challenging. Recent research [1] has shown how such detection can be achieved with deep learning methods. However, that particular approach is limited to identifying only 4 types of vulnerabilities. We therefore analyze to what extent a larger variety of vulnerabilities can be covered. To this end, we use a supervised deep learning approach based on recurrent neural networks to detect vulnerabilities in binary executables. The underlying dataset contains 50,651 samples of vulnerable code in the form of a standardized LLVM Intermediate Representation. The vectorized features of a Word2Vec model are used to train different variations of three basic recurrent neural network architectures (GRU, LSTM, SRNN). A binary classification model was established to detect the presence of an arbitrary vulnerability, and a multi-class model was trained to identify the exact vulnerability; they achieved out-of-sample accuracies of 88% and 77%, respectively. Detection performance also differed between vulnerabilities, with non-vulnerable samples being detected with a particularly high precision of over 98%. Thus, the presented methodology allows accurate detection of 23 vulnerabilities, compared to 4 in [1].
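To make the pipeline described in the abstract more concrete, the following is a minimal sketch of its general shape: a token sequence is mapped to embedding vectors, passed through a recurrent layer, and reduced to a single score for the binary decision. It is not the authors' implementation. The token stream, the dimensions, the function names, and the randomly initialised weights are all assumptions made for illustration. A random lookup table stands in for a trained Word2Vec model, and a plain tanh recurrence stands in for the recurrent layer (reading SRNN as a simple recurrent network is itself an assumption).

```python
import math
import random

# Hypothetical, simplified LLVM IR token stream; the paper's real tokenisation
# and normalisation rules are not given in the abstract.
tokens = ["%1", "=", "alloca", "i32", "store", "i32", "0", "ptr", "%1",
          "ret", "i32", "0"]

EMBED_DIM = 4   # assumed embedding size (stand-in for Word2Vec vectors)
HIDDEN_DIM = 3  # assumed size of the recurrent state

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def random_matrix(rows, cols):
    """Return a rows x cols matrix of small random (untrained) weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


# Stand-in embedding table: one random vector per distinct token.
# A trained Word2Vec model would supply these vectors instead.
embedding = {tok: [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
             for tok in sorted(set(tokens))}

W_in = random_matrix(HIDDEN_DIM, EMBED_DIM)    # input-to-hidden weights
W_rec = random_matrix(HIDDEN_DIM, HIDDEN_DIM)  # hidden-to-hidden weights
w_out = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN_DIM)]  # output weights


def srnn_score(token_sequence):
    """Run a simple recurrent network over the tokens; return a score in (0, 1)."""
    hidden = [0.0] * HIDDEN_DIM
    for tok in token_sequence:
        x = embedding[tok]
        # New state depends on the current token and the previous state.
        pre = [a + b for a, b in zip(matvec(W_in, x), matvec(W_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid for the binary decision


score = srnn_score(tokens)
print(f"vulnerability score (untrained weights): {score:.3f}")
```

Because the weights are untrained, the printed score carries no meaning. In a real setting the weights would be fitted on labelled samples, and the multi-class variant would replace the single sigmoid output with a softmax over the vulnerability classes.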