Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned

Paper Title

Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned

Authors

Dongkwan Kim, Eunsoo Kim, Sang Kil Cha, Sooel Son, Yongdae Kim

Abstract

Binary code similarity analysis (BCSA) is widely used for diverse security applications, including plagiarism detection, software license violation detection, and vulnerability discovery. Despite the surging research interest in BCSA, it is significantly challenging to perform new research in this field for several reasons. First, most existing approaches focus only on the end results, namely, increasing the success rate of BCSA, by adopting uninterpretable machine learning. Moreover, they utilize their own benchmark, sharing neither the source code nor the entire dataset. Finally, researchers often use different terminologies or even use the same technique without citing the previous literature properly, which makes it difficult to reproduce or extend previous work. To address these problems, we take a step back from the mainstream and contemplate fundamental research questions for BCSA. Why does a certain technique or a certain feature show better results than the others? Specifically, we conduct the first systematic study on the basic features used in BCSA by leveraging interpretable feature engineering on a large-scale benchmark. Our study reveals various useful insights on BCSA. For example, we show that a simple interpretable model with a few basic features can achieve a comparable result to that of recent deep learning-based approaches. Furthermore, we show that the way we compile binaries or the correctness of underlying binary analysis tools can significantly affect the performance of BCSA. Lastly, we make all our source code and benchmark public and suggest future directions in this field to help further research.
