Cross-Language Source Code Clone Detection Using Deep Learning with InferCode

Paper title

Cross-Language Source Code Clone Detection Using Deep Learning with InferCode

Authors

Yahya, Mohammad A.; Kim, Dae-Kyoo

Abstract

Detecting software clones, within one programming language or across several, helps in finding security gaps and in maintaining software. Existing work on source code clone detection performs well, but only within a single programming language. When code with the same functionality is written in different programming languages, clones are harder to detect because the languages differ in lexical structure. Moreover, most existing work relies on manual feature engineering. In this paper, we propose a deep neural network model based on abstract syntax tree (AST) embeddings of source code that detects cross-language clones end to end, without a manual process for pinpointing similar features across programming languages. To overcome data shortage and reduce overfitting, a Siamese architecture is employed. The design of our model is twofold: (a) it accepts AST embeddings of code in two different programming languages as input, and (b) it uses a deep neural network to learn abstract features from these embeddings to improve the accuracy of cross-language clone detection. An early evaluation of the model yields average precision, recall, and F-measure scores of $0.99$, $0.59$, and $0.80$, respectively, which indicates that our model outperforms all available models in cross-language clone detection.
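
The Siamese arrangement described in the abstract can be illustrated with a minimal sketch: one encoder with shared weights is applied to both embeddings, and the two outputs are compared. Everything in it is assumed for illustration and is not taken from the paper: the names `make_encoder`, `clone_score`, and `is_clone`, the single tanh layer, the cosine comparison, the 0.5 threshold, and the random vectors standing in for AST embeddings (the paper derives its embeddings from InferCode, which is not called here). The weights are untrained, so the score only demonstrates the data flow.

```python
import math
import random


def make_encoder(in_dim, hidden_dim, seed=0):
    """Build one dense layer whose weights are shared by both branches."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(hidden_dim)]
    biases = [0.0] * hidden_dim

    def encode(vec):
        # Dense layer followed by a tanh non-linearity.
        return [math.tanh(sum(w * x for w, x in zip(row, vec)) + b)
                for row, b in zip(weights, biases)]

    return encode


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def clone_score(encode, emb_a, emb_b):
    """Pass both embeddings through the SAME encoder, then compare them."""
    return cosine(encode(emb_a), encode(emb_b))


def is_clone(encode, emb_a, emb_b, threshold=0.5):
    """Hypothetical decision rule: flag a clone above an assumed threshold."""
    return clone_score(encode, emb_a, emb_b) >= threshold


# Random stand-ins for the AST embeddings of a Java and a Python snippet.
rng = random.Random(42)
dim = 8
java_emb = [rng.gauss(0.0, 1.0) for _ in range(dim)]
python_emb = [rng.gauss(0.0, 1.0) for _ in range(dim)]

encode = make_encoder(in_dim=dim, hidden_dim=4)
score = clone_score(encode, java_emb, python_emb)
print(f"similarity score: {score:.3f}")
```

Because both inputs pass through the same `encode` function, the comparison is symmetric and there is only one set of parameters to train. A real model would learn those weights from labelled clone and non-clone pairs, a step this sketch omits.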
