Paper Title
PSCS: A Path-based Neural Model for Semantic Code Search
Paper Authors
Paper Abstract
To obtain code snippets for reuse, programmers prefer to search related documents, e.g., blogs or Q&A posts, instead of the code itself. The major reason is the semantic diversity and mismatch between queries and code snippets. Deep learning models have been proposed to address this challenge. Compared with approaches using information retrieval techniques, deep learning models do not suffer from the information loss caused by refining user intention into keywords. However, the performance of previous works is not satisfactory because they ignore the importance of code structure. When the semantics of code (e.g., identifier names, APIs) are ambiguous, code structure may be the only feature for the model to utilize. In that case, previous works relearn the structural information from lexical tokens of code, which is extremely difficult for a model without any domain knowledge. In this work, we propose PSCS, a path-based neural model for semantic code search. Our model encodes both the semantics and structure of code represented by AST paths. We train and evaluate our model on 330k and 19k query-function pairs, respectively. The evaluation results demonstrate that PSCS achieves a SuccessRate of 47.6% and a Mean Reciprocal Rank (MRR) of 30.4% when considering the top-10 results with a match. The proposed approach significantly outperforms both DeepCS, the first approach that applies deep learning to the code search task, and CARLCS, a state-of-the-art approach that introduces a co-attentive representation learning model on the basis of DeepCS. The importance of code structure is demonstrated with an ablation study on code features, which informs model design for further studies.
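As a quick reference for the two evaluation metrics cited in the abstract, here is a minimal sketch of SuccessRate@k (the fraction of queries whose matching function appears in the top-k results) and MRR computed over the top-k. The rank lists below are hypothetical examples, not data from the paper.

```python
def success_rate_at_k(ranks, k=10):
    """ranks: 1-based rank of the correct snippet per query, or None if not retrieved."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)

def mean_reciprocal_rank(ranks, k=10):
    """MRR over the top-k: contributes 1/rank for a hit within k, else 0."""
    return sum(1.0 / r if r is not None and r <= k else 0.0 for r in ranks) / len(ranks)

# Hypothetical ranks of the matching function for four queries.
ranks = [1, 3, None, 12]
print(success_rate_at_k(ranks))     # 0.5  (two of four queries hit within top-10)
print(mean_reciprocal_rank(ranks))  # (1 + 1/3 + 0 + 0) / 4 ≈ 0.333
```

Note that a rank outside the cutoff (here, 12) counts as a miss for both metrics, which is why the abstract reports both numbers "when considering the top-10 results with a match".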