Paper Title

Answer Summarization for Technical Queries: Benchmark and New Approach

Paper Authors

Chengran Yang, Bowen Xu, Ferdian Thung, Yucen Shi, Ting Zhang, Zhou Yang, Xin Zhou, Jieke Shi, Junda He, DongGyun Han, David Lo

Paper Abstract

Prior studies have demonstrated that approaches to generate an answer summary for a given technical query in Software Question and Answer (SQA) sites are desired. We find that existing approaches are assessed solely through user studies. There is a need for a benchmark with ground truth summaries to complement assessment through user studies. Unfortunately, such a benchmark is non-existent for answer summarization for technical queries from SQA sites. To fill the gap, we manually construct a high-quality benchmark to enable automatic evaluation of answer summarization for technical queries from SQA sites. Using the benchmark, we comprehensively evaluate the performance of existing approaches and find that there is still large room for improvement. Motivated by the results, we propose a new approach, TechSumBot, with three key modules: 1) a Usefulness Ranking module, 2) a Centrality Estimation module, and 3) a Redundancy Removal module. We evaluate TechSumBot both automatically (i.e., using our benchmark) and manually (i.e., via a user study). The results from both evaluations consistently demonstrate that TechSumBot outperforms the best-performing baseline approaches from both the SE and NLP domains by a large margin, i.e., by 10.83%-14.90%, 32.75%-36.59%, and 12.61%-17.54% in terms of ROUGE-1, ROUGE-2, and ROUGE-L in the automatic evaluation, and by 5.79%-9.23% and 17.03%-17.68% in terms of average usefulness and diversity scores in the human evaluation. This highlights that automatic evaluation with our benchmark can uncover findings similar to those found through user studies. More importantly, automatic evaluation has a much lower cost, especially when it is used to assess a new approach. Additionally, we conduct an ablation study, which demonstrates that each module in TechSumBot contributes to boosting its overall performance.
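
The automatic evaluation described above reports ROUGE-1, ROUGE-2, and ROUGE-L against the benchmark's ground truth summaries. As a minimal illustration of that style of evaluation (not the authors' actual evaluation script), these metrics can be computed with the `rouge_score` Python package; the reference and generated summaries below are placeholders.

```python
# Minimal sketch of ROUGE-based automatic evaluation against a ground truth
# summary. The placeholder strings stand in for benchmark data; this is an
# illustration, not the paper's evaluation code.
from rouge_score import rouge_scorer

reference_summary = "Use StringBuilder when concatenating strings in a loop."
generated_summary = "Prefer StringBuilder for string concatenation inside loops."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)

for metric, result in scores.items():
    # Each result holds precision, recall, and F1 (fmeasure).
    print(f"{metric}: F1={result.fmeasure:.4f}")
```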
