Information Retrieval (IR) plays a pivotal role in diverse Software Engineering (SE) tasks, e.g., bug localization and triaging, code retrieval, and requirements analysis. The choice of similarity measure is a core component of any IR technique, and the performance of an IR method critically depends on selecting a measure appropriate for the given application domain. Since different SE tasks operate on different document types (e.g., bug reports, software descriptions, source code) that often contain non-standard, domain-specific vocabulary, it is essential to understand which similarity measures work best for which SE documents.
This paper presents two case studies on the effect of different similarity measures on various SE documents w.r.t. two tasks: (i) project recommendation: finding similar GitHub projects, and (ii) bug localization: retrieving the buggy source file(s) corresponding to a bug report. These tasks involve a diverse combination of textual (e.g., description, README) and code (e.g., source code, API, import package) artifacts. We observe that the performance of IR models varies across artifact types. In general, context-aware models achieve better performance on textual artifacts. In contrast, simple keyword-based bag-of-words models perform better on code artifacts, while the probabilistic ranking model BM25 performs better on a mixture of text and code artifacts.
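To make the BM25 ranking mentioned above concrete, the sketch below scores a short bug-report-style query against a few tokenized documents using the standard Okapi BM25 formula. This is an illustrative implementation only, not the paper's experimental setup; the example documents, query, and parameter defaults (k1=1.5, b=0.75) are assumptions chosen for demonstration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the query
    using Okapi BM25 (higher score = more relevant)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query_tokens)}
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequencies in this document
        s = 0.0
        for t in query_tokens:
            # smoothed inverse document frequency
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # length-normalized term-frequency saturation
            denom = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores

# Hypothetical mini-corpus: commit/report snippets as token lists.
docs = [
    "null pointer exception in parser module".split(),
    "update README with build instructions".split(),
    "fix crash caused by null pointer in parser".split(),
]
query = "null pointer parser crash".split()
print(bm25_scores(query, docs))
```

Here the third document shares the most query terms and receives the highest score, illustrating how a keyword-sensitive probabilistic model ranks mixed text/code artifacts without any semantic embedding.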
We further investigate how such an informed choice of similarity measure impacts the performance of SE tools. In particular, we analyze two previously proposed tools for the project recommendation and bug localization tasks, both of which leverage diverse software artifacts, and observe that an informed choice of similarity measure indeed improves the performance of these existing SE tools.