Using the Lexicon from Source Code to Determine Application Domain

EASE (2020)

Abstract
Context: The vast majority of software engineering research is reported independently of the application domain: the use of techniques and tools is reported without any domain context. As reported in previous research, this has not always been so: early in the computing era, the research focus was frequently application domain specific (for example, scientific and data processing). Objective: We believe determining the research context is often important. Therefore, we propose a code-based approach to identify the application domain of a software system via its lexicon. We compare its use against the plain textual description attached to the same system. Method: Using a sample of 50 Java projects, we obtained i) the description of each project (e.g., its ReadMe file), ii) the lexicon extracted from its source code, and iii) a list of its main topics extracted with the Latent Dirichlet Allocation (LDA) modelling technique. We assigned a random subset of these data items to different researchers (i.e., 'experts'), and asked them to assign each item to one or more application domains. We then evaluated the precision and accuracy of the three techniques. Results: Using the agreement levels between experts, we observed that the 'baseline' dataset (i.e., the ReadMe files) obtained the highest average agreement between experts, but we also observed that the three techniques had the same mode and median agreement levels. Additionally, in the cases where no agreement was reached for the baseline dataset, the two other techniques provided sufficient additional support. Conclusions: We conclude that the source code is sufficient for determining the application domain, so that classification is possible without special documentation requirements.
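The abstract does not say how the lexicon was extracted from the source code. As an illustration only, the following is a minimal sketch of one common way to do it: collect identifier-like tokens from a Java file, split them on camelCase and underscores, and count the resulting terms. The function names (`split_identifier`, `extract_lexicon`), the abbreviated keyword list, and the choice to scan comments and string literals along with code are all assumptions of this sketch, not the authors' method.

```python
import re
from collections import Counter

# Abbreviated list of Java keywords to discard (an assumption of this sketch;
# a real pipeline would use the full keyword list and probably a stop-word list).
JAVA_KEYWORDS = {
    "public", "private", "protected", "class", "interface", "static", "final",
    "void", "int", "long", "double", "float", "boolean", "char", "byte", "short",
    "new", "return", "if", "else", "for", "while", "do", "switch", "case",
    "break", "continue", "import", "package", "extends", "implements", "this",
    "super", "try", "catch", "finally", "throw", "throws", "null", "true", "false",
}

# Identifier-like tokens. Because the regex runs over the whole file, words in
# comments and string literals are picked up as well.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Word pieces inside an identifier: an acronym run ("HTTP") or an optionally
# capitalised lowercase word ("parse", "Response"). Underscores and digits
# are skipped because they match neither alternative.
WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase terms."""
    return [word.lower() for word in WORD.findall(identifier)]


def extract_lexicon(java_source):
    """Return a Counter of lowercase terms found in a Java source string."""
    counts = Counter()
    for identifier in IDENTIFIER.findall(java_source):
        if identifier in JAVA_KEYWORDS:
            continue
        for term in split_identifier(identifier):
            # Drop single letters (loop indices, generics) and keyword fragments.
            if len(term) > 1 and term not in JAVA_KEYWORDS:
                counts[term] += 1
    return counts
```

Calling `extract_lexicon` on the text of each file and summing the counters would give a per-project term frequency table, which could then be shown to an expert or fed to a topic model such as LDA.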