
Do Not Have Enough Data? An Easy Data Augmentation for Code Summarization

2022 IEEE 13th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP) (2022)

Abstract
Code comments improve the readability and intelligibility of code, helping developers understand programs and making software maintenance and evolution more efficient. Unfortunately, comments in software projects are often mismatched, missing, or outdated, which makes it harder for developers to infer functionality from source code and slows maintenance and evolution. To solve this problem, many source code summarization algorithms have been proposed. However, these methods usually try to collect a large dataset that maps source code to code comments in order to train models, so the effectiveness of the models often relies on the quality of the training data. Training sets face two limitations: insufficient data collection (i.e., the need to generate a large amount of noise-free training data) and data distribution bias (i.e., the need to generate training data for infrequently used methods). To address these issues, we propose a data augmentation method for code comments, named CDA-CS. Extensive experiments on Java and Python projects collected from GitHub are conducted to evaluate the performance of CDA-CS. When trained on the augmented dataset, state-of-the-art algorithms gain a further 1.37% to 2.24% improvement across different evaluation metrics (i.e., BLEU-4, METEOR, ROUGE-L) with no additional cost.
Keywords
Code Summary, Data Augmentation, Clustering, Word Replacement