Revisiting Information Retrieval and Deep Learning Approaches for Code Summarization

Tingwei Zhu, Zhong Li, Minxue Pan, Chaoxuan Shi, Tian Zhang, Yu Pei, Xuandong Li

ICSE Companion (2023)

Abstract
Code summarization refers to the procedure of creating short descriptions that outline the semantics of source code snippets. Existing code summarization approaches can be broadly classified into Information Retrieval (IR)-based and Deep Learning (DL)-based approaches. However, their effectiveness, and especially their strengths and weaknesses, remain largely understudied. Existing evaluations use different benchmarks and metrics, making performance comparisons of these approaches susceptible to bias and potentially yielding misleading results. For example, DL-based approaches typically show better code summarization performance in their original papers [1], [2], yet Gros et al. [3] report that a naive IR approach can achieve comparable (or even better) performance. In addition, some recent work [4], [5] suggests that incorporating IR techniques can improve DL-based approaches. To further advance code summarization techniques, it is critical to understand how IR-based and DL-based approaches perform on different datasets and under different metrics. Prior work has studied some aspects of code summarization, such as the factors affecting performance evaluation [6] and the importance of data preprocessing [7].

In this paper, we study IR-based and DL-based code summarization approaches to improve the understanding and design of more advanced techniques. We first compare the IR-based and DL-based approaches under the same experimental settings and benchmarks, then study their strengths and limitations through quantitative and qualitative analyses. Finally, we propose a simple but effective strategy that combines IR and DL to further improve code summarization.

Four IR-based approaches and two DL-based approaches are investigated, selected for representativeness and diversity. For the IR-based approaches, we select three BM25-based approaches (i.e., BM25-spl, BM25-ast, and BM25-alpha) and one nearest neighbor-based approach, NNGen [8], all of which are often used as baselines in prior work. They retrieve the most similar code from the database and directly output the corresponding summary. The BM25-based approaches are implemented with Lucene [9] and differ in the code form they take as input: BM25-spl splits CamelCase and snake_case tokens in the original source code, BM25-ast obtains sequence representations via pre-order Abstract Syntax Tree (AST) traversal, and BM25-alpha keeps only the alphabetic tokens in the code. For the DL-based approaches, we choose the state-of-the-art pre-trained model PLBART [2] and the trained-from-scratch model SiT [1].

We adopt four widely used Java datasets, namely TLC [10], CSN [11], HDC [12], and FCM [13], as our subject datasets. TLC and HDC are method-split datasets, where methods in the same project are randomly split into training/validation/test sets. CSN and FCM are project-split datasets, where examples from the same project exist in only one partition. We further process the four datasets to build cleaner versions by removing examples that have syntax errors, empty method bodies, or overly long or short sequence lengths, and we remove duplicate examples from the validation and test sets.

To comprehensively and systematically evaluate the performance of the code summarization approaches, we adopt three widely used metrics, i.e., BLEU (both C-BLEU and S-BLEU), ROUGE, and METEOR, in our experiments.
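As an illustration of the sub-token splitting that BM25-spl performs before indexing, a minimal Python sketch is given below. The helper name and the exact regular expression are our assumptions for illustration; the paper only states that CamelCase and snake_case tokens are split.

```python
import re

def split_identifiers(code_tokens):
    """Split snake_case and CamelCase tokens into lowercase sub-tokens.

    Hypothetical helper sketching BM25-spl's preprocessing; the regex
    below is an assumption, not taken from the paper.
    """
    sub_tokens = []
    for tok in code_tokens:
        for part in tok.split("_"):          # snake_case -> parts
            sub_tokens.extend(               # CamelCase -> sub-words
                re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
            )
    return [t.lower() for t in sub_tokens if t]

print(split_identifiers(["getMaxValue", "user_id", "HTMLParser"]))
# -> ['get', 'max', 'value', 'user', 'id', 'html', 'parser']
```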
Effectiveness. We conduct a comprehensive comparison of the six studied approaches under exactly the same settings and datasets. Table I shows the experimental results obtained on the four subject datasets in terms of four metrics. Comparing across metrics in Table I, we observe large variations in approach rankings and score gaps depending on the metric used for evaluation. DL-based approaches generally achieve better performance than IR-based approaches in terms of METEOR and ROUGE-L, while IR-based approaches achieve comparable or even better C-BLEU scores but lower S-BLEU scores than the DL-based approaches. This shows that the choice of metric has a large impact on evaluation results, and multiple metrics are needed to evaluate code summarization approaches fairly. Across datasets, the pre-trained DL-based approach PLBART performs best among the six approaches studied. On the other hand, the IR-based approaches, despite their simplicity, also achieve comparable or even better performance, especially on method-split datasets: for example, the C-BLEU scores of BM25-spl on TLC and HDC are the highest among all approaches. Therefore, although DL-based approaches usually show better code summarization performance, the capabilities of IR-based approaches should not be overlooked.

Strengths. To evaluate how the similarity between training and test code affects the performance of the approaches, we use the Retrieval-Similarity metric, as Rencos [4] did, to measure the token-level similarity between a test code and its most similar training code. Based on this, we examine how the BLEU score of each approach varies with the Retrieval-Similarity value on the four subject datasets. Figure 1 shows the results, from which we observe that IR-based approaches outperform DL-based ones when Retrieval-Similarity values are high. Through qualitative analysis of examples with high retrieval similarity, we find that, due to code cloning, similar code snippets tend to have similar summaries, so IR-based approaches perform better on examples with high Retrieval-Similarity values.

Integration. Motivated by these findings, we design a simpler integration approach: we use Retrieval-Similarity to decide whether the IR or the DL approach should generate the summary for the input code. Specifically, we first use Lucene to retrieve the code most similar to the input and compute the Retrieval-Similarity value between them. If the value is above a similarity threshold, we directly use the IR summary; otherwise, we use the DL model's output. To determine the threshold, we conduct a grid search on the validation set and pick the value that achieves the highest metric score. We choose the best DL model, PLBART, and the best IR approach, BM25-spl, for integration and evaluate our approach on all four cleaned datasets. The effectiveness results are shown in the 'Ours' row of Table I: our integration is effective and achieves state-of-the-art results. Not only does it outperform each single approach, but its scores also exceed all previous highest scores in our experiments on all metrics across all datasets.
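The threshold-based integration can be summarized by the following minimal sketch. The similarity formula (a simple token-overlap ratio), the function names, and the 0.0-1.0 threshold grid are assumptions for illustration; the paper follows the token-level Retrieval-Similarity used by Rencos [4] and selects the threshold by grid search on the validation set.

```python
def retrieval_similarity(test_tokens, retrieved_tokens):
    """Token-level similarity between a test code and its retrieved code.
    A Jaccard-style overlap is assumed here; the paper uses Rencos's
    definition, which may differ."""
    a, b = set(test_tokens), set(retrieved_tokens)
    return len(a & b) / max(len(a | b), 1)

def summarize(code_tokens, ir_retrieve, dl_generate, threshold):
    """Emit the IR summary when the retrieved code is similar enough,
    otherwise fall back to the DL model (BM25-spl and PLBART in the paper)."""
    retrieved_code, retrieved_summary = ir_retrieve(code_tokens)
    if retrieval_similarity(code_tokens, retrieved_code) >= threshold:
        return retrieved_summary       # IR branch
    return dl_generate(code_tokens)    # DL branch

def pick_threshold(validation_set, ir_retrieve, dl_generate, score_fn):
    """Grid search: keep the threshold with the best validation metric score.
    `score_fn(candidates, references)` stands in for BLEU/ROUGE/METEOR."""
    grid = [i / 10 for i in range(11)]  # assumed grid: 0.0, 0.1, ..., 1.0
    return max(
        grid,
        key=lambda t: score_fn(
            [summarize(code, ir_retrieve, dl_generate, t)
             for code, _ in validation_set],
            [ref for _, ref in validation_set],
        ),
    )
```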
In summary, our study shows that the IR and DL approaches have their own strengths under different metrics and on different datasets. Although IR-based approaches are simpler, they can still achieve comparable or even better performance in some cases, especially in the presence of highly similar code. Based on these findings, we propose a simple integration strategy that achieves state-of-the-art results. Our study shows that focusing on the DL model alone is not enough; taking advantage of IR approaches is a promising direction for improving code summarization. Future work should explore incorporating more types of information and more advanced integration methods.
Keywords
Code summarization, empirical study, information retrieval, deep learning