Deep Just-In-Time Inconsistency Detection Between Comments and Source Code


Abstract:

Natural language comments convey key aspects of source code such as implementation, usage, and pre- and post-conditions. Failure to update comments accordingly when the corresponding code is modified introduces inconsistencies, which are known to lead to confusion and software bugs. In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code, in order to catch potential inconsistencies just-in-time, i.e., before they are committed to a version control system.
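To make the task concrete, the following is a hypothetical instance of the kind targeted here: a @return comment paired with the before and after versions of its method, labeled for whether the code change makes the comment inconsistent. The method, comment, and field names are illustrative inventions, not drawn from the paper's corpus.

    # Hypothetical example instance (illustrative, not from the paper's corpus).
    # label = 1 means the code change makes the comment inconsistent.
    old_comment = "/** @return the user name, or null if no user is logged in. */"
    old_code = ("public String getUserName() { "
                "return user == null ? null : user.getName(); }")
    new_code = ("public String getUserName() { "
                "if (user == null) throw new IllegalStateException(); "
                "return user.getName(); }")
    label = 1  # the "or null" clause no longer matches the edited method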

Introduction
  • Comments serve as a critical communication medium for developers, facilitating program comprehension and code maintenance tasks (Buse and Weimer 2010; de Souza, Anquetil, and de Oliveira 2005).
  • Code is highly dynamic in nature, with developers constantly making changes to address bugs and feature requests.
  • Prior research has predominantly focused on detecting inconsistencies that already reside in a software project, within the code repository.
  • The authors refer to this as post hoc inconsistency detection since it occurs potentially many commits after the inconsistency has been introduced.
Highlights
  • Comments serve as a critical communication medium for developers, facilitating program comprehension and code maintenance tasks (Buse and Weimer 2010; de Souza, Anquetil, and de Oliveira 2005)
  • Because inconsistent comments generally arise as a consequence of developers failing to update comments immediately following code changes (Wen et al 2019), we aim to detect whether a comment becomes inconsistent as a result of changes to the accompanying code, before these changes are merged into a code repository.
  • (2) For training and evaluation, we construct a large corpus of comments paired with code changes in the corresponding methods, encompassing multiple types of method comments and consisting of 40,688 examples extracted from 1,518 open-source Java projects. (3) We demonstrate the value of inconsistency detection in a comprehensive automatic comment maintenance system, and we show how our approach can support such a system.
  • We developed a deep learning approach for just-in-time inconsistency detection between code and comments by learning to relate comments and code changes.
  • Based on evaluation on a large corpus consisting of multiple types of comments, we showed that our model substantially outperforms various baselines as well as post hoc models that do not consider code changes.
  • We further conduct extrinsic evaluation in which we demonstrate that our approach can be used to build a comprehensive comment maintenance system that can detect and resolve inconsistent comments.
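To make the maintenance system concrete, below is a minimal sketch of how detection composes with an update model such as that of Panthaplackel et al (2020b). The detector and updater callables and their signatures are assumptions for illustration, not the paper's actual API.

    # Detect-then-resolve loop for just-in-time comment maintenance.
    # `detector` and `updater` are hypothetical stand-ins for a trained
    # inconsistency detection model and a comment update model.
    def maintain_comment(comment, old_method, new_method, detector, updater):
        if detector(comment, old_method, new_method):  # True if inconsistent
            # Resolve the inconsistency by generating an updated comment
            # conditioned on the code change.
            return updater(comment, old_method, new_method)
        return comment  # consistent comments are left untouched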
Conclusion
  • The authors developed a deep learning approach for just-in-time inconsistency detection between code and comments by learning to relate comments and code changes.
  • Based on evaluation on a large corpus consisting of multiple types of comments, the authors showed that the model substantially outperforms various baselines as well as post hoc models that do not consider code changes.
  • The authors further conduct extrinsic evaluation in which they demonstrate that the approach can be used to build a comprehensive comment maintenance system that can detect and resolve inconsistent comments.
Summary
  • Introduction:

    Comments serve as a critical communication medium for developers, facilitating program comprehension and code maintenance tasks (Buse and Weimer 2010; de Souza, Anquetil, and de Oliveira 2005).
  • Code is highly dynamic in nature, with developers constantly making changes to address bugs and feature requests.
  • Prior research has predominantly focused on detecting inconsistencies that already reside in a software project, within the code repository.
  • The authors refer to this as post hoc inconsistency detection since it occurs potentially many commits after the inconsistency has been introduced.
  • Objectives:

    The authors aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code, in order to catch potential inconsistencies just-in-time, i.e., before they are committed to a version control system.
  • Because inconsistent comments generally arise as a consequence of developers failing to update comments immediately following code changes (Wen et al 2019), the authors aim to detect whether a comment becomes inconsistent as a result of changes to the accompanying code, before these changes are merged into a code repository.
  • The authors aim to determine whether a comment C is inconsistent by understanding its semantics and how it relates to the corresponding method M; one way to make the change to M explicit is sketched after this summary.
  • Results:

    Detection is evaluated with precision, recall, F1, and accuracy. The update metrics come from Panthaplackel et al (2020b), including exact match (xMatch) as well as metrics used to evaluate text generation (BLEU-4 (Papineni et al 2002) and METEOR (Banerjee and Lavie 2005)) and text editing tasks (SARI (Xu et al 2016) and GLEU (Napoles et al 2015)).
  • Since the dataset is balanced, the authors can obtain 50% exact match simply by copying C: in half of the examples the comment needs no update, so the unchanged comment is already correct.
  • This can even beat Panthaplackel et al (2020b) on xMatch, METEOR, BLEU-4, SARI, and GLEU.
  • Because Panthaplackel et al (2020b) is designed to always edit, it can perform well on this metric; the majority of the pretrained and jointly trained systems can beat it.
  • Conclusion:

    The authors developed a deep learning approach for just-in-time inconsistency detection between code and comments by learning to relate comments and code changes.
  • Based on evaluation on a large corpus consisting of multiple types of comments, the authors showed that the model substantially outperforms various baselines as well as post hoc models that do not consider code changes.
  • The authors further conduct extrinsic evaluation in which they demonstrate that the approach can be used to build a comprehensive comment maintenance system that can detect and resolve inconsistent comments.
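As background for the explicit code-edit representations analyzed in Table 11 (see the Objectives above), one way to make a code change explicit is to linearize a token-level diff between the old and new method into an edit-action sequence. The sketch below uses Python's difflib; the <KEEP>/<DELETE>/<INSERT> action names are illustrative assumptions, not necessarily the paper's exact scheme.

    import difflib

    # Linearize a token-level diff into a single edit-action sequence.
    def edit_sequence(old_tokens, new_tokens):
        actions = []
        matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                actions += ["<KEEP>"] + old_tokens[i1:i2]
            elif tag == "delete":
                actions += ["<DELETE>"] + old_tokens[i1:i2]
            elif tag == "insert":
                actions += ["<INSERT>"] + new_tokens[j1:j2]
            else:  # "replace": treat as a deletion followed by an insertion
                actions += ["<DELETE>"] + old_tokens[i1:i2]
                actions += ["<INSERT>"] + new_tokens[j1:j2]
        return actions

    old = "public int size ( ) { return n ; }".split()
    new = "public long size ( ) { return count ; }".split()
    print(edit_sequence(old, new))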
Tables
  • Table1: Data partitions
  • Table2: Results for baselines, post hoc, and just-in-time models. Differences in F1 and Acc between just-in-time vs. baseline models, just-in-time vs. post hoc models, and just-in-time + features vs. just-in-time models are statistically significant
  • Table3: Evaluating performance with respect to different types of comments. Scores are averaged across 3 random restarts, and scores for which the difference in performance is not statistically significant are shown with identical symbols
  • Table4: Results on joint inconsistency detection and update on the cleaned test sample. Scores for which the difference in performance is not statistically significant are shown with identical symbols
  • Table5: Dataset sizes before downsampling
  • Table6: Statistics on the average lengths of comment and code representations
  • Table7: Results for @return examples. Scores for which the difference in performance is not statistically significant are shown with identical symbols
  • Table8: Results for @param examples. Scores for which the difference in performance is not statistically significant are shown with identical symbols
  • Table9: Results for summary comment examples. Scores for which the difference in performance is not statistically significant are shown with identical symbols
  • Table10: Results on joint inconsistency detection and update on the full test set. Scores for which the difference in performance is not statistically significant are shown with identical symbols
  • Table11: Analyzing implicit code edit representations. Differences in F1 and Acc between just-in-time (explicit) models vs. post hoc models and just-in-time (explicit) vs. just-in-time (implicit) are statistically significant
  • Table12: Results for various configurations of Update w/ implicit detection on the cleaned test sample. Scores for which the difference in performance is not statistically significant are shown with identical symbols
  • Table13: Results for various configurations of Update w/ implicit detection on the full test set. Scores for which the difference in performance is not statistically significant are shown with identical symbols
  • Table14: Average number of training epochs, and training time for inconsistency detection models as well as the combined detection+update models. Note that we used two different types of GPUs for these experiments, and therefore, the times are not necessarily comparable across models. Additionally, following every epoch of training the detection-only models, we compute precision, recall, F1, and Acc (using scikit-learn) on the validation data (as this determines the training termination condition), which adds to the computation time
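Table 14 mentions computing precision, recall, F1, and accuracy with scikit-learn on the validation data after every epoch; a minimal sketch of that computation on hypothetical labels (1 = inconsistent, 0 = consistent) looks as follows.

    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    # Illustrative validation labels and predictions.
    y_true = [1, 0, 1, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1]

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary")
    acc = accuracy_score(y_true, y_pred)
    print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} Acc={acc:.2f}")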
Related work
  • Code/Comment Inconsistencies: Previous studies analyze the co-evolution of comments and code to understand how inconsistencies emerge (Fluri et al 2009; Jiang and Hassan 2006; Ibrahim et al 2012; Fluri, Wursch, and Gall 2007) and the various types of inconsistencies (Wen et al 2019); however, they do not propose techniques for addressing the problem.
  • Post Hoc Inconsistency Detection: Prior work proposes rule-based approaches for detecting pre-existing inconsistencies, which are tailored towards specific domains. These include inconsistencies relating to locks (Tan et al 2007), interrupts (Tan, Zhou, and Padioleau 2011), null exceptions for method parameters (Zhou et al 2017; Tan et al 2012), and renamed identifiers (Ratol and Robillard 2017). The comments they consider are consequently constrained to certain templates relevant to their respective domains. In contrast, we develop a general-purpose, machine learning approach that is not catered towards any specific types of inconsistencies or comments. Corazza, Maggio, and Scanniello (2018) and Cimasa et al (2019) address a broader notion of coherence between comments and code through text-similarity techniques, and Khamis, Witte, and Rilling (2010) determine whether comments, specifically @return and @param comments, conform to a particular format. We instead capture deeper code/comment relationships by learning their syntactic and semantic structures. Rabbi and Siddik (2020) recently proposed a post hoc approach, entailing a siamese network for correlating comment and code representations. In our work, we instead aim to correlate comments and code through an attention mechanism.
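For contrast with the learned approach, a generic text-similarity check in the spirit of the coherence work above can be sketched as follows. This is an illustrative bag-of-words cosine-similarity baseline, not the actual method of Corazza, Maggio, and Scanniello (2018); the tokenization and threshold are arbitrary assumptions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Post hoc coherence check: low lexical similarity between a comment and
    # its method body is taken as a sign of possible inconsistency.
    def bow_inconsistent(comment, method_body, threshold=0.1):
        vectorizer = TfidfVectorizer(token_pattern=r"\w+")
        vectors = vectorizer.fit_transform([comment, method_body])
        similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
        return similarity < threshold

    print(bow_inconsistent("returns the user name", "int size() { return n; }"))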
Results
  • In the post hoc setting, we find that our three models can achieve higher F1 scores than the bag-of-words approach proposed by Corazza, Maggio, and Scanniello (2018); however, they underperform the CodeBERT BOW (C, M) baseline.
Reference
  • Allamanis, M. 2019. The Adverse Effects of Code Duplication in Machine Learning Models of Code. In SPLASH, Onward!, 143–153.
  • Alon, U.; Brody, S.; Levy, O.; and Yahav, E. 2019. code2seq: Generating Sequences from Structured Representations of Code. In International Conference on Learning Representations.
  • Banerjee, S.; and Lavie, A. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72.
  • Berg-Kirkpatrick, T.; Burkett, D.; and Klein, D. 2012. An Empirical Investigation of Statistical Significance in NLP. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 995–1005.
  • Buse, R. P. L.; and Weimer, W. R. 2010. Learning a Metric for Code Readability. IEEE Transactions on Software Engineering 36(4): 546–558.
  • Chen, R.-C.; Yulianti, E.; Sanderson, M.; and Croft, W. B. 2017. On the Benefit of Incorporating External Features in a Neural Architecture for Answer Sentence Selection. In SIGIR Conference on Research and Development in Information Retrieval, 1017–1020.
  • Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Conference on Empirical Methods in Natural Language Processing, 1724–1734.
  • Cimasa, A.; Corazza, A.; Coviello, C.; and Scanniello, G. 2019. Word Embeddings for Comment Coherence. In Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 244–251.
  • Corazza, A.; Maggio, V.; and Scanniello, G. 2018. Coherence of Comments and Method Implementations: A Dataset and an Empirical Investigation. Software Quality Journal 26(2): 751–777.
  • Cvitkovic, M.; Singh, B.; and Anandkumar, A. 2019. Open Vocabulary Learning on Source Code with a Graph-Structured Cache. In International Conference on Machine Learning, 1475–1485.
  • de Souza, S. C. B.; Anquetil, N.; and de Oliveira, K. M. 2005. A Study of the Documentation Essential to Software Maintenance. In International Conference on Design of Communication: Documenting & Designing for Pervasive Information, 68–75.
  • Falleri, J.-R.; Morandat, F.; Blanc, X.; Martinez, M.; and Monperrus, M. 2014. Fine-Grained and Accurate Source Code Differencing. In International Conference on Automated Software Engineering, 313–324.
  • Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; and Zhou, M. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. ArXiv abs/2002.08155.
  • Fernandes, P.; Allamanis, M.; and Brockschmidt, M. 2019. Structured Neural Summarization. In International Conference on Learning Representations.
  • Fluri, B.; Wursch, M.; and Gall, H. C. 2007. Do Code and Comments Co-Evolve? On the Relation Between Source Code and Comment Changes. In Working Conference on Reverse Engineering, 70–79.
  • Fluri, B.; Wursch, M.; Giger, E.; and Gall, H. C. 2009. Analyzing the Co-Evolution of Comments and Source Code. Software Quality Journal 17(4): 367–394.
  • Hellendoorn, V. J.; Sutton, C.; Singh, R.; Maniatis, P.; and Bieber, D. 2020. Global Relational Models of Source Code. In International Conference on Learning Representations.
  • Ibrahim, W. M.; Bettenburg, N.; Adams, B.; and Hassan, A. E. 2012. On the Relationship between Comment Update Practices and Software Bugs. Journal of Systems and Software 85(10): 2293–2304.
  • Jarczyk, O.; Gruszka, B.; Jaroszewicz, S.; Bukowski, L.; and Wierzbicki, A. 2014. GitHub Projects. Quality Analysis of Open-Source Software. In International Conference on Social Informatics, 80–94.
  • Jiang, Z. M.; and Hassan, A. E. 2006. Examining the Evolution of Code Comments in PostgreSQL. In International Workshop on Mining Software Repositories, 179–180.
  • Khamis, N.; Witte, R.; and Rilling, J. 2010. Automatic Quality Assessment of Source Code Comments: The JavadocMiner. In Natural Language Processing and Information Systems, 68–79.
  • Li, Y.; Tarlow, D.; Brockschmidt, M.; and Zemel, R. S. 2015. Gated Graph Sequence Neural Networks. In International Conference on Learning Representations.
  • Liu, Z.; Chen, H.; Chen, X.; Luo, X.; and Zhou, F. 2018. Automatic Detection of Outdated Comments During Code Changes. In Annual Computer Software and Applications Conference, 154–163.
  • Malik, H.; Chowdhury, I.; Tsou, H.-M.; Jiang, Z. M.; and Hassan, A. E. 2008. Understanding the Rationale for Updating a Function's Comment. In International Conference on Software Maintenance, 167–176.
  • Movshovitz-Attias, D.; and Cohen, W. W. 2013. Natural Language Models for Predicting Programming Comments. In Annual Meeting of the Association for Computational Linguistics, 35–40.
  • Napoles, C.; Sakaguchi, K.; Post, M.; and Tetreault, J. 2015. Ground Truth for Grammatical Error Correction Metrics. In Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing, 588–593.
  • Nie, P.; Rai, R.; Li, J. J.; Khurshid, S.; Mooney, R. J.; and Gligoric, M. 2019. A Framework for Writing Trigger-Action Todo Comments in Executable Format. In Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 385–396.
  • Oracle. 2020. Javadoc. https://docs.oracle.com/javase/8/docs/technotes/tools/windows/javadoc.html.
  • Panthaplackel, S.; Gligoric, M.; Mooney, R. J.; and Li, J. J. 2020a. Associating Natural Language Comment and Source Code Entities. In AAAI Conference on Artificial Intelligence.
  • Panthaplackel, S.; Nie, P.; Gligoric, M.; Li, J. J.; and Mooney, R. J. 2020b. Learning to Update Natural Language Comments Based on Code Changes. In Annual Meeting of the Association for Computational Linguistics, 1853–1868.
  • Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Annual Meeting of the Association for Computational Linguistics, 311–318.
  • Rabbi, F.; and Siddik, M. S. 2020. Detecting Code Comment Inconsistency Using Siamese Recurrent Network. In International Conference on Program Comprehension - Early Research Achievements, 371–375.
  • Ratol, I. K.; and Robillard, M. P. 2017. Detecting Fragile Comments. In International Conference on Automated Software Engineering, 112–122.
  • Ren, X.; Xing, Z.; Xia, X.; Lo, D.; Wang, X.; and Grundy, J. 2019. Neural Network-Based Detection of Self-Admitted Technical Debt: From Performance to Explainability. Transactions on Software Engineering and Methodology 28: 1–45.
  • Sadu, A. 2019. Automatic Detection of Outdated Comments in Open Source Java Projects. Master's thesis, Universidad Politecnica de Madrid.
  • Svensson, A. 2015. Reducing Outdated and Inconsistent Code Comments During Software Development: The Comment Validator Program. Master's thesis, Uppsala University.
  • Tan, L.; Yuan, D.; Krishna, G.; and Zhou, Y. 2007. /*iComment: Bugs or Bad Comments?*/. In Symposium on Operating Systems Principles, 145–158.
  • Tan, L.; Zhou, Y.; and Padioleau, Y. 2011. aComment: Mining Annotations from Comments and Code to Detect Interrupt Related Concurrency Bugs. In International Conference on Software Engineering, 11–20.
  • Tan, S. H.; Marinov, D.; Tan, L.; and Leavens, G. T. 2012. @tComment: Testing Javadoc Comments to Detect Comment-Code Inconsistencies. In International Conference on Software Testing, Verification and Validation, 260–269.
  • Thunes, C. 2020. Javalang. https://pypi.org/project/javalang/.
  • Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention Is All You Need. In Conference on Neural Information Processing Systems, 5998–6008.
  • Wen, F.; Nagy, C.; Bavota, G.; and Lanza, M. 2019. A Large-Scale Empirical Study on Code-Comment Inconsistencies. In International Conference on Program Comprehension, 53–64.
  • Xu, W.; Napoles, C.; Pavlick, E.; Chen, Q.; and Callison-Burch, C. 2016. Optimizing Statistical Machine Translation for Text Simplification. Transactions of the Association for Computational Linguistics 4: 401–415.
  • Xuan, H. N. T.; Hieu, V. C.; and Le, A.-C. 2018. Adding External Features to Convolutional Neural Network for Aspect-based Sentiment Analysis. In Conference on Information and Computer Science, 53–59.
  • Yin, P.; Neubig, G.; Allamanis, M.; Brockschmidt, M.; and Gaunt, A. L. 2019. Learning to Represent Edits. In International Conference on Learning Representations.
  • Zhou, Y.; Gu, R.; Chen, T.; Huang, Z.; Panichella, S.; and Gall, H. 2017. Analyzing APIs Documentation and Code to Detect Directive Defects. In International Conference on Software Engineering, 27–37.