Learning to Update Natural Language Comments Based on Code Changes

ACL, pp. 1853–1868, 2020.


Abstract:

We formulate the novel task of automatically updating an existing natural language comment based on changes in the body of code it accompanies. We propose an approach that learns to correlate changes across two distinct language representations, to generate a sequence of edits that are applied to the existing comment to reflect the source code modifications.
Introduction
  • Software developers include natural language comments alongside source code as a way to document various aspects of the code such as functionality, use cases, pre-conditions, and post-conditions.
  • Inconsistency between code and comments can not only lead to time-wasting confusion in tight project schedules (Hu et al., 2018) but can also result in bugs (Tan et al., 2007).
  • To address this problem, the authors propose an approach that can automatically suggest comment updates when the associated methods are changed.
Highlights
  • Software developers include natural language comments alongside source code as a way to document various aspects of the code such as functionality, use cases, pre-conditions, and post-conditions
  • We report automatic metrics averaged across three random initializations for all learned models, and use bootstrap tests (Berg-Kirkpatrick et al., 2012) for statistical significance
  • Since automatic metrics have not yet been explored in the context of the new task we are proposing, we find it necessary to conduct human evaluation and study whether these metrics are consistent with human judgment
  • We have addressed the novel task of automatically updating an existing programming comment based on changes to the related code
  • We find that our model outperforms multiple rule-based baselines and comment generation models, with respect to several automatic metrics and human evaluation
Methods
  • The approach learns to correlate changes across two distinct language representations: edits to the source code of a method and edits to the natural language comment that documents it.
  • Rather than generating an updated comment from scratch, the model produces a sequence of edit actions that is applied to the existing comment to reflect the source code modifications (a sketch of one possible edit-action representation follows this list).
  • The full model additionally incorporates auxiliary features and reranking; Tables 4 and 5 report configurations with these components disabled or ablated.
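The paper's concrete edit-action vocabulary is not reproduced on this page, so the following is only a minimal sketch, assuming a simple Keep/Delete/Insert scheme derived with Python's difflib; the function names and tokenization are illustrative, not the authors' implementation.

```python
import difflib

def comment_edit_actions(old_tokens, new_tokens):
    """Derive a Keep/Delete/Insert action sequence that turns old_tokens
    into new_tokens, using difflib's alignment. Illustrative only: the
    paper defines its own edit-action vocabulary."""
    actions = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
            a=old_tokens, b=new_tokens).get_opcodes():
        if tag == "equal":
            actions.append(("Keep", old_tokens[i1:i2]))
        else:  # "replace", "delete", or "insert"
            if i2 > i1:
                actions.append(("Delete", old_tokens[i1:i2]))
            if j2 > j1:
                actions.append(("Insert", new_tokens[j1:j2]))
    return actions

def apply_edit_actions(old_tokens, actions):
    """Replay an action sequence against the old comment."""
    out = []
    for op, span in actions:
        if op in ("Keep", "Insert"):
            out.extend(span)
        # "Delete" consumes old tokens without emitting anything.
    return out

old = "@return the list of complex values".split()
new = "@return the complex value".split()
actions = comment_edit_actions(old, new)
assert apply_edit_actions(old, actions) == new
```

In this spirit, a gold action sequence can be derived from the old and new comments at training time; at test time the model predicts the actions conditioned on the code changes, so only the old comment is needed.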
Results
  • The authors report automatic metrics averaged across three random initializations for all learned models, and use bootstrap tests (Berg-Kirkpatrick et al., 2012) for statistical significance (a sketch of such a test follows this list).
  • In the human evaluation, users were shown the previous and updated versions of each method (e.g., public Complex getComplex() { return get(); }) alongside the comment suggestions that were produced by each model.
  • Users selected none of the suggested comments 55% of the time, indicating there are many cases for which either the existing comment did not need updating, or comments produced by all models were poor.
  • Despite efforts to minimize such cases in the dataset through rule-based filtering, the authors found that many remain.
  • This suggests that it would be beneficial to first train a classifier that determines whether a comment needs to be updated before proposing a revision.
  • The cases in which the existing comment does need to be updated but none of the models produce a reasonable prediction illustrate the remaining room for improvement on the proposed task.
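The exact configuration of the significance test is not given on this page; below is a minimal sketch of one common paired-bootstrap variant over per-example scores, in the spirit of Berg-Kirkpatrick et al. (2012). The sample count, seed, and score lists are placeholder assumptions.

```python
import random

def paired_bootstrap(scores_a, scores_b, num_samples=10000, seed=0):
    """Paired bootstrap test over per-example metric scores for two
    systems evaluated on the same test set. Returns the observed mean
    difference (A minus B) and the fraction of resamples in which A's
    advantage disappears, used as a p-value-style estimate."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    observed = sum(a - b for a, b in zip(scores_a, scores_b)) / n
    losses = 0
    for _ in range(num_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] - scores_b[i] for i in idx) / n <= 0:
            losses += 1
    return observed, losses / num_samples

# Hypothetical usage with per-example GLEU scores for two models:
# delta, p = paired_bootstrap(edit_model_gleu, baseline_gleu)
# significant = p < 0.05
```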
Conclusion
  • The authors have addressed the novel task of automatically updating an existing programming comment based on changes to the related code.
  • The authors designed a new approach for this task which aims to correlate cross-modal edits in order to generate a sequence of edit actions specifying how the comment should be updated.
  • The authors find that the model outperforms multiple rule-based baselines and comment generation models, with respect to several automatic metrics and human evaluation
Tables
  • Table 1: Number of examples, projects, and edit actions; average similarity between M_old and M_new (the old and new versions of a method) as the ratio of overlap; average similarity between C_old and C_new (the old and new versions of a comment) as the ratio of overlap; number of unique code tokens and the mean and median number of tokens in a method; and number of unique comment tokens and the mean and median number of tokens in a comment.
  • Table 2: Exact match, METEOR, BLEU-4, SARI, and GLEU scores (a sketch of computing these metrics follows this list). Scores for which the difference in performance is not statistically significant (p < 0.05) are indicated with matching symbols.
  • Table 3: Percentage of annotations for which users selected comment suggestions produced by each model. All differences are statistically significant (p < 0.05).
  • Table 4: Exact match, METEOR, BLEU-4, SARI, and GLEU for various combinations of code input and target comment output configurations. Features and reranking are disabled for all models. Scores for which the difference in performance is not statistically significant (p < 0.05) are indicated with matching symbols.
  • Table 5: Exact match, METEOR, BLEU-4, SARI, and GLEU scores of ablated models. Scores for which the difference in performance is not statistically significant (p < 0.05) are indicated with matching symbols.
  • Table 6: Total number of edit actions; average number of edit actions per example; and the percentage of total actions accounted for by each edit action type.
  • Table 7: Examples from open-source software projects. For each example, we show the diff between the two versions of the method (left: old version; right: new version; diff lines are highlighted), the existing @return comment prior to being updated (left), and, from top to bottom on the right, the predictions made by the return type substitution w/ null handling baseline, the reranked generation model, and the reranked edit model, followed by the gold updated comment.
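For readers who want to reproduce the overlap-based scores reported in Tables 2, 4, and 5, here is a minimal sketch using NLTK with hypothetical tokens. Exact match is computed directly; note that NLTK's sentence_gleu implements the Google-BLEU variant of GLEU, which may differ from the Napoles et al. (2015) formulation the paper cites, and SARI is provided by the cited EASSE toolkit rather than NLTK.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.gleu_score import sentence_gleu

# Hypothetical gold updated comment and model prediction.
reference = "@return the complex value".split()
prediction = "@return the list of complex values".split()

# Exact match: prediction must reproduce the gold comment verbatim.
exact_match = prediction == reference

# BLEU-4: geometric mean of 1- to 4-gram precisions; smoothing keeps
# short comments with no 4-gram overlap from scoring exactly zero.
bleu4 = sentence_bleu([reference], prediction,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)

# GLEU as implemented in NLTK (Google-BLEU); treat as an approximation
# of the GLEU metric used in the paper.
gleu = sentence_gleu([reference], prediction)

print(exact_match, round(bleu4, 3), round(gleu, 3))
```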
Related work
  • Learning from source code changes: Lee et al. (2019) use rule-based techniques to automatically detect and revise outdated API names in code documentation; however, their approach cannot be extended to the full natural language comments that are the focus of this work. Zhai et al. (2020) propose a technique for updating incomplete and buggy comments by propagating comments from other code elements (e.g., variables, methods, classes) based on program analysis and several heuristics; rather than simply copying a related comment, we aim to revise an outdated comment by reasoning about code changes. Yin et al. (2019) present an approach for learning structural and semantic properties of source code edits so that they can be generalized to new code inputs. Like them, we learn vector representations of source code changes; unlike their setting, however, we apply these representations to natural language. Prior work on automatic commit message generation learns from code changes in order to generate a natural language summary of those changes (Loyola et al., 2017; Jiang et al., 2017; Xu et al., 2019). Instead of generating natural language content from scratch as done in that work, we focus on applying edits to existing natural language text, and we show that generating a comment from scratch does not perform as well as our proposed edit model in the comment update setting.
  • Editing natural language text: Approaches for editing natural language text have been studied extensively through tasks such as sentence simplification (Dong et al., 2019), style transfer (Li et al., 2018), grammatical error correction (Awasthi et al., 2019), and language modeling (Guu et al., 2018). The focus of this prior work is to revise sentences to conform to stylistic and grammatical conventions, and it does not generally consider broader contextual constraints. In contrast, our goal is not to make cosmetic revisions to a given span of text, but rather to amend its semantic meaning so that it stays in sync with a separate body of information on which it depends. More recently, Shah et al. (2020) proposed an approach for rewriting an outdated sentence based on a sentence stating a new factual claim, which is more closely aligned with our task; in our case, however, the separate body of information is not natural language and is generally much longer than a single sentence.
Funding
  • This work was partially supported by a Google Faculty Research Award and the US National Science Foundation under Grant Nos
Reference
  • Miltiadis Allamanis. 2019. The adverse effects of code duplication in machine learning models of code. In SPLASH, Onward!, pages 143–153.
  • Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning, pages 2091–2100.
  • Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019. code2seq: Generating sequences from structured representations of code. In International Conference on Learning Representations.
  • Fernando Alva-Manchego, Louis Martin, Carolina Scarton, and Lucia Specia. 2019. EASSE: Easier automatic sentence simplification evaluation. In Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing: System Demonstrations, pages 49–54.
  • Abhijeet Awasthi, Sunita Sarawagi, Rasna Goyal, Sabyasachi Ghosh, and Vihari Piratla. 2019. Parallel iterative edit models for local sequence transduction. In Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing, pages 4251–4261.
  • Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
  • Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. An empirical investigation of statistical significance in NLP. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 995–1005.
  • Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing, pages 1724–1734.
  • Yue Dong, Zichao Li, Mehdi Rezagholizadeh, and Jackie Chi Kit Cheung. 2019. EditNTS: A neural programmer-interpreter model for sentence simplification through explicit editing. In Annual Meeting of the Association for Computational Linguistics, pages 3393–3402.
  • Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Structured neural summarization. In International Conference on Learning Representations.
  • Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. 2018. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450.
  • Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep code comment generation. In International Conference on Program Comprehension, pages 200–210.
  • Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Annual Meeting of the Association for Computational Linguistics, pages 2073–2083.
  • Siyuan Jiang, Ameer Armaly, and Collin McMillan. 2017. Automatically generating commit messages from diffs using neural machine translation. In International Conference on Automated Software Engineering, pages 135–146.
  • Wei-Jen Ko, Greg Durrett, and Junyi Jessy Li. 2019. Linguistically-informed specificity and semantic plausibility for dialogue generation. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3456–3466.
  • Klaus Krippendorff. 2011. Computing Krippendorff's alpha reliability. Technical report, University of Pennsylvania.
  • Reno Kriz, João Sedoc, Marianna Apidianaki, Carolina Zheng, Gaurav Kumar, Eleni Miltsakaki, and Chris Callison-Burch. 2019. Complexity-weighted loss and diverse reranking for sentence simplification. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3137–3147.
  • Seonah Lee, Rongxin Wu, S.C. Cheung, and Sungwon Kang. 2019. Automatic detection and update suggestion for outdated API names in documentation. IEEE Transactions on Software Engineering.
  • Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1865–1874.
  • Junyi Jessy Li and Ani Nenkova. 2015. Fast and accurate prediction of sentence specificity. In AAAI Conference on Artificial Intelligence, pages 2281–2287.
  • Yuding Liang and Kenny Q. Zhu. 2018. Automatic generation of text descriptive comments for code blocks. In AAAI Conference on Artificial Intelligence, pages 5229–5236.
  • Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.
  • Pablo Loyola, Edison Marrese-Taylor, and Yutaka Matsuo. 2017. A neural architecture for generating natural language descriptions from source code changes. In Annual Meeting of the Association for Computational Linguistics, pages 287–292.
  • Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.
  • Dana Movshovitz-Attias and William W. Cohen. 2013. Natural language models for predicting programming comments. In Annual Meeting of the Association for Computational Linguistics, pages 35–40.
  • Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing, pages 588–593.
  • Graham Neubig, Makoto Morishita, and Satoshi Nakamura. 2015. Neural reranking improves subjective quality of machine translation: NAIST at WAT2015. In Workshop on Asian Translation, pages 35–41.
  • Sheena Panthaplackel, Milos Gligoric, Raymond J. Mooney, and Junyi Jessy Li. 2020. Associating natural language comment and source code entities. In AAAI Conference on Artificial Intelligence.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  • Rebecca Passonneau. 2006. Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In International Conference on Language Resources and Evaluation.
  • Inderjot Kaur Ratol and Martin P. Robillard. 2017. Detecting fragile comments. In International Conference on Automated Software Engineering, pages 112–122.
  • Darsh J. Shah, Tal Schuster, and Regina Barzilay. 2020. Automatic fact-guided sentence modification. In AAAI Conference on Artificial Intelligence.
  • Richard Shin, Illia Polosukhin, and Dawn Song. 2018. Towards specification-directed program repair. In International Conference on Learning Representations Workshop.
  • Akhilesh Sudhakar, Bhargav Upadhyay, and Arjun Maheswaran. 2019. "Transforming" delete, retrieve, generate approach for controlled text style transfer. In Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing.
  • Lin Tan, Ding Yuan, Gopal Krishna, and Yuanyuan Zhou. 2007. /*iComment: Bugs or bad comments?*/. In Symposium on Operating Systems Principles, pages 145–158.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.
  • Shengbin Xu, Yuan Yao, Feng Xu, Tianxiao Gu, Hanghang Tong, and Jian Lu. 2019. Commit message generation for source code changes. In International Joint Conference on Artificial Intelligence, pages 3975–3981.
  • Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.
  • Ziyu Yao, Daniel S. Weld, Wei-Peng Chen, and Huan Sun. 2018. StaQC: A systematically mined question-code dataset from Stack Overflow. In International Conference on World Wide Web, pages 1693–1703.
  • Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from Stack Overflow. In International Conference on Mining Software Repositories, pages 476–486.
  • Pengcheng Yin, Graham Neubig, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L. Gaunt. 2019. Learning to represent edits. In International Conference on Learning Representations.
  • Juan Zhai, Xiangzhe Xu, Yu Shi, Guanhong Tao, Minxue Pan, Shiqing Ma, Lei Xu, Weifeng Zhang, Lin Tan, and Xiangyu Zhang. 2020. CPC: Automatically classifying and propagating natural language comments via program analysis. In International Conference on Software Engineering.