Structural Language Models of Code
ICML, pp. 245-256, 2020.
Abstract:
We address the problem of any-code completion – generating a missing piece of source code in a given program without any restriction on the vocabulary or structure. We introduce a new approach to any-code completion that leverages the strict syntax of programming languages to model a code snippet as a tree – structural language modeling (SLM).
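As a brief, hedged sketch of the core idea (using the a_t notation that also appears in the related-work discussion below): the SLM factorizes the probability of the program's abstract syntax tree into per-node conditionals, each conditioned on the partially generated tree:

```latex
P(\mathcal{A}) \;=\; \prod_{t} P\big(a_t \mid a_{<t}\big)
% a_t: the node generated at step t;  a_{<t}: all previously generated nodes
% (the exact conditioning, via partial AST paths, is described in the full paper)
```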
Introduction
- Code completion is the problem of generating code given its surrounding code as context.
- The authors introduce the task of any-code completion – generating code in a general-purpose programming language without any restriction on its vocabulary or structure (a toy illustration of this setup follows the list below).
- Any-code completion generalizes the restricted completion task of Brockschmidt et al. (2019), in which the target code contained only primitive types and excluded user-defined functions.
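As the toy illustration referenced above (the encoding below is our own, not the paper's data format): the model is given a program with one subtree removed and must generate that subtree, which may be an arbitrarily large expression or statement.

```python
# Hypothetical encoding of one any-code completion example (illustrative only).
# The context is a Java method with a hole; the target is a full AST subtree,
# not a single token, and its vocabulary and structure are unrestricted.

context_java = """
int max(int a, int b) {
    if (??) {        // "??" marks the missing code
        return a;
    }
    return b;
}
"""

# Ground-truth completion as a (label, [children]) subtree, i.e. the expression `a > b`:
target_subtree = ("Greater", [("Name", [("a", [])]), ("Name", [("b", [])])])
```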
Highlights
- Code completion is the problem of generating code given its surrounding code as context
- Any-Code Completion (Java): Table 1 shows that our structural language model (SLM) achieves acc@1 and acc@5 scores that are over 1.1% and 0.78% higher, respectively, than the two strongest baselines
- Since our SLM performs better than Paths→Paths, this ablation shows the importance of jointly modeling the context and the target subtree through parameter tying (a minimal sketch of this tying appears after this list)
- We presented a novel approach for any-code completion: joint modeling of an abstract syntax tree and its missing subtree using a structural language model
- Our model outperforms a variety of strong baselines, including programming language-oriented models and strong NMT models applied in our settings
- We believe that structural language modeling enables a wide range of future applications, similarly to how language modeling research has contributed to NLP in recent years
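A minimal sketch of the parameter tying referenced above, under the assumption of a standard LSTM path encoder (our own illustrative code, not the released implementation): the same encoder weights are shared between the context paths and the root path of the node being expanded, which is one way to realize joint modeling of context and target.

```python
import torch
import torch.nn as nn

class TiedPathEncoder(nn.Module):
    """Encodes an AST path (a sequence of node ids) with a single LSTM.

    The same instance, and therefore the same parameters, is applied both to
    context paths and to the root path of the node currently being expanded.
    Illustrative only; hyperparameters and details differ from the paper.
    """
    def __init__(self, node_vocab_size: int, dim: int = 128):
        super().__init__()
        self.node_embed = nn.Embedding(node_vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, path_node_ids: torch.Tensor) -> torch.Tensor:
        # path_node_ids: (batch, path_len) -> (batch, dim) final hidden state
        _, (h, _) = self.lstm(self.node_embed(path_node_ids))
        return h[-1]

encoder = TiedPathEncoder(node_vocab_size=1000)
context_reprs = encoder(torch.randint(0, 1000, (32, 8)))  # 32 context paths of length 8
root_repr     = encoder(torch.randint(0, 1000, (1, 5)))   # root path of the current node
```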
Results
- Any-Code Completion (Java): Table 1 shows that the SLM achieves acc@1 and acc@5 scores that are over 1.1% and 0.78% higher, respectively, than the two strongest baselines.
- Each of Paths→Paths and the seq2seq baselines (Table 1) performs better than Paths→Seq and Seq→Path; this shows the importance of using the same type of encoder and decoder for any-code completion, rather than combining “an optimal encoder” with “an optimal decoder”.
- This shows that dynamically attending to the context paths given the current root path is crucial
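A rough sketch of what “dynamically attending to the context paths given the current root path” could look like, assuming plain dot-product attention (the paper's exact scoring function may differ):

```python
import torch
import torch.nn.functional as F

def attend_to_context(root_path_repr: torch.Tensor,
                      context_path_reprs: torch.Tensor) -> torch.Tensor:
    """Dot-product attention: the encoded root path of the node being expanded
    is the query; the encoded context paths serve as keys and values.

    root_path_repr:     (dim,)
    context_path_reprs: (num_paths, dim)
    returns:            (dim,) context summary used when predicting the next node
    """
    scores = context_path_reprs @ root_path_repr   # (num_paths,)
    weights = F.softmax(scores, dim=0)             # attention distribution over paths
    return weights @ context_path_reprs            # weighted sum of path encodings
```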
Conclusion
- The authors presented a novel approach for any-code completion: joint modeling of an AST and its missing subtree using a structural language model.
- The authors' approach has a variety of direct applications such as code completion, detecting and fixing unlikely existing code, and re-ranking solutions produced by another synthesizer or solver (see the small re-ranking sketch after this list).
- To these ends, the authors make all the code, datasets, and trained models publicly available
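For the re-ranking application mentioned above, candidates from another synthesizer can simply be ordered by model score; a hypothetical helper, where `slm_log_prob` is a stand-in for log P(subtree | context) under a trained SLM and not a real API:

```python
def rerank(candidates, slm_log_prob):
    """Order candidate completions by SLM score, best first.

    `candidates` is a list of candidate subtrees produced by another tool;
    `slm_log_prob` is a hypothetical callable returning log P(subtree | context).
    """
    return sorted(candidates, key=slm_log_prob, reverse=True)
```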
Tables
- Table 1: Results on any-code completion in Java
- Table 2: Results on restricted completion in C#
- Table 3: Ablations on any-code completion in Java
Related work
- Generalizing Previous Approaches: Our approach frames code generation as predicting the next node in all partial AST paths. This simple framing generalizes most previous work, without hand-crafted edges and special actions (a toy path-extraction sketch follows this list):
• Models that use information about ancestor nodes only (Rabinovich et al., 2017), as well as the “Parent Feeding” of Yin & Neubig (2017), are generalized by our model, since all paths that go into a node a_t pass through its parent, and the path from the root is the attention query.
• The “previous action encoding” of Yin & Neubig (2017) is also a special case of our approach, because S_t contains the paths starting from the previously expanded leaves of A_p into the currently expanded node π(a_t), such as path_3 in Figure 2(e).
• The “context node” of PHOG (Bielik et al., 2016) is just one of the previously traversed leaf nodes in the partial tree.
• Allamanis et al. (2018) further define data-flow and control-flow graph edges such as “ComputedFrom” and “GuardedByNegation”. Most of these relations can be expressed as partial AST paths without manually designing them.
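To make “predicting the next node in all partial AST paths” concrete, here is a hedged toy sketch of collecting the leaf-to-target paths; the (label, [children]) tree encoding and helper names are ours, and unlike the paper we do not record the up/down direction of each step:

```python
# Toy AST: each node is a (label, [children]) pair; leaves have no children.

def node_path(root, goal):
    """Root-to-goal list of nodes, or None if goal is not in this subtree."""
    if root is goal:
        return [root]
    for child in root[1]:
        rest = node_path(child, goal)
        if rest is not None:
            return [root] + rest
    return None

def leaves(node):
    """Yield every leaf of the tree."""
    if not node[1]:
        yield node
    else:
        for child in node[1]:
            yield from leaves(child)

def partial_ast_paths(root, target):
    """All label paths from already-generated leaves up to the node being expanded."""
    to_target = node_path(root, target)
    paths = []
    for leaf in leaves(root):
        if leaf is target:
            continue
        to_leaf = node_path(root, leaf)
        # Walk down from the root to find the lowest common ancestor of leaf and target.
        lca = 0
        while (lca + 1 < min(len(to_leaf), len(to_target))
               and to_leaf[lca + 1] is to_target[lca + 1]):
            lca += 1
        up = [n[0] for n in reversed(to_leaf[lca:])]   # leaf ... LCA
        down = [n[0] for n in to_target[lca + 1:]]     # child of LCA ... target
        paths.append(up + down)
    return paths

# Example: the partial tree for `x > <HOLE>`, with the right operand still missing.
hole = ("<HOLE>", [])
tree = ("Greater", [("Name", [("x", [])]), hole])
print(partial_ast_paths(tree, hole))   # [['x', 'Name', 'Greater', '<HOLE>']]
```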
Reference
- Aharoni, R. and Goldberg, Y. Towards string-to-tree neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 132–140, 2017.
- Allamanis, M. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153. ACM, 2019.
- Allamanis, M., Tarlow, D., Gordon, A., and Wei, Y. Bimodal modelling of source code and natural language. In International conference on machine learning, pp. 2123–2132, 2015.
- Allamanis, M., Peng, H., and Sutton, C. A convolutional attention network for extreme summarization of source code. In International conference on machine learning, pp. 2091–2100, 2016.
- Allamanis, M., Brockschmidt, M., and Khademi, M. Learning to represent programs with graphs. In International Conference on Learning Representations, 2018.
- Alon, U., Zilberstein, M., Levy, O., and Yahav, E. A general path-based representation for predicting program properties. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 404–419, 2018.
- Alon, U., Brody, S., Levy, O., and Yahav, E. code2seq: Generating sequences from structured representations of code. In International Conference on Learning Representations, 2019a.
- Alon, U., Zilberstein, M., Levy, O., and Yahav, E. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3 (POPL):1–29, 2019b.
- Amodio, M., Chaudhuri, S., and Reps, T. Neural attribute machines for program generation. arXiv preprint arXiv:1705.09231, 2017.
- Balog, M., Gaunt, A. L., Brockschmidt, M., Nowozin, S., and Tarlow, D. Deepcoder: Learning to write programs. In International Conference on Learning Representations, 2017.
- Bielik, P., Raychev, V., and Vechev, M. Phog: probabilistic model for code. In International Conference on Machine Learning, pp. 2933–2942, 2016.
- Brockschmidt, M., Allamanis, M., Gaunt, A. L., and Polozov, O. Generative code modeling with graphs. In International Conference on Learning Representations, 2019.
- Brody, S., Alon, U., and Yahav, E. Neural edit completion. arXiv preprint arXiv:2005.13209, 2020.
- Chen, X., Liu, C., and Song, D. Tree-to-tree neural networks for program translation. In Advances in Neural Information Processing Systems, pp. 2547–2557, 2018.
- Cvitkovic, M., Singh, B., and Anandkumar, A. Open vocabulary learning on source code with a graph-structured cache. In International Conference on Machine Learning, pp. 1475–1485, 2019.
- Devlin, J., Uesato, J., Bhupatiraju, S., Singh, R., Mohamed, A.-r., and Kohli, P. Robustfill: Neural program learning under noisy i/o. In International Conference on Machine Learning, pp. 990–998, 2017.
- Dong, L. and Lapata, M. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 731–742, 2018.
- Ellis, K., Nye, M., Pu, Y., Sosa, F., Tenenbaum, J., and Solar-Lezama, A. Write, execute, assess: Program synthesis with a repl. In Advances in Neural Information Processing Systems, pp. 9169–9178, 2019.
- Fernandes, P., Allamanis, M., and Brockschmidt, M. Structured neural summarization. In International Conference on Learning Representations, 2019.
- Gaunt, A. L., Brockschmidt, M., Kushman, N., and Tarlow, D. Differentiable programs with neural libraries. In International Conference on Machine Learning, pp. 1213–1222, 2017.
- Gu, J., Lu, Z., Li, H., and Li, V. O. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1631–1640, 2016.
- Gulwani, S. Automating string processing in spreadsheets using input-output examples. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp. 317–330, 2011.
- Gulwani, S., Polozov, O., Singh, R., et al. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2):1–119, 2017.
- Iyer, S., Konstas, I., Cheung, A., and Zettlemoyer, L. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1643–1652, 2018.
- Iyer, S., Cheung, A., and Zettlemoyer, L. Learning programmatic idioms for scalable semantic parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5429–5438, 2019.
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A. M. Opennmt: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pp. 67–72, 2017.
- Kulal, S., Pasupat, P., Chandra, K., Lee, M., Padon, O., Aiken, A., and Liang, P. S. Spoc: Search-based pseudocode to code. In Advances in Neural Information Processing Systems, pp. 11906–11917, 2019.
- Ling, W., Blunsom, P., Grefenstette, E., Hermann, K. M., Kocisky, T., Wang, F., and Senior, A. Latent predictor networks for code generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 599–609, 2016.
- Luong, M.-T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421, 2015.
- Maddison, C. and Tarlow, D. Structured generative models of natural source code. In International Conference on Machine Learning, pp. 649–657, 2014.
- Murali, V., Qi, L., Chaudhuri, S., and Jermaine, C. Neural sketch learning for conditional program generation. In International Conference on Learning Representations, 2018.
- Oda, Y., Fudaba, H., Neubig, G., Hata, H., Sakti, S., Toda, T., and Nakamura, S. Learning to generate pseudo-code from source code using statistical machine translation. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering, pp. 574–584, 2015.
- Parisotto, E., Mohamed, A.-r., Singh, R., Li, L., Zhou, D., and Kohli, P. Neuro-symbolic program synthesis. In International Conference on Learning Representations, 2017.
- Pnueli, A. and Rosner, R. On the synthesis of a reactive module. In Proceedings of the 16th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp. 179–190. ACM, 1989.
- Polozov, O. and Gulwani, S. Flashmeta: a framework for inductive program synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 107–126, 2015.
- Rabinovich, M., Stern, M., and Klein, D. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1139–1149, 2017.
- Raychev, V., Bielik, P., Vechev, M., and Krause, A. Learning programs from noisy data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 761–774, 2016.
- Si, X., Yang, Y., Dai, H., Naik, M., and Song, L. Learning a meta-solver for syntax-guided program synthesis. In International Conference on Learning Representations, 2019.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010, 2017.
- Waldinger, R. J. and Lee, R. C. Prow: A step toward automatic program writing. In Proceedings of the 1st international joint conference on Artificial intelligence, pp. 241–252, 1969.
- Xiao, C., Dymetman, M., and Gardent, C. Sequence-based structured prediction for semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1341–1350, 2016.
- Yin, P. and Neubig, G. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 440–450, 2017.
- Yin, P., Neubig, G., Allamanis, M., Brockschmidt, M., and Gaunt, A. L. Learning to represent edits. In International Conference on Learning Representations, 2019.
- Young, H., Bastani, O., and Naik, M. Learning neurosymbolic generative models via program synthesis. In International Conference on Machine Learning, pp. 7144–7153, 2019.
- Yu, T., Li, Z., Zhang, Z., Zhang, R., and Radev, D. Typesql: Knowledge-based type-aware neural text-to-sql generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 588–594, 2018.
- Zhao, R., Bieber, D., Swersky, K., and Tarlow, D. Neural networks for modeling source code edits. arXiv preprint arXiv:1904.02818, 2019.