Structural Language Models of Code

Uri Alon
Roy Sadaka

ICML, pp. 245-256, 2020.

Abstract:

We address the problem of any-code completion – generating a missing piece of source code in a given program without any restriction on the vocabulary or structure. We introduce a new approach to any-code completion that leverages the strict syntax of programming languages to model a code snippet as a tree – structural language modeling (SLM).
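The central idea, estimating the probability of a program's AST as a product of per-node conditional probabilities, can be illustrated with a short sketch. The sketch below is not the authors' implementation: it uses Python's ast module and a placeholder node_prob function that stands in for the learned model, which in the paper conditions each prediction on the partial AST paths leading into the node being expanded.

```python
import ast
import math

def preorder(node):
    """Depth-first, left-to-right traversal, mirroring top-down tree generation."""
    yield node
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)

def slm_log_prob(code, node_prob):
    """Score a snippet as a sum of per-node conditional log-probabilities.

    node_prob(node, generated) is a hypothetical stand-in for the learned
    model p(node | partial tree generated so far); the real SLM conditions
    each prediction on the partial AST paths that end at the node.
    """
    log_p = 0.0
    generated = []  # nodes produced so far (the partial tree)
    for node in preorder(ast.parse(code)):
        log_p += math.log(node_prob(node, generated))
        generated.append(node)
    return log_p

# Toy usage: a uniform "model" over an assumed vocabulary of 100 node kinds.
print(slm_log_prob("x = max(a, b)", lambda node, ctx: 1.0 / 100))
```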

Introduction
  • Code completion is the problem of generating code given its surrounding code as context.
  • The authors introduce the task of any-code completion – generating code in a general-purpose programming language without any restriction on its vocabulary or structure.
  • Any-code completion generalizes the restricted completion task of Brockschmidt et al. (2019), in which the target code contained only primitive types and excluded user-defined functions.
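To make the task concrete, the snippet below marks one subtree (the loop's condition) as the completion target: given the rest of the method as context, the model must generate that entire subtree, which may reference user-defined identifiers. The paper's benchmarks use Java and C#; this is an illustrative Python analogue written for this summary, not an example from the paper's datasets.

```python
def latest(items):
    best = None
    for item in items:
        # Any-code completion target: the entire condition below. It contains
        # a call to the user-defined method item.timestamp(), which restricted
        # completion settings (primitive types and operators only) exclude.
        if best is None or item.timestamp() > best.timestamp():
            best = item
    return best
```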
Highlights
  • Code completion is the problem of generating code given its surrounding code as context.
  • On any-code completion in Java, Table 1 shows that our structural language model achieves acc@1 and acc@5 scores that are over 1.1% and 0.78% higher, respectively, than the two strongest baselines.
  • As our structural language model performs better than the Paths→Paths ablation, this shows the importance of jointly modeling the context and the target subtree through parameter tying.
  • We presented a novel approach for any-code completion: joint modeling of an abstract syntax tree and its missing subtree using a structural language model.
  • Our model outperforms a variety of strong baselines, including programming-language-oriented models and strong NMT models applied to our setting.
  • We believe that structural language modeling enables a wide range of future applications, similarly to how language modeling research has contributed to NLP in recent years.
Results
  • On any-code completion in Java, Table 1 shows that the SLM achieves acc@1 and acc@5 scores that are over 1.1% and 0.78% higher, respectively, than the two strongest baselines.
  • Each of Paths→Paths and the seq2seq baselines (Table 1) performs better than Paths→Seq and Seq→Path; this shows the importance of using the same type of encoder and decoder for any-code completion, rather than combining “an optimal encoder” with “an optimal decoder”.
  • This shows that dynamically attending to the context paths given the current root path is crucial.
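Here the root path is the path from the AST root down to the node currently being expanded, and its encoding acts as the query when attending over the encoded context paths. The following is a minimal sketch of that attention step; the plain dot-product scoring and the array shapes are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def attend_to_context(root_path_vec, context_path_vecs):
    """Attention step in which the current root path queries the context paths.

    root_path_vec:      shape (d,),  encoding of the path root -> expanded node
    context_path_vecs:  shape (n, d), encodings of the n context paths
    Returns a (d,) summary that is re-computed for every expanded node, so the
    context is weighted dynamically given the current root path.
    """
    scores = context_path_vecs @ root_path_vec        # (n,) similarity scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over context paths
    return weights @ context_path_vecs                # weighted sum of path vectors
```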
Conclusion
  • The authors presented a novel approach for any-code completion: joint modeling of an AST and its missing subtree using a structural language model.
  • The authors' approach has a variety of direct applications such as code completion, detecting and fixing unlikely existing code, and re-ranking solutions produced by another synthesizer or solver.
  • To these ends, the authors make all the code, datasets, and trained models publicly available.
Tables
  • Table 1: Results on any-code completion in Java
  • Table 2: Results on restricted completion in C#
  • Table 3: Ablations on any-code completion in Java
Related work
  • Generalizing Previous Approaches: Our approach frames code generation as predicting the next node in all partial AST paths. This simple framing generalizes most previous work, without hand-crafted edges and special actions:

    • Models that use information about ancestor nodes only (Rabinovich et al., 2017), as well as the “Parent Feeding” of Yin & Neubig (2017), are generalized by our model, since all paths that go into a node a_t pass through its parent, and the path from the root is the attention query.
    • The “previous action encoding” of Yin & Neubig (2017) is also a special case of our approach, because S_t contains the paths starting from the previously expanded leaves of A_p into the currently expanded node π(a_t), such as path3 in Figure 2(e).
    • The “context node” of PHOG (Bielik et al., 2016) is just one of the previously-traversed leaf nodes in a_{<t}.
    • Allamanis et al. (2018) further define data-flow and control-flow graph edges such as “ComputedFrom” and “GuardedByNegation”. Most of these relations can be expressed as partial AST paths without manually designing them.
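As a rough illustration of what a partial AST path is, the helper below collects the chain of node types from the root down to a target node using Python's ast module. The paper operates on Java and C# ASTs, encodes each path with an LSTM, and also uses paths that start at previously generated leaves; none of that is reproduced here.

```python
import ast

def root_path(tree, target):
    """Return the node-type path from the AST root down to `target`."""
    def dfs(node, path):
        path = path + [type(node).__name__]
        if node is target:
            return path
        for child in ast.iter_child_nodes(node):
            found = dfs(child, path)
            if found is not None:
                return found
        return None
    return dfs(tree, [])

tree = ast.parse("x = max(a, b)")
call = tree.body[0].value                  # the max(a, b) Call node
print(root_path(tree, call))               # ['Module', 'Assign', 'Call']
```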
Reference
  • Aharoni, R. and Goldberg, Y. Towards string-to-tree neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 132–140, 2017.
  • Allamanis, M. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153. ACM, 2019.
  • Allamanis, M., Tarlow, D., Gordon, A., and Wei, Y. Bimodal modelling of source code and natural language. In International conference on machine learning, pp. 2123–2132, 2015.
  • Allamanis, M., Peng, H., and Sutton, C. A convolutional attention network for extreme summarization of source code. In International conference on machine learning, pp. 2091–2100, 2016.
  • Allamanis, M., Brockschmidt, M., and Khademi, M. Learning to represent programs with graphs. In International Conference on Learning Representations, 2018.
  • Alon, U., Zilberstein, M., Levy, O., and Yahav, E. A general path-based representation for predicting program properties. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 404–419, 2018.
  • Alon, U., Brody, S., Levy, O., and Yahav, E. code2seq: Generating sequences from structured representations of code. In International Conference on Learning Representations, 2019a.
  • Alon, U., Zilberstein, M., Levy, O., and Yahav, E. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3 (POPL):1–29, 2019b.
  • Amodio, M., Chaudhuri, S., and Reps, T. Neural attribute machines for program generation. arXiv preprint arXiv:1705.09231, 2017.
  • Balog, M., Gaunt, A. L., Brockschmidt, M., Nowozin, S., and Tarlow, D. Deepcoder: Learning to write programs. In International Conference on Learning Representations, 2017.
  • Bielik, P., Raychev, V., and Vechev, M. Phog: probabilistic model for code. In International Conference on Machine Learning, pp. 2933–2942, 2016.
  • Brockschmidt, M., Allamanis, M., Gaunt, A. L., and Polozov, O. Generative code modeling with graphs. In International Conference on Learning Representations, 2019.
  • Brody, S., Alon, U., and Yahav, E. Neural edit completion. arXiv preprint arXiv:2005.13209, 2020.
  • Chen, X., Liu, C., and Song, D. Tree-to-tree neural networks for program translation. In Advances in Neural Information Processing Systems, pp. 2547–2557, 2018.
  • Cvitkovic, M., Singh, B., and Anandkumar, A. Open vocabulary learning on source code with a graph-structured cache. In International Conference on Machine Learning, pp. 1475–1485, 2019.
  • Devlin, J., Uesato, J., Bhupatiraju, S., Singh, R., Mohamed, A.-r., and Kohli, P. Robustfill: Neural program learning under noisy i/o. In International Conference on Machine Learning, pp. 990–998, 2017.
  • Dong, L. and Lapata, M. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 731–742, 2018.
  • Ellis, K., Nye, M., Pu, Y., Sosa, F., Tenenbaum, J., and Solar-Lezama, A. Write, execute, assess: Program synthesis with a repl. In Advances in Neural Information Processing Systems, pp. 9169–9178, 2019.
  • Fernandes, P., Allamanis, M., and Brockschmidt, M. Structured neural summarization. In International Conference on Learning Representations, 2019.
  • Gaunt, A. L., Brockschmidt, M., Kushman, N., and Tarlow, D. Differentiable programs with neural libraries. In International Conference on Machine Learning, pp. 1213–1222, 2017.
  • Gu, J., Lu, Z., Li, H., and Li, V. O. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1631–1640, 2016.
  • Gulwani, S. Automating string processing in spreadsheets using input-output examples. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp. 317–330, 2011.
  • Gulwani, S., Polozov, O., Singh, R., et al. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2):1–119, 2017.
  • Iyer, S., Konstas, I., Cheung, A., and Zettlemoyer, L. Mapping language to code in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1643–1652, 2018.
  • Iyer, S., Cheung, A., and Zettlemoyer, L. Learning programmatic idioms for scalable semantic parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5429–5438, 2019.
  • Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A. M. Opennmt: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pp. 67–72, 2017.
  • Kulal, S., Pasupat, P., Chandra, K., Lee, M., Padon, O., Aiken, A., and Liang, P. S. Spoc: Search-based pseudocode to code. In Advances in Neural Information Processing Systems, pp. 11906–11917, 2019.
  • Ling, W., Blunsom, P., Grefenstette, E., Hermann, K. M., Kocisky, T., Wang, F., and Senior, A. Latent predictor networks for code generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 599– 609, 2016.
  • Luong, M.-T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412– 1421, 2015.
  • Maddison, C. and Tarlow, D. Structured generative models of natural source code. In International Conference on Machine Learning, pp. 649–657, 2014.
  • Murali, V., Qi, L., Chaudhuri, S., and Jermaine, C. Neural sketch learning for conditional program generation. In International Conference on Learning Representations, 2018.
  • Oda, Y., Fudaba, H., Neubig, G., Hata, H., Sakti, S., Toda, T., and Nakamura, S. Learning to generate pseudo-code from source code using statistical machine translation. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering, pp. 574– 584, 2015.
  • Parisotto, E., Mohamed, A.-r., Singh, R., Li, L., Zhou, D., and Kohli, P. Neuro-symbolic program synthesis. In International Conference on Learning Representations, 2017.
  • Pnueli, A. and Rosner, R. On the synthesis of a reactive module. In Proceedings of the 16th ACM SIGPLANSIGACT symposium on Principles of programming languages, pp. 179–190. ACM, 1989.
  • Polozov, O. and Gulwani, S. Flashmeta: a framework for inductive program synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on ObjectOriented Programming, Systems, Languages, and Applications, pp. 107–126, 2015.
  • Rabinovich, M., Stern, M., and Klein, D. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1139–1149, 2017.
  • Raychev, V., Bielik, P., Vechev, M., and Krause, A. Learning programs from noisy data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 761–774, 2016.
  • Si, X., Yang, Y., Dai, H., Naik, M., and Song, L. Learning a meta-solver for syntax-guided program synthesis. In International Conference on Learning Representations, 2019.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010, 2017.
  • Waldinger, R. J. and Lee, R. C. Prow: A step toward automatic program writing. In Proceedings of the 1st international joint conference on Artificial intelligence, pp. 241–252, 1969.
  • Xiao, C., Dymetman, M., and Gardent, C. Sequence-based structured prediction for semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1341–1350, 2016.
  • Yin, P. and Neubig, G. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 440–450, 2017.
  • Yin, P., Neubig, G., Allamanis, M., Brockschmidt, M., and Gaunt, A. L. Learning to represent edits. In International Conference on Learning Representations, 2019.
  • Young, H., Bastani, O., and Naik, M. Learning neurosymbolic generative models via program synthesis. In International Conference on Machine Learning, pp. 7144– 7153, 2019.
  • Yu, T., Li, Z., Zhang, Z., Zhang, R., and Radev, D. Typesql: Knowledge-based type-aware neural text-to-sql generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 588–594, 2018.
  • Zhao, R., Bieber, D., Swersky, K., and Tarlow, D. Neural networks for modeling source code edits. arXiv preprint arXiv:1904.02818, 2019.