Zero Shot Learning for Code Education: Rubric Sampling with Deep Learning Inference

AAAI Conference on Artificial Intelligence, 2019.


Abstract:

In modern computer science education, massive open online courses (MOOCs) log thousands of hours of data about how students solve coding challenges. Being so rich in data, these platforms have garnered the interest of the machine learning community, with many new algorithms attempting to autonomously provide feedback to help future students learn. […]

Introduction
  • The need for high quality education at scale poses a difficult challenge. The price of education per student is growing faster than economy-wide costs (Bowen 2012), limiting the resources available to support student learning.
  • When considering the rising need to provide adult retraining, the gap between the demand for education and the ability to provide it is especially large.
  • MOOCs largely ignore an important ingredient for learning: high quality feedback.
  • The clear societal need, alongside massive amounts of data, has led to a machine learning grand challenge: learn how to provide feedback for education at scale, especially in computer science due to its apparent structure and high demand.
Highlights
  • The need for high quality education at scale poses a difficult challenge
  • In P1, using rubric sampling increases the F1 score by 0.31 in the body and 0.13 in the tail of the Zipf distribution of programs
  • We find that combining the MVAE with rubric sampling boosts the F1 by an additional 0.2, reaching 94% accuracy in P1 and 95% in P8
  • We introduce the zero-shot feedback challenge
  • On a widely used platform, we show that rubric sampling far surpasses the SOTA
Methods
  • The authors consider learning tasks given a dataset of $n$ labeled examples, where each example has an input string $x_i$ and a target output vector $y_i = [y_{i,1}, \ldots, y_{i,l}]$ composed of $l$ independent binary labels (see the sketch after this list).
  • In this context, the authors assume each string represents a block-based program in Lisp-like notation.
  • The authors evaluate on the entire set $D$
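Rubric sampling, the paper's core idea, replaces human-labeled data with synthetic (program, label-vector) pairs drawn from an instructor-written probabilistic grammar whose stochastic choices correspond to rubric items. Below is a minimal sketch of this setup; the toy "draw a square" exercise, rubric items, and mistake probabilities are illustrative assumptions, not the paper's actual grammar.

```python
import random

def sample_program(rng: random.Random):
    """Sample one synthetic (program, label-vector) pair from a toy rubric.

    Each stochastic choice models a rubric item; taking the buggy
    production turns the matching feedback label on. The exercise,
    productions, and probabilities are hypothetical stand-ins.
    """
    labels = {"wrong-loop-count": 0, "wrong-angle": 0}

    # Rubric item 1: a square needs 4 repetitions.
    count = 4
    if rng.random() < 0.3:          # assumed probability of this mistake
        count = rng.choice([3, 5])
        labels["wrong-loop-count"] = 1

    # Rubric item 2: a square needs 90-degree turns.
    angle = 90
    if rng.random() < 0.25:         # assumed probability of this mistake
        angle = rng.choice([45, 60, 120])
        labels["wrong-angle"] = 1

    # Emit the program in the Lisp-like block notation assumed above.
    program = f"(repeat {count} (forward 50) (turn {angle}))"
    return program, [labels["wrong-loop-count"], labels["wrong-angle"]]

rng = random.Random(0)
synthetic_data = [sample_program(rng) for _ in range(10000)]
```

A supervised classifier fit on `synthetic_data` can then label real student submissions it has never seen, which is what makes the setup zero-shot with respect to human annotations.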
Results
  • Recreation of human labels: Figure 4 reports a set of F1 scores (computed as sketched below), including human-level performance estimated from annotations.
  • The authors find that combining the MVAE with rubric sampling boosts the F1 by an additional 0.2, reaching 94% accuracy in P1 and 95% in P8.
  • With these scores, the authors are reasonably confident that, even though a new student will likely submit a never-before-seen program, the system will provide good feedback
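For reference, the F1 scores above are per-label harmonic means of precision and recall over multi-label feedback vectors. A minimal way to reproduce that computation with scikit-learn, using made-up toy predictions rather than the paper's data:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy ground-truth and predicted feedback vectors over 3 rubric labels
# (rows = student programs); these numbers are illustrative only.
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]])

print(f1_score(y_true, y_pred, average=None))     # one F1 per rubric label
print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean over labels
```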
Conclusion
  • The authors get closer to human-level performance than the previous SOTA: every rubric sampling model beats the SOTA by at least 0.46 F1 in P1 and 0.24 in P8, nearly tripling the F1 score in P1 and doubling it in P8.
  • On a widely used platform, the authors show that rubric sampling far surpasses the SOTA
  • The authors combine this with a generative model to cluster students, highlight code, and incorporate historical data.
  • This approach can scale feedback for real-world use
Tables
  • Table 1: Amount of correct feedback over the curriculum. We ignore programs in the head of the Zipf distribution, as those can be manually labeled (a sketch of the head/body/tail split follows). With the best model, we could have provided 126,000 additional points of feedback
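The head/body/tail terminology refers to the Zipf-like frequency distribution of student programs: a handful of programs are submitted very often, while most appear only once. A minimal sketch of such a split; the cutoffs (a top-k head and a singleton tail) are assumptions for illustration, not the paper's exact definition.

```python
from collections import Counter

def split_by_zipf(submissions, head_size=20):
    """Split program strings into head/body/tail by submission frequency."""
    counts = Counter(submissions)
    ranked = [prog for prog, _ in counts.most_common()]
    head = set(ranked[:head_size])           # frequent enough to hand-label
    tail = {prog for prog, c in counts.items()
            if c == 1 and prog not in head}  # submitted exactly once
    body = set(ranked) - head - tail         # everything in between
    return head, body, tail
```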
Related work
  • Education Feedback: If you were to solve an assignment on Code.org today, the hints you would be given are generated by a unit test system combined with static analysis of the student's solution. Improving on these hints has been a widely reported social-good objective (Price and Barnes 2017), especially since the state of the art is far from ideal (O'Rourke, Ballweber, and Popović 2014). Achieving this goal has proved hard. Previous research on a more basic set of Code.org challenges (the "Hour of Code") has scratched the surface of providing feedback at scale. Early work found that latent patterns in how students solve programming assignments carry signal about how a student should proceed (Piech et al. 2015c). Applying a neural network improved feedback prediction (Piech et al. 2015a), but the models (1) fell far short of human accuracy, (2) could not explain their predictions, and (3) required massive amounts of data. The current state of the art combines these ideas and provides some improvements (Wang et al. 2017a). In this paper we propose a method that uses less data, approaches human accuracy, and works on more complex Code.org assignments by diverging from the classic supervised framework. Research on feedback for even more complex assignments, such as medical education (Geigle, Zhai, and Ferguson 2016) and natural language questions (Bulgarov and Nielsen 2018), has also relied on data-hungry supervised learning and would perhaps benefit from a rubric-sampling-inspired approach.
Funding
  • MW is supported by NSF GRFP
  • NDG is supported by DARPA PPAML under FA8750-14-2-0006
References
  • Bowen, W. G. 2012. The cost disease in higher education: Is technology the answer? The Tanner Lectures, Stanford University.
  • Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A. M.; Jozefowicz, R.; and Bengio, S. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
  • Brown, J. S., and VanLehn, K. 1980. Repair theory: A generative theory of bugs in procedural skills. Cognitive Science 4(4):379–426.
  • Bulgarov, F. A., and Nielsen, R. 2018. Proposition entailment in educational applications using deep neural networks. In AAAI.
  • Feldman, M. Q.; Cho, J. Y.; Ong, M.; Gulwani, S.; Popović, Z.; and Andersen, E. 2018. Automatic diagnosis of students' misconceptions in K-8 mathematics. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 264. ACM.
  • Geigle, C.; Zhai, C.; and Ferguson, D. C. 2016. An exploration of automated grading of complex assignments. In Proceedings of the Third (2016) ACM Conference on Learning@Scale, 351–360. ACM.
  • Havlin, S. 1995. The distance between Zipf plots. Physica A: Statistical Mechanics and its Applications 216(1-2):148–150.
  • Klein, D., and Manning, C. D. 2003. A* parsing: Fast exact Viterbi parse selection. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, 40–47. Association for Computational Linguistics.
  • Koedinger, K. R.; Matsuda, N.; MacLellan, C. J.; and McLaughlin, E. A. 2015. Methods for evaluating simulated learners: Examples from SimStudent. In AIED Workshops.
  • Lake, B. M.; Salakhutdinov, R.; and Tenenbaum, J. B. 2015. Human-level concept learning through probabilistic program induction. Science 350(6266):1332–1338.
  • Lee, C.-W.; Fang, W.; Yeh, C.-K.; and Wang, Y.-C. F. 2017. Multi-label zero-shot learning with structured knowledge graphs. arXiv preprint arXiv:1711.06526.
  • Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.
  • Nguyen, A.; Piech, C.; Huang, J.; and Guibas, L. 2014. Codewebs: Scalable homework search for massive open online programming courses. In Proceedings of the 23rd International Conference on World Wide Web, 491–502. ACM.
  • O'Rourke, E.; Ballweber, C.; and Popović, Z. 2014. Hint systems may negatively impact performance in educational games. In Proceedings of the First ACM Conference on Learning@Scale, 51–60. ACM.
  • Paaßen, B.; Hammer, B.; Price, T. W.; Barnes, T.; Gross, S.; and Pinkwart, N. 2017. The continuous hint factory: Providing hints in vast and sparsely populated edit distance spaces. arXiv preprint arXiv:1708.06564.
  • Piech, C.; Huang, J.; Nguyen, A.; Phulsuksombati, M.; Sahami, M.; and Guibas, L. 2015a. Learning program embeddings to propagate feedback on student code. In Proceedings of the 32nd International Conference on Machine Learning.
  • Piech, C.; Huang, J.; Nguyen, A.; Phulsuksombati, M.; Sahami, M.; and Guibas, L. 2015b. Learning program embeddings to propagate feedback on student code. arXiv preprint arXiv:1505.05969.
  • Piech, C.; Sahami, M.; Huang, J.; and Guibas, L. 2015c. Autonomously generating hints by inferring problem solving policies. In Proceedings of the Second (2015) ACM Conference on Learning@Scale, 195–204. ACM.
  • Price, T. W., and Barnes, T. 2017. Position paper: Block-based programming should offer intelligent support for learners. In 2017 IEEE Blocks and Beyond Workshop (B&B), 65–68. IEEE.
  • Salimans, T.; Ho, J.; Chen, X.; Sidor, S.; and Sutskever, I. 2017. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
  • Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 618–626.
  • Vedantam, R.; Fischer, I.; Huang, J.; and Murphy, K. 2017. Generative models of visually grounded imagination. arXiv preprint arXiv:1705.10762.
  • Verma, V. K.; Arora, G.; Mishra, A.; and Rai, P. 2018. Generalized zero-shot learning via synthesized examples. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Wang, L.; Sy, A.; Liu, L.; and Piech, C. 2017a. Learning to represent student knowledge on programming exercises using deep learning. In Proceedings of the 10th International Conference on Educational Data Mining, Wuhan, China, 324–329.
  • Wang, W.; Pu, Y.; Verma, V. K.; Fan, K.; Zhang, Y.; Chen, C.; Rai, P.; and Carin, L. 2017b. Zero-shot learning via class-conditioned deep generative models. arXiv preprint arXiv:1711.05820.
  • Wu, M., and Goodman, N. 2018. Multimodal generative models for scalable weakly-supervised learning. arXiv preprint arXiv:1802.05335.
  • Xian, Y.; Lorenz, T.; Schiele, B.; and Akata, Z. 2018. Feature generating networks for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Best Paper
Best Paper of AAAI, 2019