Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning

NeurIPS 2020.

Abstract:

Humans have an inherent ability to learn novel concepts from only a few samples and generalize these concepts to different situations. Even though today's machine learning models excel with a plethora of training data on standard recognition tasks, a considerable gap exists between machine-level pattern recognition and human-level concept learning.
Introduction
  • Remarkable human visual cognition is exemplified by the ability to learn new concepts from only a few examples and to use the acquired concepts in diverse ways.
  • In contrast to human concept learning, data-driven approaches to machine perception must be trained on massive datasets, and their ability to reuse acquired concepts in new situations is bounded by the training data [3].
  • For this reason, researchers in cognitive science and artificial intelligence (AI) have attempted to explain the chasm between human-level visual cognition and machine-based pattern recognition.
Highlights
  • We introduced a new visual cognition benchmark that emphasizes concept learning and reasoning.
  • Our benchmark, named BONGARD-LOGO, is inspired by the original Bongard Problems (BPs) [11], which were carefully designed in the 1960s to demonstrate the chasm between human visual cognition and computerized pattern recognition.
  • In a similar vein to the original one hundred BPs, our benchmark aims for a new form of human-like perception that is context-dependent, analogical, and few-shot with an infinite vocabulary (an illustrative sketch of the task format follows this list).
  • It opens the door to many research questions: What types of inductive biases would benefit concept learning models? Are deep neural networks the ultimate key to human-like visual cognition? How is the Bongard-style visual reasoning connected to semantics and pragmatics in natural language?
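For readers unfamiliar with the task format, below is a minimal sketch of a Bongard-style problem treated as a few-shot binary-classification episode. The data structure, field names, and solver interface are hypothetical stand-ins (the benchmark's actual image counts and API are not given here); only the scoring convention, with its 50% chance level, follows the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

import numpy as np

Image = np.ndarray  # e.g., a rendered 2D shape


@dataclass
class BongardProblem:
    """Hypothetical container for one Bongard-style episode."""
    positives: List[Image]    # support images satisfying the hidden concept
    negatives: List[Image]    # support images violating it
    queries: List[Image]      # held-out images to classify
    query_labels: List[int]   # 1 = satisfies the concept, 0 = violates it


def accuracy(problems: Sequence[BongardProblem],
             solver: Callable[[BongardProblem], List[int]]) -> float:
    """Fraction of query images classified correctly; chance level is 50%."""
    correct = total = 0
    for p in problems:
        predictions = solver(p)
        correct += sum(int(a == b) for a, b in zip(predictions, p.query_labels))
        total += len(p.query_labels)
    return correct / total
```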
Methods
  • Novel abstract shape test set: this test set (NV) includes 320 abstract shape problems.
  • Unlike the construction of the combinatorial abstract shape test set (CM), the authors hold out one attribute, along with all its combinations with other attributes, from the training set (see the sketch after this list).
  • All problems related to the held-out attribute appear exclusively in this test set.
  • The authors choose "have_eight_straight_lines" as the held-out attribute, as it presumably requires minimal effort for the model to extrapolate, given that other similar "have_[xxx]_straight_lines" attributes already exist in the training set.
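A minimal sketch of the held-out split described above, assuming each abstract-shape problem is tagged with the set of attribute strings its concept involves (this tagging interface is an assumption, not the paper's actual data format):

```python
# Hypothetical split logic for the novel abstract shape test set (NV):
# one attribute, and every combination it appears in, is held out.
HELD_OUT = "have_eight_straight_lines"


def split_nv(tagged_problems):
    """tagged_problems: iterable of (problem, attributes) pairs, where
    `attributes` is the set of attribute strings the problem's concept uses."""
    train, nv_test = [], []
    for problem, attributes in tagged_problems:
        # Any problem whose concept involves the held-out attribute,
        # alone or in combination with other attributes, goes to NV only.
        (nv_test if HELD_OUT in attributes else train).append(problem)
    return train, nv_test
```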
Results
  • The authors report the test accuracy (Acc) of different methods on each of the four test sets and compare the results to human performance in Table 1.
  • To familiarize the human subjects with BONGARD-LOGO problems, the authors describe each problem type and provide detailed instructions on how to solve the problems.
  • It normally takes 30–60 minutes for human subjects to fully digest the instructions.
  • Based on the total time each subject spends digesting the instructions and completing all tasks, the authors split the 12 human subjects into two equal groups: Human (Expert), who fully understand and carefully follow the instructions, and Human (Amateur), who quickly skim the instructions or do not follow them at all.
Conclusion
  • The authors introduced a new visual cognition benchmark that emphasizes concept learning and reasoning.
  • The authors' benchmark, named BONGARD-LOGO, is inspired by the original BPs [11], which were carefully designed in the 1960s to demonstrate the chasm between human visual cognition and computerized pattern recognition.
  • To fuel research toward new computational architectures that give rise to such human-like perception, the authors develop a program-guided problem generation technique that enables them to produce a large-scale dataset of 12,000 human-interpretable problems, making it digestible by today's data-driven learning methods (a loose illustration follows this list).
  • It opens the door to many research questions: What types of inductive biases would benefit concept learning models? Are deep neural networks the ultimate key to human-like visual cognition? How is the Bongard-style visual reasoning connected to semantics and pragmatics in natural language?
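The benchmark's shapes are drawn with action programs in the LOGO language [16]. As a loose illustration of what program-guided generation can look like, the sketch below renders a free-form shape from a random sequence of line and arc strokes using Python's standard turtle module; the stroke vocabulary and parameter ranges are invented for illustration and are not the authors' actual action space.

```python
import random
import turtle


def draw_random_shape(n_strokes: int = 6, seed: int = 0) -> None:
    """Render one shape from a random action program (illustrative only)."""
    random.seed(seed)
    pen = turtle.Turtle()
    pen.speed(0)
    for _ in range(n_strokes):
        pen.left(random.uniform(-90, 90))           # random heading change
        if random.random() < 0.5:
            pen.forward(random.uniform(30, 80))     # straight-line stroke
        else:
            pen.circle(random.uniform(20, 60),      # circular-arc stroke
                       extent=random.uniform(30, 120))
    turtle.done()


if __name__ == "__main__":
    draw_random_shape()
```

Sampling many such programs and rendering them is the kind of pipeline that makes a 12,000-problem dataset cheap to produce.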
Tables
  • Table 1: Model performance versus human performance on BONGARD-LOGO. We report the test accuracy (%) on the different dataset splits: the free-form shape test set (FF), basic shape test set (BA), combinatorial abstract shape test set (CM), and novel abstract shape test set (NV). For human evaluation, we report separate results for two groups of subjects: Human (Expert), who fully understand and carefully follow the instructions, and Human (Amateur), who quickly skim the instructions or do not follow them at all. Chance performance is 50%.
  • Table 2: Model performance versus human performance on a variant of BONGARD-LOGO that contains only the 12,000 free-form shape problems. We report the training and test accuracy (%) on the free-form shape test set (FF), with human results split into the same Expert and Amateur groups as in Table 1. Chance performance is 50%.
  • Table 3: Model performance versus human performance on BONGARD-LOGO, with the same dataset splits and human groups as Table 1, plus Meta-Baseline-PS, the Meta-Baseline based on program synthesis, which incorporates symbolic information to solve the task. We report the test accuracy (%); chance performance is 50%.
Related work
  • Few-shot learning and meta-learning: the goal of few-shot learning is to learn a new task (e.g., recognizing new object categories) from a small amount of training data. Pioneering works approached it with Bayesian inference [32] and metric learning [33]. A rising trend is to formulate few-shot learning as meta-learning [4, 5]. These methods fall into three families: 1) memory-based methods, e.g., a variant of MANN [29] and SNAIL [19]; 2) metric-based methods, e.g., Matching Networks [34] and ProtoNet [20]; and 3) optimization-based methods, e.g., MAML [5], MetaOptNet [21], and ANIL [22]. Recent work [23, 27] has achieved competitive or even better performance on few-shot image recognition benchmarks [3, 35] with a simple pre-training baseline than with advanced meta-learning algorithms, prompting a rethinking of few-shot image classification benchmarks and the role of meta-learning algorithms in them.
Findings
  • Human (Expert) easily achieves nearly perfect performance (>99% test accuracy) on the basic shape test set, while the best-performing models only reach around 70% accuracy.
  • On the free-form shape, combinatorial abstract shape, and novel abstract shape test sets, where the infinite vocabulary or abstract attributes make the problems more challenging, Human (Expert) still attains high performance (>90% test accuracy), while all models reach only around or below 65% accuracy.
  • WReN, the model designed for solving RPMs [17], suffers from the most severe overfitting: its training accuracy is around 78%, but its test accuracies are all only marginally better than random guessing (50%).
  • A large gap remains between model and human performance even on free-form shape problems alone (74.5% vs. 92.1%).
Study subjects and analysis
human subjects: 12
We put the experiment setup for training these methods in Appendix C, and the results are averaged across three different runs. To measure human performance on our benchmark, we have 12 human subjects each solve 20 randomly sampled problems from each test set. For human evaluation, we do not differentiate between test set (CM) and test set (NV) and thus report a single score for the two, as humans essentially perform the same kind of abstract concept discovery on both test sets.

human subjects: 12
It normally takes 30–60 minutes for human subjects to fully digest the instructions. Based on the total time each subject spends digesting the instructions and completing all tasks, we split the 12 human subjects into two equal groups: Human (Expert), who fully understand and carefully follow the instructions, and Human (Amateur), who quickly skim the instructions or do not follow them at all. Performance analysis: Table 1 shows a significant gap between the Human (Expert) performance and the best model performance across all test sets.

Reference
  • [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, IEEE, 2009.
  • [2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision, pp. 740–755, Springer, 2014.
  • [3] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, "Human-level concept learning through probabilistic program induction," Science, vol. 350, no. 6266, pp. 1332–1338, 2015.
  • [4] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," in ICLR, 2016.
  • [5] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proceedings of the 34th International Conference on Machine Learning, pp. 1126–1135, JMLR.org, 2017.
  • [6] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al., "Relational inductive biases, deep learning, and graph networks," arXiv preprint arXiv:1806.01261, 2018.
  • [7] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick, "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910, 2017.
  • [8] T. R. Besold, A. d. Garcez, S. Bader, H. Bowman, P. Domingos, P. Hitzler, K.-U. Kühnberger, L. C. Lamb, D. Lowd, P. M. V. Lima, et al., "Neural-symbolic learning and reasoning: A survey and interpretation," arXiv preprint arXiv:1711.03902, 2017.
  • [9] A. d. Garcez, T. R. Besold, L. De Raedt, P. Földiak, P. Hitzler, T. Icard, K.-U. Kühnberger, L. C. Lamb, R. Miikkulainen, and D. L. Silver, "Neural-symbolic learning and reasoning: Contributions and challenges," in 2015 AAAI Spring Symposium Series, 2015.
  • [10] D. R. Hofstadter, Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought. Basic Books, 1995.
  • [11] M. M. Bongard, "The recognition problem," Tech. Rep., Foreign Technology Div., Wright-Patterson AFB, Ohio, 1968.
  • [12] K. Saito and R. Nakano, "A concept learning algorithm with adaptive search," in Machine Intelligence 14: Applied Machine Intelligence, pp. 347–363, 1996.
  • [13] S. Depeweg, C. A. Rothkopf, and F. Jäkel, "Solving Bongard problems with a visual language and pragmatic reasoning," arXiv preprint arXiv:1804.04452, 2018.
  • [14] D. J. Chalmers, R. M. French, and D. R. Hofstadter, "High-level perception, representation, and analogy: A critique of artificial intelligence methodology," Journal of Experimental & Theoretical Artificial Intelligence, vol. 4, no. 3, pp. 185–211, 1992.
  • [15] A. Linhares, "A glimpse at the metaphysics of Bongard problems," Artificial Intelligence, vol. 121, no. 1–2, pp. 251–270, 2000.
  • [16] H. Abelson, N. Goodman, and L. Rudolph, "Logo manual," 1974.
  • [17] D. G. Barrett, F. Hill, A. Santoro, A. S. Morcos, and T. Lillicrap, "Measuring abstract reasoning in neural networks," arXiv preprint arXiv:1807.04225, 2018.
  • [18] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P.-A. Manzagol, et al., "Meta-Dataset: A dataset of datasets for learning to learn from few examples," arXiv preprint arXiv:1903.03096, 2019.
  • [19] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, "A simple neural attentive meta-learner," in ICLR, 2018.
  • [20] J. Snell, K. Swersky, and R. S. Zemel, "Prototypical networks for few-shot learning," in Advances in Neural Information Processing Systems, 2017.
  • [21] K. Lee, S. Maji, A. Ravichandran, and S. Soatto, "Meta-learning with differentiable convex optimization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10657–10665, 2019.
  • [22] A. Raghu, M. Raghu, S. Bengio, and O. Vinyals, "Rapid learning or feature reuse? Towards understanding the effectiveness of MAML," in ICLR, 2020.
  • [23] Y. Chen, X. Wang, Z. Liu, H. Xu, and T. Darrell, "A new meta-baseline for few-shot learning," arXiv preprint arXiv:2003.04390, 2020.
  • [24] M. Lázaro-Gredilla, D. Lin, J. S. Guntupalli, and D. George, "Beyond imitation: Zero-shot task transfer on robots by learning concepts as cognitive programs," Science Robotics, vol. 4, no. 26, 2019.
  • [25] B. Indurkhya, Metaphor and Cognition: An Interactionist Approach, vol. 13. Springer Science & Business Media, 2013.
  • [26] K. J. Holyoak, D. Gentner, and B. N. Kokinov, "Introduction: The place of analogy in cognition," in The Analogical Mind: Perspectives from Cognitive Science, pp. 1–19, 2001.
  • [27] Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola, "Rethinking few-shot image classification: A good embedding is all you need?," arXiv preprint arXiv:2003.11539, 2020.
  • [28] J. B. Tenenbaum and F. Xu, "Word learning as Bayesian inference," in Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 22, 2000.
  • [29] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, "Meta-learning with memory-augmented neural networks," in International Conference on Machine Learning, pp. 1842–1850, 2016.
  • [30] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," arXiv preprint arXiv:1911.05722, 2019.
  • [31] J. C. Raven, "Standardization of progressive matrices, 1938," British Journal of Medical Psychology, vol. 19, no. 1, pp. 137–150, 1941.
  • [32] L. Fe-Fei et al., "A Bayesian approach to unsupervised one-shot learning of object categories," in Proceedings of the Ninth IEEE International Conference on Computer Vision, pp. 1134–1141, IEEE, 2003.
  • [33] G. Koch, R. Zemel, and R. Salakhutdinov, "Siamese neural networks for one-shot image recognition," in ICML Deep Learning Workshop, vol. 2, Lille, 2015.
  • [34] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al., "Matching networks for one shot learning," in Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.
  • [35] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel, "Meta-learning for semi-supervised few-shot classification," in ICLR, 2018.
  • [36] S. K. Divvala, A. Farhadi, and C. Guestrin, "Learning everything about anything: Webly-supervised visual concept learning," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [37] N. Hay, M. Stark, A. Schlegel, C. Wendelken, D. Park, E. Purdy, T. Silver, D. S. Phoenix, and D. George, "Behavior is everything: Towards representing concepts with sensorimotor contingencies," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [38] J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu, "The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision," in ICLR, 2019.
  • [39] C. Han, J. Mao, C. Gan, J. Tenenbaum, and J. Wu, "Visual concept-metaconcept learning," in Advances in Neural Information Processing Systems, pp. 5002–5013, 2019.
  • [40] A. Bakhtin, L. van der Maaten, J. Johnson, L. Gustafson, and R. Girshick, "PHYRE: A new benchmark for physical reasoning," in Advances in Neural Information Processing Systems, pp. 5083–5094, 2019.
  • [41] K. R. Allen, K. A. Smith, and J. B. Tenenbaum, "The Tools challenge: Rapid trial-and-error learning in physical problem solving," arXiv preprint arXiv:1907.09620, 2019.
  • [42] E. Weitnauer and H. Ritter, "Physical Bongard problems," in IFIP International Conference on Artificial Intelligence Applications and Innovations, pp. 157–163, Springer, 2012.
  • [43] D. Saxton, E. Grefenstette, F. Hill, and P. Kohli, "Analysing mathematical reasoning abilities of neural models," arXiv preprint arXiv:1904.01557, 2019.
  • [44] P. A. Carpenter, M. A. Just, and P. Shell, "What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices test," Psychological Review, vol. 97, no. 3, p. 404, 1990.
  • [45] C. Zhang, F. Gao, B. Jia, Y. Zhu, and S.-C. Zhu, "RAVEN: A dataset for relational and analogical visual reasoning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5317–5327, 2019.
  • [46] D. Teney, P. Wang, J. Cao, L. Liu, C. Shen, and A. v. d. Hengel, "V-PROM: A benchmark for visual reasoning using visual progressive matrices," in AAAI, 2020.
  • [47] M. Kunda, K. McGreggor, and A. K. Goel, "A computational model for solving problems from the Raven's Progressive Matrices intelligence test using iconic visual representations," Cognitive Systems Research, vol. 22, pp. 47–66, 2013.
  • [48] T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, et al., "Never-ending learning," Communications of the ACM, vol. 61, no. 5, pp. 103–115, 2018.
  • [49] D. Ha and J. Schmidhuber, "World models," arXiv preprint arXiv:1803.10122, 2018.
  • [50] C. M. Bishop, "Mixture density networks," 1994.
  • [20] ProtoNet is a metric-based meta-learning method. It proposes prototypical networks, based on the idea that each class can be represented by the mean of its examples in a representation space learned by a neural network. In the learned metric space, classification is performed by computing distances to the prototype representation of each class (a minimal sketch follows these notes).
  • [21] MetaOptNet is an optimization-based meta-learning method. It proposes to learn a feature representation that generalizes well for a linear support vector machine (SVM) classifier (a simplified sketch also follows these notes).
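A minimal sketch of the prototypical-network classification rule described in the note on [20] above, operating on embeddings from some learned encoder (the function name and array interface are ours, not the original implementation):

```python
import numpy as np


def protonet_predict(support_emb, support_labels, query_emb):
    """Nearest-prototype classification in a learned embedding space.

    support_emb:    (n_support, d) support-set embeddings
    support_labels: (n_support,) integer class labels
    query_emb:      (n_query, d) query embeddings
    Returns:        (n_query,) predicted class labels
    """
    support_emb = np.asarray(support_emb)
    support_labels = np.asarray(support_labels)
    query_emb = np.asarray(query_emb)
    classes = np.unique(support_labels)
    # Each class prototype is the mean of that class's support embeddings.
    prototypes = np.stack([support_emb[support_labels == c].mean(axis=0)
                           for c in classes])
    # Assign each query to its nearest prototype (squared Euclidean distance).
    dists = ((query_emb[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return classes[dists.argmin(axis=1)]
```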
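Correspondingly, a simplified sketch of the SVM head described in the note on [21]: in MetaOptNet the embedding network is meta-trained by differentiating through the convex SVM problem, which this sketch omits; it only refits a linear SVM per episode on frozen embeddings, using scikit-learn purely for illustration:

```python
from sklearn.svm import LinearSVC


def svm_head_predict(support_emb, support_labels, query_emb, C=1.0):
    """Per-episode linear SVM on top of (frozen) learned embeddings.

    Note: MetaOptNet backpropagates through the SVM solver to train the
    encoder; this simplified sketch only performs test-time fitting.
    """
    clf = LinearSVC(C=C)
    clf.fit(support_emb, support_labels)  # fit on the episode's support set
    return clf.predict(query_emb)         # classify the query embeddings
```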