COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs

Jena D. Hwang
Jeff Da
Antoine Bosselut

Abstract:

Recent years have brought about a renewed interest in commonsense representation and reasoning in the field of natural language understanding. The development of new commonsense knowledge graphs (CSKG) has been central to these advances as their diverse facts can be used and referenced by machine learning models for tackling new and challenging tasks. […]

Introduction
  • Commonsense understanding and reasoning remain longstanding challenges in general artificial intelligence.
  • A new paradigm of language models as knowledge bases has emerged (Petroni et al. 2019).
  • The authors evaluate three existing knowledge graphs, CONCEPTNET, ATOMIC, and TRANSOMCS, on their coverage and precision relative to the new resource ATOMIC 2020.
  • Out of 3.4M tuples, 90% correspond to taxonomic (e.g., IsA) or lexical (e.g., Synonym, RelatedTo) knowledge, making the commonsense portion of CONCEPTNET (v5.7) relatively small
Highlights
  • Commonsense understanding and reasoning remain longstanding challenges in general artificial intelligence
  • We show that ATOMIC 2020 as a transfer resource leads to COMET models that achieve the largest increase over their seed language model for the commonsense knowledge types it covers, validating the importance of constructing knowledge resources with examples of knowledge not readily found in language models
  • Previous work shows knowledge graphs can help language models better transfer as knowledge engines (Bosselut et al. 2019) by re-training them on examples of structured knowledge
  • To evaluate whether knowledge graphs can help language models effectively transfer to knowledge models, we train different pretrained language models on the knowledge graphs described in Section 4, which we describe below: GPT2 (Radford et al. 2019) is a Transformer-based (Vaswani et al. 2017) language model
  • We note the large divide between the zero-shot GPT2-XL model that produces commonsense knowledge without any fine-tuning and the two COMET models across the ATOMIC 2020, ATOMIC, and CONCEPTNET knowledge graphs (Table 6). This large gap indicates that language models can benefit from learning facts from commonsense knowledge graphs
  • We show that ATOMIC 2020 can be effectively used as a training set for adapting language models as knowledge models to generate high-quality tuples on-demand
Methods
  • To evaluate whether knowledge graphs can help language models effectively transfer to knowledge models, the authors train different pretrained language models on the knowledge graphs described in Section 4. The models are described below: GPT2 (Radford et al. 2019) is a Transformer-based (Vaswani et al. 2017) language model.
  • The authors use the largest GPT2 model, GPT2-XL, which has 1.5B parameters.
  • The authors fine-tune GPT2-XL on each of the CSKGs to predict the tail of a tuple given the head and a relation (e.g., MadeUpOf).
  • The authors use GPT2-XL in a zero-shot setting as a baseline to measure the effect of transfer learning on knowledge graphs.
  • The authors manually convert each relation to an English-language prompt, expecting the tail of each tuple as the output generated by the model (see the formatting sketch after this list).
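To make these two settings concrete, the sketch below shows one plausible way to linearize a (head, relation, tail) tuple into a fine-tuning sequence and to render a relation as an English prompt for the zero-shot baseline. The "[GEN]" separator and the prompt wordings are illustrative assumptions rather than the paper's exact templates (the actual templates are listed in Table 9).

```python
# A minimal sketch, under assumed templates, of how a (head, relation, tail)
# tuple might be turned into (a) a fine-tuning sequence and (b) a zero-shot
# English prompt. The '[GEN]' separator and the wordings below are
# illustrative assumptions, not the paper's exact templates (cf. Table 9).

# Hypothetical natural-language renderings for a few relations.
PROMPTS = {
    "MadeUpOf": "{head} is made up of",
    "ObjectUse": "{head} is typically used for",
    "xReact": "{head}. As a result, PersonX feels",
}

def format_for_finetuning(head: str, relation: str, tail: str) -> str:
    """Training sequence: the model learns to generate the tail that follows
    the head/relation prefix."""
    return f"{head} {relation} [GEN] {tail}"

def format_zero_shot(head: str, relation: str) -> str:
    """Zero-shot prompt: the relation is rendered as English so an unadapted
    language model can complete it with a plausible tail."""
    return PROMPTS[relation].format(head=head)

if __name__ == "__main__":
    print(format_for_finetuning("bread", "MadeUpOf", "flour, water, and yeast"))
    print(format_zero_shot("bread", "MadeUpOf"))
```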
Results
  • ATOMIC 2020 outperforms other KGs in crowdsourced accuracy, as shown in Table 2. ATOMIC ties with CONCEPTNET with reasonably high accuracy, while TRANSOMCS lags behind the others with far lower accuracy.
  • The authors note the large divide between the zero-shot GPT2-XL model, which produces commonsense knowledge without any fine-tuning, and the two COMET models across the ATOMIC 2020, ATOMIC, and CONCEPTNET knowledge graphs (Table 6); a minimal zero-shot query sketch follows this list
  • This large gap indicates that language models can benefit from learning facts from commonsense knowledge graphs.
  • Language models do not have the means to precisely express this knowledge from pretraining on language alone.
  • The performance gap indicates that high-quality declarative knowledge is valuable even after the advent of extreme-scale language models
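For reference, the zero-shot baseline discussed above amounts to prompting an off-the-shelf GPT2-XL and decoding greedily. The sketch below uses the Hugging Face transformers library; the prompt wording is an assumed rendering of a (head, relation) prefix, not the paper's exact template.

```python
# A minimal sketch of the zero-shot baseline: querying an off-the-shelf
# GPT2-XL with an English prompt and greedy decoding. The prompt wording
# is an illustrative assumption.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

prompt = "bread is made up of"            # assumed rendering of (bread, MadeUpOf, ?)
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) mirrors the evaluation setup in Table 6.
outputs = model.generate(
    **inputs,
    max_new_tokens=10,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```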
Conclusion
  • Do pretrained language models already encode commonsense knowledge? The authors' conclusions on this subject are mixed and hinge on the ambiguity of what it means to encode knowledge.
  • The authors look forward to future work in this space that attempts to disentangle these two ideas. In this work, the authors formalize a use for commonsense knowledge graphs as transfer-learning tools for pretrained language models.
  • The authors propose ATOMIC 2020, a novel commonsense knowledge graph containing tuples whose relations are selected to be challenging for pretrained language models to express.
  • The authors show that ATOMIC 2020 can be effectively used as a training set for adapting language models as knowledge models to generate high-quality tuples on-demand
Tables
  • Table1: Relations in ATOMIC 2020 along with illustrative examples and their respective sizes. Relations that reflect semantically identical categories to CONCEPTNET are marked with an asterisk (∗)
  • Table2: Accuracy - Percentage (%) of tuples in the knowledge base evaluated by human crowdworkers as either always true or likely (Accept), farfetched/never or invalid (Reject), or unclear (No Judgment)
  • Table3: KG accuracy values broken down by relation
  • Table4: Coverage Precision - Average number of times (in %) a tuple in the Source KB is found in the Target KB (a simplified sketch of this coverage computation follows this list)
  • Table5: Coverage Recall - Average number of times (in %) a tuple in the Target KB is found in the Source KB. †This value can exceed 100 because multiple tuples in ATOMIC 2020 can map to the same tuple in ATOMIC
  • Table6: Human evaluation of generation accuracy (%). Each model uses greedy decoding to generate the tail of 5K randomly-sampled test prefixes (head, relation) from each knowledge graph. GPT2-XL, GPT-3 and BART have 1.5B, 175B and 440M parameters, respectively
  • Table7: Automated metrics for the quality of the tail generations of the GPT2-XL language model and the knowledge models COMET(GPT2-XL) and COMET(BART). Each approach uses greedy decoding for sampled 5k test prefixes for each KG. The 5k prefixes correspond to the ones for the human eval. Similar results are obtained on the full test sets (cf. Appendix C)
  • Table8: CONCEPTNET relations mapped to ATOMIC 2020 relations. For labels mapping to multiple ATOMIC 2020 relations, the one that received the majority mapping is bolded
  • Table9: Human readable templates for each relation used for crowdsourced human evaluations
  • Table10: Number of tuples per KB and per split
  • Table11: Automated metrics for the quality of the tail generations for the knowledge models COMET(GPT2-XL) and COMET(BART). Each approach uses greedy decoding for all test prefixes for each KG. Similar results were obtained on the 5K sampled prefixes that were randomly selected for the human evaluation (see Table 7)
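The coverage metrics in Tables 4 and 5 can be read as directional containment rates between two tuple sets. Below is a minimal sketch under the simplifying assumption of exact tuple matching after normalization; the paper's actual matching and relation-mapping procedure (cf. Table 8) may be more permissive.

```python
# A rough sketch of the coverage metrics in Tables 4 and 5 under the
# simplifying assumption of exact tuple matching after lowercasing.
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

def normalize(t: Triple) -> Triple:
    return tuple(x.strip().lower() for x in t)

def coverage(source: Iterable[Triple], target: Iterable[Triple]) -> float:
    """Percentage of source tuples found in the target KB: the Table 4
    direction when source=Source KB and target=Target KB; swapping the
    arguments gives the recall direction of Table 5."""
    src = [normalize(t) for t in source]
    tgt = {normalize(t) for t in target}
    if not src:
        return 0.0
    return 100.0 * sum(t in tgt for t in src) / len(src)

if __name__ == "__main__":
    atomic2020 = [("bread", "madeupof", "flour"), ("bread", "objectuse", "eat")]
    conceptnet = [("bread", "madeupof", "flour")]
    print(coverage(atomic2020, conceptnet))  # 50.0
    print(coverage(conceptnet, atomic2020))  # 100.0
```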
Funding
  • Tuples that were directly incorporated without further edits passed with an acceptance rate of 93% or higher
Study subjects and analysis
workers: 3
The workers were also given a choice to opt out of assessment if the concepts were too unfamiliar for a fair evaluation (No Judgment). Each task (HIT) included 5 tuples of the same relation type, and each tuple was labeled by 3 workers. For the results, we take the majority vote among the 3 workers
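To make the aggregation step concrete, here is a small sketch of the majority vote over the three worker labels; the tie-breaking rule when all three labels differ is an assumption, since it is not specified above.

```python
# A small sketch of the aggregation described above: each tuple receives
# three crowdworker labels (Accept / Reject / No Judgment) and the majority
# label wins.
from collections import Counter
from typing import List

def majority_vote(labels: List[str]) -> str:
    """Return the label chosen by at least 2 of the 3 workers; if all three
    disagree, fall back to 'No Judgment' (this tie-breaking rule is an
    assumption, not specified in the summary above)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else "No Judgment"

print(majority_vote(["Accept", "Accept", "Reject"]))       # Accept
print(majority_vote(["Accept", "Reject", "No Judgment"]))  # No Judgment
```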

Reference
  • Ammanabrolu, P.; Cheung, W.; Broniec, W.; and Riedl, M. 2020. Automated Storytelling via Causal, Commonsense Plot Ordering. ArXiv abs/2009.00829.
  • Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; and Ives, Z. G. 2007. DBpedia: A Nucleus for a Web of Open Data. In ISWC/ASWC.
  • Bisk, Y.; Zellers, R.; Le Bras, R.; Gao, J.; and Choi, Y. 2020. PIQA: Reasoning about Physical Commonsense in Natural Language. In AAAI.
  • Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; and Choi, Y. 2019. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. In ACL.
  • Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Kruger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. ArXiv abs/2005.14165.
  • Chakrabarty, T.; Ghosh, D.; Muresan, S.; and Peng, N. 2020. R3: Reverse, Retrieve, and Rank for Sarcasm Generation with Commonsense Knowledge. In ACL.
  • Davis, E.; and Marcus, G. 2015. Commonsense Reasoning and Commonsense Knowledge in Artificial Intelligence. Commun. ACM 58(9): 92–103.
  • Davison, J.; Feldman, J.; and Rush, A. 2019. Commonsense Knowledge Mining from Pretrained Models. In EMNLP-IJCNLP, 1173–1178. Hong Kong, China.
  • Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
  • Feng, Y.; Chen, X.; Lin, B. Y.; Wang, P.; Yan, J.; and Ren, X. 2020. Scalable Multi-Hop Relational Reasoning for Knowledge-Aware Question Answering. ArXiv abs/2005.00646.
  • Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin 76(5): 378.
  • Gordon, J.; and Van Durme, B. 2013. Reporting bias and knowledge acquisition. In AKBC ’13. ACM.
  • Herskovits, A. 2009. Language and Spatial Cognition: An Interdisciplinary Study of the Prepositions in English.
  • Kearns, W. R.; Kaura, N.; Divina, M.; Vo, C. V.; Si, D.; Ward, T. M.; and Yuwen, W. 2020. A Wizard-of-Oz Interface and Persona-based Methodology for Collecting Health Counseling Dialog. Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems.
  • Landau, B.; and Jackendoff, R. 1991. Spatial language and spatial cognition. Bridges Between Psychology and Linguistics: A Swarthmore Festschrift for Lila Gleitman 145.
  • Lascarides, A.; and Asher, N. 1991. Discourse relations and defeasible knowledge. In ACL.
  • Lenat, D. B. 1995. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM.
  • Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL.
  • Li, X.; Taheri, A.; Tu, L.; and Gimpel, K. 2016. Commonsense Knowledge Base Completion. In ACL.
  • Lin, B. Y.; Chen, X.; Chen, J.; and Ren, X. 2019. KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning. In EMNLP/IJCNLP.
  • Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out.
  • Liu, Y.; Yang, T.; You, Z.; Fan, W.; and Yu, P. S. 2020. Commonsense Evidence Generation and Injection in Reading Comprehension. In SIGDIAL.
  • Miller, G. A. 1995. WordNet: A Lexical Database for English. Communications of the ACM.
  • Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
  • Petroni, F.; Rocktaschel, T.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller, A. H.; and Riedel, S. 2019. Language Models as Knowledge Bases? In EMNLP.
  • Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training.
  • Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners.
  • Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. ArXiv abs/1910.10683.
  • Roberts, A.; Raffel, C.; and Shazeer, N. 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model?
  • Sap, M.; Le Bras, R.; Allaway, E.; Bhagavatula, C.; Lourie, N.; Rashkin, H.; Roof, B.; Smith, N. A.; and Choi, Y. 2019. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. In AAAI.
  • Shwartz, V.; West, P.; Bras, R. L.; Bhagavatula, C.; and Choi, Y. 2020. Unsupervised Commonsense Question Answering with Self-Talk. ArXiv abs/2004.05483.
  • Speer, R.; Chin, J.; and Havasi, C. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In AAAI.
  • Talmy, L. 1988. Force Dynamics in Language and Cognition. Cognitive Science 12: 49–100.
  • Tamborrino, A.; Pellicano, N.; Pannier, B.; Voitot, P.; and Naudin, L. 2020. Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning. In ACL.
  • Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.
  • Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. Cider: Consensus-based image description evaluation. In CVPR, 4566–4575.
  • Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2019. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. ArXiv abs/1905.00537.
  • Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Scao, T. L.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. M. 2019. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. ArXiv abs/1910.03771.
  • Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; and Choi, Y. 2019. HellaSwag: Can a Machine Really Finish Your Sentence? In ACL.
  • Zhang, H.; Khashabi, D.; Song, Y.; and Roth, D. 2020a. TransOMCS: From Linguistic Graphs to Commonsense Knowledge. In IJCAI.
  • Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.; and Artzi, Y. 2020b. BERTScore: Evaluating Text Generation with BERT. ArXiv abs/1904.09675.
Appendix A: ATOMIC 2020 Details
  • Social-Interaction Commonsense. Social-interaction relations comment on socially-triggered states and behaviors. Social commonsense is useful for gauging people's intentions and purpose, and predicting situationally-relevant human reactions and behaviors. Following the definitions for ATOMIC relations (Sap et al. 2019), we identify a total of nine relations within this category.
  • For a discussion of force dynamics in the cognitive linguistic and lexical semantic literature, cf. Herskovits (2009); Landau and Jackendoff (1991); Talmy (1988).