COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs
Abstract:
Recent years have brought about a renewed interest in commonsense representation and reasoning in the field of natural language understanding. The development of new commonsense knowledge graphs (CSKG) has been central to these advances, as their diverse facts can be used and referenced by machine learning models for tackling new and challenging tasks.
Introduction
- Commonsense understanding and reasoning remain longstanding challenges in general artificial intelligence.
- A new paradigm of language models as knowledge bases has emerged (Petroni et al 2019).
- The authors evaluate three existing knowledge graphs, CONCEPTNET, ATOMIC, and TRANSOMCS, on their coverage and precision relative to the new resource ATOMIC 2020.
- Out of CONCEPTNET's 3.4M tuples, 90% correspond to taxonomic (e.g., IsA) or lexical (e.g., Synonym, RelatedTo) knowledge, making the commonsense portion of CONCEPTNET (v5.7) relatively small (a relation-counting sketch follows this list).
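As a rough illustration of that last point (not the paper's own analysis pipeline), the sketch below counts relation types in a ConceptNet-style assertions dump. The file name, the tab-separated layout with the relation URI in the second column, and the particular set of taxonomic/lexical relations are all assumptions.

```python
# Rough illustration (not the paper's analysis pipeline): count relation types
# in a ConceptNet-style assertions dump to estimate how much of the graph is
# taxonomic/lexical. File name, tab-separated layout with the relation URI in
# column 2, and the relation set below are assumptions.
import csv
from collections import Counter

TAXONOMIC_OR_LEXICAL = {
    "/r/IsA", "/r/Synonym", "/r/RelatedTo",        # illustrative subset only
    "/r/FormOf", "/r/DerivedFrom",
}

def relation_histogram(path: str) -> Counter:
    counts: Counter = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            counts[row[1]] += 1                     # row[1]: relation URI
    return counts

if __name__ == "__main__":
    counts = relation_histogram("assertions.csv")   # hypothetical local dump
    total = sum(counts.values())
    lexical = sum(n for rel, n in counts.items() if rel in TAXONOMIC_OR_LEXICAL)
    print(f"taxonomic/lexical share: {lexical / total:.1%}")
```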
Highlights
- Commonsense understanding and reasoning remain longstanding challenges in general artificial intelligence
- We show that ATOMIC 2020 as a transfer resource leads to COMET models that achieve the largest increase over their seed language model for the commonsense knowledge types it covers, validating the importance of constructing knowledge resources with examples of knowledge not readily found in language models.
- Previous work shows knowledge graphs can help language models better transfer as knowledge engines (Bosselut et al 2019) by re-training them on examples of structured knowledge
- To evaluate whether knowledge graphs can help language models effectively transfer to knowledge models, we train different pretrained language models on the knowledge graphs described in Section 4. GPT2 (Radford et al 2019) is a Transformer (Vaswani et al 2017) based language model (a tuple-serialization sketch follows this list).
- We note the large divide between the zero-shot GPT2-XL model, which produces commonsense knowledge without any fine-tuning, and the two COMET models across the ATOMIC 2020, ATOMIC, and CONCEPTNET knowledge graphs (Table 6). This large gap indicates that language models can benefit from learning facts from commonsense knowledge graphs.
- We show that ATOMIC 2020 can be effectively used as a training set for adapting language models as knowledge models to generate high-quality tuples on demand.
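Re-training a language model on structured knowledge boils down to serializing each KG tuple into text that a causal language model can be fine-tuned on. A minimal sketch follows; the "[GEN]" separator and the exact formatting are illustrative assumptions, not the paper's verbatim serialization.

```python
# Minimal sketch: serializing KG tuples into fine-tuning text for a causal LM.
# The "[GEN]" separator and spacing are illustrative assumptions, not the
# paper's exact serialization format.
def tuple_to_training_text(head: str, relation: str, tail: str) -> str:
    """Turn one (head, relation, tail) tuple into a single training string."""
    return f"{head} {relation} [GEN] {tail}"

examples = [
    ("PersonX buys groceries", "xNeed", "to go to the store"),
    ("bread", "MadeUpOf", "flour and water"),
]
for head, relation, tail in examples:
    print(tuple_to_training_text(head, relation, tail))
```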
Methods
- To evaluate whether knowledge graphs can help language models effectively transfer to knowledge models, the authors train different pretrained language models on the knowledge graphs described in Section 4. GPT2 (Radford et al 2019) is a Transformer (Vaswani et al 2017) based language model.
- The authors use the largest GPT2 model, GPT2-XL, which has 1.5B parameters.
- The authors fine-tune GPT2-XL on each of the CSKGs to predict the tail of a tuple given the head and a relation (e.g., MadeUpOf).
- The authors use GPT2-XL in a zero-shot setting as a baseline to measure the effect of transfer learning on knowledge graphs.
- The authors manually convert each relation to an English-language prompt and expect the model to generate the tail of each tuple as output (a zero-shot prompting sketch follows this list).
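The sketch below shows one way to run that zero-shot baseline with the HuggingFace transformers library (which the paper cites). The prompt templates, relation names, and generation length are illustrative assumptions, not the paper's exact prompts.

```python
# Minimal sketch of the zero-shot baseline, assuming the HuggingFace
# `transformers` library. The prompt templates below are illustrative, not the
# paper's exact English-language prompts; GPT2-XL needs several GB of memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

TEMPLATES = {                       # hypothetical relation-to-prompt mapping
    "xNeed": "{head}. Before that, PersonX needed to",
    "MadeUpOf": "{head} is made up of",
}

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

def generate_tail(head: str, relation: str) -> str:
    prompt = TEMPLATES[relation].format(head=head)
    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding (do_sample=False), mirroring the greedy setup in Table 6.
    output = model.generate(**inputs, max_new_tokens=12, do_sample=False)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

print(generate_tail("PersonX buys groceries", "xNeed"))
```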
Results
- ATOMIC 2020 outperforms the other KGs in crowdsourced accuracy, as shown in Table 2. ATOMIC ties with CONCEPTNET with reasonably high accuracy, while TRANSOMCS lags far behind with much lower accuracy.
- The authors note the large divide between the zero-shot GPT2-XL model, which produces commonsense knowledge without any fine-tuning, and the two COMET models across the ATOMIC 2020, ATOMIC, and CONCEPTNET knowledge graphs (Table 6)
- This large gap indicates that language models can benefit from learning facts from commonsense knowledge graphs.
- Language models do not have the means to precisely express this knowledge directly from pretraining on language alone.
- The performance gap indicates that high-quality declarative knowledge is valuable even after the advent of extreme-scale language models.
Conclusion
- Do pretrained language models already encode commonsense knowledge? The authors' conclusions on this subject are mixed and hinge on the ambiguous meaning of what it means to encode knowledge.
- The authors look forward to future work in this space that attempts to disentangle these two ideas. In this work, the authors formalize a use for commonsense knowledge graphs as transfer-learning tools for pretrained language models.
- The authors propose ATOMIC 2020, a novel commonsense knowledge graph containing tuples whose relations are selected to be challenging for pretrained language models to express.
- The authors show that ATOMIC 2020 can be effectively used as a training set for adapting language models as knowledge models to generate high-quality tuples on demand.
Tables
- Table 1: Relations in ATOMIC 2020 along with illustrative examples and their respective sizes. Relations that reflect semantically identical categories to CONCEPTNET are marked with an asterisk (∗)
- Table 2: Accuracy - Percentage (%) of tuples in the knowledge base evaluated by human crowdworkers as either always true or likely (Accept), farfetched/never or invalid (Reject), or unclear (No Judgment)
- Table 3: KG accuracy values broken down by relation
- Table 4: Coverage Precision - Average number of times (in %) a tuple in the Source KB is found in the Target KB
- Table 5: Coverage Recall - Average number of times (in %) a tuple in the Target KB is found in the Source KB. †This value is greater than 100 because multiple tuples in ATOMIC 2020 can map to the same tuple in ATOMIC
- Table 6: Human evaluation of generation accuracy (%). Each model uses greedy decoding to generate the tail of 5K randomly sampled test prefixes (head, relation) from each knowledge graph. GPT2-XL, GPT-3, and BART have 1.5B, 175B, and 440M parameters, respectively
- Table 7: Automated metrics for the quality of the tail generations of the GPT2-XL language model and the knowledge models COMET(GPT2-XL) and COMET(BART). Each approach uses greedy decoding for the 5K sampled test prefixes for each KG; these prefixes correspond to the ones used for the human evaluation. Similar results are obtained on the full test sets (cf. Appendix C); a metric-computation sketch follows this list
- Table 8: CONCEPTNET relations mapped to ATOMIC 2020 relations. For labels mapping to multiple ATOMIC 2020 relations, the one that received the majority mapping is bolded
- Table 9: Human-readable templates for each relation used for crowdsourced human evaluations
- Table 10: Number of tuples per KB and per split
- Table 11: Automated metrics for the quality of the tail generations for the knowledge models COMET(GPT2-XL) and COMET(BART). Each approach uses greedy decoding for all test prefixes for each KG. Similar results were obtained on the 5K sampled prefixes that were randomly selected for the human evaluation (see Table 7)
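Tables 7 and 11 report automated generation metrics; the reference list cites BLEU, ROUGE, CIDEr, and BERTScore. As a hedged illustration of how one such score can be computed over generated tails, the sketch below averages a smoothed BLEU-2 with NLTK. It is not the paper's evaluation pipeline.

```python
# Hedged illustration of one automated metric: average smoothed BLEU-2 over
# generated tails, computed with NLTK. This is not the paper's evaluation
# pipeline, which also reports other metrics.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def avg_bleu2(generations, references):
    """generations: list[str]; references: list[list[str]] of gold tails per prefix."""
    smooth = SmoothingFunction().method1
    scores = [
        sentence_bleu([r.split() for r in refs], hyp.split(),
                      weights=(0.5, 0.5), smoothing_function=smooth)
        for hyp, refs in zip(generations, references)
    ]
    return sum(scores) / len(scores)

print(avg_bleu2(["go to the store"], [["go to the store", "visit a shop"]]))
```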
Funding
- Tuples that were directly incorporated without further edits passed with an acceptance rate of 93% or higher
Study subjects and analysis
workers: 3
The workers were also given the option to opt out of assessment if the concepts were too unfamiliar for a fair evaluation (No Judgment). Each task (HIT) included 5 tuples of the same relation type, and each tuple was labeled by 3 workers. For the results, we take the majority vote among the 3 workers (a minimal aggregation sketch follows below).
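A minimal sketch of that 3-worker majority vote follows, assuming string labels such as "accept", "reject", and "no_judgment" (hypothetical names); tie handling here is arbitrary and is just one possible policy.

```python
# Minimal sketch of the 3-worker majority vote, assuming string labels such as
# "accept", "reject", "no_judgment". Tie handling here is arbitrary and is just
# one possible policy.
from collections import Counter

def majority_vote(labels_per_tuple):
    """labels_per_tuple: one list of 3 worker labels per evaluated tuple."""
    return [Counter(labels).most_common(1)[0][0] for labels in labels_per_tuple]

votes = majority_vote([
    ["accept", "accept", "reject"],
    ["reject", "no_judgment", "reject"],
])
print(votes)  # ['accept', 'reject']
```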
References
- Ammanabrolu, P.; Cheung, W.; Broniec, W.; and Riedl, M. 2020. Automated Storytelling via Causal, Commonsense Plot Ordering. ArXiv abs/2009.00829.
- Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; and Ives, Z. G. 2007. DBpedia: A Nucleus for a Web of Open Data. In ISWC/ASWC.
- Bisk, Y.; Zellers, R.; Le Bras, R.; Gao, J.; and Choi, Y. 2020. PIQA: Reasoning about Physical Commonsense in Natural Language. In AAAI.
- Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; and Choi, Y. 2019. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. In ACL.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Kruger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. ArXiv abs/2005.14165.
- Chakrabarty, T.; Ghosh, D.; Muresan, S.; and Peng, N. 2020. R3: Reverse, Retrieve, and Rank for Sarcasm Generation with Commonsense Knowledge. In ACL.
- Davis, E.; and Marcus, G. 2015. Commonsense Reasoning and Commonsense Knowledge in Artificial Intelligence. Commun. ACM 58(9): 92–103.
- Davison, J.; Feldman, J.; and Rush, A. 2019. Commonsense Knowledge Mining from Pretrained Models. In EMNLP-IJCNLP, 1173–1178. Hong Kong, China.
- Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
- Feng, Y.; Chen, X.; Lin, B. Y.; Wang, P.; Yan, J.; and Ren, X. 2020. Scalable Multi-Hop Relational Reasoning for Knowledge-Aware Question Answering. ArXiv abs/2005.00646.
- Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin 76(5): 378.
- Gordon, J.; and Van Durme, B. 2013. Reporting bias and knowledge acquisition. In AKBC ’13. ACM.
- Herskovits, A. 2009. Language and Spatial Cognition: An Interdisciplinary Study of the Prepositions in English.
- Kearns, W. R.; Kaura, N.; Divina, M.; Vo, C. V.; Si, D.; Ward, T. M.; and Yuwen, W. 2020. A Wizard-of-Oz Interface and Persona-based Methodology for Collecting Health Counseling Dialog. Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems.
- Landau, B.; and Jackendoff, R. 1991. Spatial language and spatial cognition. Bridges Between Psychology and Linguistics: A Swarthmore Festschrift for Lila Gleitman 145.
- Lascarides, A.; and Asher, N. 1991. Discourse relations and defeasible knowledge. In ACL.
- Lenat, D. B. 1995. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM.
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL.
- Li, X.; Taheri, A.; Tu, L.; and Gimpel, K. 2016. Commonsense Knowledge Base Completion. In ACL.
- Lin, B. Y.; Chen, X.; Chen, J.; and Ren, X. 2019. KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning. In EMNLP/IJCNLP.
- Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out.
- Liu, Y.; Yang, T.; You, Z.; Fan, W.; and Yu, P. S. 2020. Commonsense Evidence Generation and Injection in Reading Comprehension. In SIGDIAL.
- Miller, G. A. 1995. WordNet: A Lexical Database for English. Communications of the ACM.
- Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
- Petroni, F.; Rocktaschel, T.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller, A. H.; and Riedel, S. 2019. Language Models as Knowledge Bases? In EMNLP.
- Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. ArXiv abs/1910.10683.
- Roberts, A.; Raffel, C.; and Shazeer, N. 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model?
- Sap, M.; Le Bras, R.; Allaway, E.; Bhagavatula, C.; Lourie, N.; Rashkin, H.; Roof, B.; Smith, N. A.; and Choi, Y. 2019. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. In AAAI.
- Shwartz, V.; West, P.; Bras, R. L.; Bhagavatula, C.; and Choi, Y. 2020. Unsupervised Commonsense Question Answering with Self-Talk. ArXiv abs/2004.05483.
- Speer, R.; Chin, J.; and Havasi, C. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In AAAI.
- Talmy, L. 1988. Force Dynamics in Language and Cognition. Cognitive Science 12: 49–100.
- Tamborrino, A.; Pellicano, N.; Pannier, B.; Voitot, P.; and Naudin, L. 2020. Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning. In ACL.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.
- Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. Cider: Consensus-based image description evaluation. In CVPR, 4566–4575.
- Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2019. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. ArXiv abs/1905.00537.
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Scao, T. L.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. M. 2019. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. ArXiv abs/1910.03771.
- Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; and Choi, Y. 2019. HellaSwag: Can a Machine Really Finish Your Sentence? In ACL.
- Zhang, H.; Khashabi, D.; Song, Y.; and Roth, D. 2020a. TransOMCS: From Linguistic Graphs to Commonsense Knowledge. In IJCAI.
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.; and Artzi, Y. 2020b. BERTScore: Evaluating Text Generation with BERT. ArXiv abs/1904.09675.
Appendix A: ATOMIC 2020 Details
- Social-Interaction Commonsense. Social-interaction relations comment on socially-triggered states and behaviors. Social commonsense is useful for gauging people’s intentions and purpose, and predicting situationally-relevant human reactions and behaviors. Following the definitions for ATOMIC relations (Sap et al. 2019), we identify a total of nine relations within this category.
- For a discussion of force dynamics in cognitive linguistic and lexical semantic literature cf. Herskovits (2009); Landau and Jackendoff (1991); Talmy (1988).