Align, Mask and Select: A Simple Method for Incorporating Commonsense Knowledge into Language Representation Models


Abstract

Neural language representation models such as Bidirectional Encoder Representations from Transformers (BERT) pre-trained on large-scale corpora can well capture rich semantics from plain text, and can be fine-tuned to consistently improve the performance on various natural language processing (NLP) tasks. However, the existing pre-train…

Introduction
Highlights
  • Pre-trained language representation models, including feature-based methods (Pennington, Socher, and Manning 2014; Peters et al 2017) and fine-tuning methods (Howard and Ruder 2018; Radford et al 2018; Devlin et al 2018), can capture rich language information from text and benefit many natural language processing (NLP) tasks
  • Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al 2018), as one of the most recently developed models, has produced state-of-the-art results by simple fine-tuning on various NLP tasks, including named entity recognition (NER) (Sang and De Meulder 2003), text classification (Wang et al 2018), natural language inference (NLI) (Bowman et al 2015), and question answering (QA) (Rajpurkar et al 2016; Zellers et al 2018), and has achieved human-level performance on several datasets (Rajpurkar et al 2016; Zellers et al 2018)
  • We propose a pre-training approach for incorporating commonsense knowledge that includes a method to construct large-scale, natural language sentences. (Rajani et al 2019) collected the Common Sense Explanations (CoS-E) dataset using Amazon Mechanical Turk and applied a Commonsense Auto-Generated Explanations (CAGE) framework to language representation models, such as GPT and BERT
  • To investigate whether our multi-choice QA based pre-training approach degrades the performance on common sentence classification tasks, we evaluate the BERT CSbase and BERT CSlarge models on the GLUE datasets
  • We develop a pre-training approach for incorporating commonsense knowledge into language representation models such as BERT
  • Experimental results demonstrate that pre-training models using the proposed approach followed by fine-tuning achieves significant improvements on various commonsense-related tasks, such as CommonsenseQA and Winograd Schema Challenge, while maintaining comparable performance on other NLP tasks, such as sentence classification and natural language inference (NLI) tasks, compared to the original BERT models
Methods
  • The authors investigate the performance of fine-tuning the BERT CS models on several NLP tasks; a minimal sketch of the multi-choice question answering setup used here follows this list.
  • The authors conduct experiments on a commonsense-related multi-choice question answering benchmark, the CommonsenseQA dataset (Talmor et al 2018).
  • The CommonsenseQA dataset consists of 12,247 questions with one correct answer and four distractor answers.
  • This dataset consists of two splits: the question token split and the random split.
  • The statistics of the CommonsenseQA dataset are shown in Table 3
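The multi-choice question answering format referenced in the list above can be illustrated with a minimal sketch. This is not the authors' released code: it assumes a HuggingFace-style BertForMultipleChoice head, and the question and candidate strings are made up for illustration. Each (question, candidate) pair is encoded jointly, and a softmax over the candidates selects the answer.

```python
import torch
from transformers import BertForMultipleChoice, BertTokenizer

# Assumed setup (not the authors' code): pair the question with every candidate,
# encode the pairs jointly, and softmax over the candidates.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")
model.eval()

# Illustrative question and candidates in the five-choice CommonsenseQA style.
question = "Where would you expect to find many books and quiet readers?"
candidates = ["library", "kitchen", "stadium", "desert", "garage"]

# Encode (question, candidate) pairs; reshape to (batch=1, num_choices, seq_len).
enc = tokenizer([question] * len(candidates), candidates,
                padding=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_choices)

# The multiple-choice head here is freshly initialized; in the paper it is first
# pre-trained on the constructed samples and then fine-tuned on the target task.
print(candidates[logits.argmax(dim=-1).item()])
```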
Results
  • As shown in Table 9, the model BERT CSlarge achieves the same performance as BERTlarge on the CLOSE set and better performance on the FAR set.
Conclusion
  • The authors develop a pre-training approach for incorporating commonsense knowledge into language representation models such as BERT.
  • Experimental results demonstrate that pre-training models using the proposed approach followed by fine-tuning achieves significant improvements on various commonsense-related tasks, such as CommonsenseQA and Winograd Schema Challenge, while maintaining comparable performance on other NLP tasks, such as sentence classification and natural language inference (NLI) tasks, compared to the original BERT models.
  • The authors plan to incorporate commonsense knowledge into other language representation models such as XLNet (Yang et al 2019)
Summary
  • Introduction:

    As can be seen from the examples, it is easy for humans to answer the questions based on their knowledge about the world, but it is a great challenge for machines when there is limited training data.
Tables
  • Table1: Some examples from the CommonsenseQA dataset shown in part A and some related triples from ConceptNet shown in part B. The correct answers in part A are in boldface
  • Table2: The detailed procedures of constructing one multichoice question answering sample. The ∗ in the fourth step is a wildcard character. The correct answer for the question is underlined
  • Table3: The statistics of CommonsenseQA and Winograd Schema Challenge datasets
  • Table4: Accuracy (%) of different models on the CommonsenseQA test set
  • Table5: Accuracy (%) of different models on the Winograd Schema Challenge dataset together with its subsets and the WNLI test set. MTP denotes masked token prediction, which is employed in (Kocijan et al 2019). MCQA denotes the multi-choice question-answering format, which is employed in this paper; a conversion sketch follows the table list
  • Table6: The accuracy (%) of different models on the GLUE test sets. We report Matthews corr. on CoLA, Spearman corr. on STS-B, accuracy on MNLI, QNLI, SST-2 and RTE, F1-score on QQP and MRPC, which is the same as (Devlin et al 2018)
  • Table7: Accuracy (%) of different models on the CommonsenseQA development set. The source data and tasks are employed to pre-train BERT CS. MCQA denotes the multi-choice question answering task and MLM denotes the masked language modeling task
  • Table8: Several cases from the Winograd Schema Challenge dataset. The pronouns in questions are in square brackets. The correct candidates and correct decisions by models are in boldface
  • Table9: The accuracy (%) of different models on two partitions of WSC dataset
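For the Winograd Schema Challenge rows above, a rough sketch of casting a schema into the multi-choice QA format is given below. This is an illustration, not the authors' exact preprocessing: the bracketed pronoun (as in Table 8) is replaced by a blank and the candidate noun phrases become the answer options, so the same multi-choice scoring used for CommonsenseQA can be applied.

```python
def wsc_to_mcqa(sentence: str, pronoun: str, candidates: list) -> dict:
    """Hypothetical conversion of a Winograd Schema example into a
    multi-choice question: the bracketed pronoun becomes a blank and the
    candidate noun phrases become the answer options."""
    question = sentence.replace(f"[{pronoun}]", "[BLANK]", 1)
    return {"question": question, "candidates": candidates}

# Classic Winograd schema, with the pronoun bracketed as in Table 8.
example = wsc_to_mcqa(
    "The trophy does not fit into the suitcase because [it] is too large.",
    "it",
    ["the trophy", "the suitcase"],
)
print(example["question"])  # ... because [BLANK] is too large.
```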
Related work
  • 2.1 Language Representation Model

    Language representation models have demonstrated their effectiveness for improving many NLP tasks. These approaches can be categorized into feature-based approaches and fine-tuning approaches. The early Word2Vec (Mikolov et al 2013) and GloVe (Pennington, Socher, and Manning 2014) models focused on feature-based approaches that transform words into distributed representations. However, these methods cannot disambiguate word senses, since each word receives a single context-independent vector. (Peters et al 2018) further proposed Embeddings from Language Models (ELMo), which derives context-aware word vectors from a bidirectional LSTM trained with a coupled language model (LM) objective on a large text corpus.

    The fine-tuning approaches differ from the above-mentioned feature-based approaches, which only use the pre-trained language representations as input features. (Howard and Ruder 2018) pre-trained sentence encoders on unlabeled text and fine-tuned them for supervised downstream tasks. (Radford et al 2018) proposed the Generative Pre-trained Transformer (GPT), built on the Transformer architecture (Vaswani et al 2017), to learn language representations. (Devlin et al 2018) proposed BERT, a deep bidirectional model with multi-layer Transformers, which achieved state-of-the-art performance on a wide variety of NLP tasks. The advantage of these approaches is that few parameters need to be learned from scratch.
Funding
  • The authors would like to thank Lingling Jin, Pengfei Fan, and Xiaowei Lu for providing 16 NVIDIA V100 GPU cards
Study subjects and analysis
multi-choice question answering samples: 16,324,846
If there are more than four distractors, we randomly select four of them. After applying the AMS method (sketched below), we create 16,324,846 multi-choice question answering samples for pre-training BERT CS.
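A minimal sketch of the align, mask and select (AMS) construction is given below. It is an illustration only, assuming a ConceptNet triple, a plain-text corpus, and a lookup of same-relation concepts for distractors; the helper names and the example triple are hypothetical and do not come from the authors' code.

```python
import random

def align_mask_select(triple, sentences, same_relation_concepts):
    """Hypothetical AMS sketch:
    1) Align: find a sentence containing both concepts of a ConceptNet triple.
    2) Mask: replace the answer concept with a blank to form the question.
    3) Select: draw up to four distractors from concepts sharing the relation.
    """
    head, relation, tail = triple  # e.g. an illustrative triple ("bird", "CapableOf", "fly")

    # Align: keep sentences that mention both the head and tail concepts.
    aligned = [s for s in sentences if head in s and tail in s]
    if not aligned:
        return None
    sentence = aligned[0]

    # Mask: remove the answer concept (here the tail) to form the question.
    question = sentence.replace(tail, "[BLANK]", 1)

    # Select: distractors are other concepts linked to the head by the same
    # relation; if more than four exist, four are sampled at random.
    pool = [c for c in same_relation_concepts.get((head, relation), []) if c != tail]
    distractors = random.sample(pool, min(4, len(pool)))

    return {"question": question, "answer": tail, "distractors": distractors}
```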

GLUE datasets: 8
We evaluate the BERT CSbase and BERT CSlarge models on 8 GLUE datasets and compare their performance with that of the baseline BERT models. Following (Devlin et al 2018), we use a batch size of 32 and fine-tune for 3 epochs for all GLUE tasks, and select the fine-tuning learning rate (among 1e-5, 2e-5, and 3e-5) based on performance on the development set; a sketch of this selection loop follows.
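The learning-rate selection described above amounts to a small sweep over the three candidate rates, keeping the rate that scores best on the development set. The sketch below is a generic illustration under those settings; the train and dev-evaluation callables are placeholders, not functions from the paper.

```python
from typing import Callable, Iterable, Tuple

# Candidate fine-tuning learning rates reported above.
CANDIDATE_LRS: Tuple[float, ...] = (1e-5, 2e-5, 3e-5)

def select_learning_rate(
    train_fn: Callable[[float], object],      # lr -> fine-tuned model (batch size 32, 3 epochs inside)
    dev_eval_fn: Callable[[object], float],   # model -> development-set metric
    candidate_lrs: Iterable[float] = CANDIDATE_LRS,
) -> Tuple[float, float]:
    """Return the learning rate (and score) that performs best on the dev set."""
    best_lr, best_score = None, float("-inf")
    for lr in candidate_lrs:
        model = train_fn(lr)          # placeholder: fine-tune on the GLUE task's training split
        score = dev_eval_fn(model)    # placeholder: evaluate on that task's dev split
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, best_score
```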

Reference
  • Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
  • Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Emami, A.; De La Cruz, N.; Trischler, A.; Suleman, K.; and Cheung, J. C. K. 2018. A knowledge hunting framework for common sense reasoning. arXiv preprint arXiv:1810.01375.
  • Howard, J., and Ruder, S. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
  • Kocijan, V.; Cretu, A.-M.; Camburu, O.-M.; Yordanov, Y.; and Lukasiewicz, T. 2019. A surprisingly robust trick for the Winograd Schema Challenge. arXiv preprint arXiv:1905.06290.
  • Levesque, H.; Davis, E.; and Morgenstern, L. 2012. The Winograd Schema Challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
  • Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.
  • Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, 1003–1011. Association for Computational Linguistics.
  • Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
  • Peters, M. E.; Ammar, W.; Bhagavatula, C.; and Power, R. 2017. Semi-supervised sequence tagging with bidirectional language models. arXiv preprint arXiv:1705.00108.
  • Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openaiassets/research-covers/languageunsupervised/language understanding paper.pdf.
  • Rahman, A., and Ng, V. 2012. Resolving complex cases of definite pronouns: The Winograd Schema Challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 777–789. Association for Computational Linguistics.
  • Rajani, N. F.; McCann, B.; Xiong, C.; and Socher, R. 2019. Explain yourself! Leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361.
  • Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.
  • Riedel, S.; Yao, L.; and McCallum, A. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 148–163. Springer.
  • Ruan, Y.-P.; Zhu, X.; Ling, Z.-H.; Shi, Z.; Liu, Q.; and Wei, S. 2019. Exploring unsupervised pretraining and sentence structure modelling for Winograd Schema Challenge. arXiv preprint arXiv:1904.09705.
  • Sang, E. F., and De Meulder, F. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.
  • Speer, R.; Chin, J.; and Havasi, C. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Sun, K.; Yu, D.; Yu, D.; and Cardie, C. 2019. Probing prior knowledge needed in challenging Chinese machine reading comprehension. arXiv preprint arXiv:1904.09679.
  • Talmor, A.; Herzig, J.; Lourie, N.; and Berant, J. 2018. CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937.
  • Trichelair, P.; Emami, A.; Cheung, J. C. K.; Trischler, A.; Suleman, K.; and Diaz, F. 2018. On the evaluation of common-sense reasoning in natural language understanding. arXiv preprint arXiv:1811.01778.
  • Trinh, T. H., and Le, Q. V. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
  • Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
  • Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; and Le, Q. V. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • Zellers, R.; Bisk, Y.; Schwartz, R.; and Choi, Y. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. arXiv preprint arXiv:1808.05326.
  • Zhong, W.; Tang, D.; Duan, N.; Zhou, M.; Wang, J.; and Yin, J. 2018. Improving question answering by commonsense-based pre-training. arXiv preprint arXiv:1809.03568.