AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts

Taylor Shin
Yasaman Razeghi
Robert L Logan IV

EMNLP 2020.


Abstract:

The remarkable success of pretrained language models has motivated the study of what kinds of knowledge these models learn during pretraining. Reformulating tasks as fill-in-the-blanks problems (e.g., cloze tests) is a natural approach for gauging such knowledge; however, its usage is limited by the manual effort and guesswork required to write suitable prompts.
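To make the cloze formulation concrete, the following is a minimal sketch of fill-in-the-blank probing with a pretrained masked language model via the Hugging Face fill-mask pipeline. The model choice and the manually written prompt are illustrative assumptions, not the paper's tuned prompts.

```python
# A minimal sketch of the fill-in-the-blank (cloze) probing idea from the
# abstract, using the Hugging Face fill-mask pipeline. The model and the
# manually written prompt are illustrative assumptions, not the paper's prompts.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")
for prediction in fill("The native language of Albert Einstein is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```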

Introduction
  • Pretrained language models (LMs) have had exceptional success when adapted to downstream tasks via finetuning (Peters et al., 2018; Devlin et al., 2019).
  • Probing classifiers require additional learned parameters and are susceptible to false positives; high probing accuracy is not a sufficient condition to conclude that an LM contains a certain piece of knowledge (Hewitt and Liang, 2019; Voita and Titov, 2020)
  • Attention visualization, another common technique, has a similar failure mode: attention scores may be correlated with, but not caused by the underlying target knowledge, leading to criticism against their use as explanations (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019).
Highlights
  • Pretrained language models (LMs) have had exceptional success when adapted to downstream tasks via finetuning (Peters et al., 2018; Devlin et al., 2019)
  • We test linear probes— linear classifiers trained on top of frozen masked language models (MLMs) representations with average pooling —and find AUTOPROMPT has comparable or higher accuracy, despite linear probes being susceptible to false positives
  • Table 4 shows the performance of MLMs with different prompting methods, and we show qualitative examples in Table 3 and in Appendix C
  • Using 7 trigger tokens achieves slightly higher scores than using 5, but the difference is not substantial. This indicates that our approach is robust to the choice of trigger length, which is consistent with our sentiment analysis results
  • We introduce AUTOPROMPT, an approach to develop automatically-constructed prompts that elicit knowledge from pretrained MLMs for a variety of tasks (a minimal sketch of the trigger-token search follows this list)
  • Although we focus only on masked language models in this paper, our method can be trivially extended to standard language models, and may be useful for constructing inputs for models like GPT-3 (Brown et al., 2020)
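As referenced in the highlights above, the following is a minimal, illustrative sketch of the kind of gradient-guided (HotFlip-style) trigger-token search AUTOPROMPT uses to explore the discrete prompt space: candidate replacements for a trigger position are ranked with a first-order approximation of the change in loss and then re-scored exactly. This is not the authors' released implementation; the model, template layout, example sentence, and label token are assumptions for illustration.

```python
# Illustrative sketch of a gradient-guided trigger search (not the authors' code).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
model.eval()
embeddings = model.get_input_embeddings().weight  # (|V|, d)

def build_ids(sentence, trigger_ids):
    """Assemble [CLS] sentence [T]...[T] [MASK] [SEP] and track positions."""
    sent = tok.encode(sentence, add_special_tokens=False)
    ids = [tok.cls_token_id] + sent + trigger_ids + [tok.mask_token_id, tok.sep_token_id]
    trig_pos = list(range(1 + len(sent), 1 + len(sent) + len(trigger_ids)))
    mask_pos = 1 + len(sent) + len(trigger_ids)
    return torch.tensor([ids]), trig_pos, mask_pos

def nll(input_ids, mask_pos, label_id):
    """Negative log-likelihood of the label token at the [MASK] position."""
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits
    return -logits[0, mask_pos].log_softmax(-1)[label_id].item()

def flip_one_trigger(sentence, trigger_ids, position, label_id, k=10):
    """Try to improve one trigger position with a HotFlip-style update."""
    input_ids, trig_pos, mask_pos = build_ids(sentence, trigger_ids)
    embeds = embeddings[input_ids].clone().detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits
    loss = -logits[0, mask_pos].log_softmax(-1)[label_id]
    loss.backward()
    grad = embeds.grad[0, trig_pos[position]]  # gradient w.r.t. that trigger embedding
    # First-order estimate: swapping in token w changes the loss by roughly
    # e_w . grad (up to a constant), so promising candidates minimise this score.
    candidates = torch.topk(-(embeddings.detach() @ grad), k).indices.tolist()
    best, best_loss = list(trigger_ids), loss.item()
    for cand in candidates:  # exact re-scoring of each candidate
        trial = list(trigger_ids)
        trial[position] = cand
        ids, _, m = build_ids(sentence, trial)
        trial_loss = nll(ids, m, label_id)
        if trial_loss < best_loss:
            best, best_loss = trial, trial_loss
    return best, best_loss

# Toy usage: push the MLM toward predicting an assumed positive label token.
triggers = [tok.mask_token_id] * 3                 # initial trigger tokens
label_id = tok.convert_tokens_to_ids("great")      # assumed label token
triggers, loss = flip_one_trigger("a visually stunning film", triggers, 0, label_id)
print(tok.convert_ids_to_tokens(triggers), loss)
```

In practice this step would be repeated over all trigger positions and many training examples; the sketch only shows a single update to convey the candidate-selection idea.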
Methods
  • Example prompts for fact retrieval, listing for each relation the manual prompt, the AUTOPROMPT prompt for BERT, and the AUTOPROMPT prompt for RoBERTa (Manual | AUTOPROMPT BERT | AUTOPROMPT RoBERTa):
  • [X] works in the field of [Y] | [X] probability earliest fame totaled studying [Y] | [X] 1830 dissertation applying mathsucci [Y]
  • The native language of [X] is [Y] | [X]PA communerug speaks proper [Y] | [X]neau optionally fluent!? ̈traditional [Y]
  • [X] is a [Y] by profession | [X] supporters studied politicians musician turned [Y] | [X] (), astronomers businessman·former [Y]
  • [X] is owned by [Y] | [X] is hindwings mainline architecture within [Y] | [X] picThom unwillingness officially governs [Y]
  • [X] plays [Y] | [X] playingdrum concertoative electric [Y] | [X]Trump learned soloKeefe classical [Y]
  • [X] plays [Y] music | [X] freaking genre orchestra fiction acid [Y] | [X] blends postwar hostage drama sax [Y]
  • [X] is the capital of [Y] | [X] boasts native territory traditionally called [Y] | [X] limestone depositedati boroughDepending [Y]
  • [X] is developed by [Y] | [X] is memory arcade branding by [Y] | [X] 1987 floppy simulator users sued [Y]
  • [X] died in [Y] | [X] reorganizationotype photographic studio in [Y] | [X].. enigmatic twentieth nowadays near [Y]
  • [X] is [Y] citizen | [X] m3 badminton pieces internationally representing [Y] | [X] offic organise forests statutes northwestern [Y]
  • [X] is located in [Y] | [X] consists kilograms centred neighborhoods in [Y] | [X] manoeuv constructs whistleblowers hills near [Y]
  • [X] is a subclass of [Y] | [X] isı adequately termed coated [Y] | [X],formerly prayers unstaceous [Y]
  • The official language of [X] is [Y] | [X]inen dialects resembled officially exclusively [Y] | [X]onen tribes descending speak mainly [Y]
  • [X] was written in [Y] | [X] playedicevery dialect but [Y] | [X] scaven pronunciation.*Wikipedia speaks [Y]
  • [X] plays in [Y] position | [X] played colors skier ↔ defensive [Y] | [X],” (), ex-,Liverpool [Y]
Results
  • The authors show results in Table 1, along with reference scores from the GLUE (Wang et al., 2019) SST-2 leaderboard, and scores for a linear probe trained over the elementwise average of the LM token representations.
  • The authors test linear probes (linear classifiers trained on top of frozen MLM representations with average pooling) and find AUTOPROMPT has comparable or higher accuracy, despite linear probes being susceptible to false positives (a minimal probe sketch follows this list).
  • Overall, these results demonstrate that both BERT and RoBERTa have some inherent knowledge of natural language inference.
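For comparison with the linear-probe baseline mentioned above, here is a minimal sketch of such a probe: a logistic-regression classifier trained on frozen, average-pooled MLM token representations. The model name, toy data, and the scikit-learn classifier are illustrative assumptions.

```python
# A minimal sketch of a linear probe over frozen, average-pooled BERT representations.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-cased")
encoder = AutoModel.from_pretrained("bert-base-cased")
encoder.eval()

@torch.no_grad()
def average_pooled(sentences):
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, L, d), frozen encoder
    mask = batch["attention_mask"].unsqueeze(-1).float()   # ignore padding positions
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy data; in the paper the probe is trained on the task's training set.
train_x = average_pooled(["a gorgeous, witty film", "a dull, plodding mess"])
train_y = [1, 0]
probe = LogisticRegression().fit(train_x, train_y)
print(probe.predict(average_pooled(["an unexpectedly moving story"])))
```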
Conclusion
  • Prompting as an alternative to finetuning: the goal of prompting a language model is to probe the knowledge that the model acquired from pretraining.
  • Due to the greedy search over the large discrete space of phrases, AUTOPROMPT is sometimes brittle; the authors leave more effective prompt-crafting techniques as future work. In this paper, the authors introduce AUTOPROMPT, an approach to develop automatically-constructed prompts that elicit knowledge from pretrained MLMs for a variety of tasks.
  • The authors show that these prompts outperform manual prompts while requiring less human effort.
  • Source code and datasets to reproduce the results in this paper are available at http://ucinlp.github.io/autoprompt
Objectives
  • Since the goal is to extract the object of relation triplets, rather than the relation itself, the authors tweak the standard RE evaluation.
Tables
  • Table1: Sentiment Analysis performance on the SST-2 test set of supervised classifiers (top) and fill-in-the-blank MLMs (bottom). Scores marked with † are from the GLUE leaderboard: http://gluebenchmark.com/leaderboard
  • Table2: Natural Language Inference performance on the SICK-E test set and variants. (Top) Baseline classifiers. (Bottom) Fill-in-the-blank MLMs
  • Table3: Example prompts by AUTOPROMPT for each task. On the left, we show the prompt template, which combines the input, a number of trigger tokens [T], and a prediction token [P]. For classification tasks (sentiment analysis and NLI), we make predictions by summing the model's probability for a number of automatically selected label tokens (a sketch of this scoring follows the table list). For fact retrieval and relation extraction, we take the most likely token predicted by the model
  • Table4: Factual Retrieval: On the left, we evaluate BERT on fact retrieval using the Original LAMA dataset from Petroni et al. (2019). For all three metrics (mean reciprocal rank, mean precision-at-10 (P@10), and mean precision-at-1 (P@1)), AUTOPROMPT significantly outperforms past prompting methods. We also report results on a T-REx version of the data (see text for details). On the right, we compare BERT versus RoBERTa on a subset of the LAMA data using AUTOPROMPT with 5 tokens
  • Table5: Relation Extraction: We use prompts to test pretrained MLMs on relation extraction. Compared to a state-of-the-art LSTM model from 2017, MLMs have higher mean precision-at-1 (P@1), especially when using prompts from AUTOPROMPT. We also test models on sentences that have been edited to contain incorrect facts. The accuracy of MLMs drops significantly on these sentences, indicating that their high performance stems from their factual knowledge
  • Table6: A breakdown of all relations for fact retrieval on the original dataset from Petroni et al. (2019). We compare P@1 of prompts generated by LAMA, LPAQA, and our approach using five prompt tokens
  • Table7: Examples of manual prompts (first line, shown with BERT’s P@1) and prompts generated via AUTOPROMPT for Fact Retrieval
  • Table8: Examples of prompts generated using AUTOPROMPT for relation extraction. Underlined words represent the gold object. The bottom half of the Table shows examples of our augmented evaluation where the original objects (represented by crossed-out words) are replaced by new objects
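As referenced in the Table 3 entry above, the following minimal sketch shows label-token scoring for classification: class scores are obtained by summing the MLM's probability mass over a small set of label tokens at the prediction ([MASK]) position. The prompt and label-token sets here are illustrative assumptions, not the automatically selected sets from the paper.

```python
# A minimal sketch of summing MLM probabilities over label tokens for classification.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

# Assumed label-token sets for illustration only.
label_tokens = {"positive": ["great", "wonderful"], "negative": ["terrible", "awful"]}

def classify(prompt):
    batch = tok(prompt, return_tensors="pt")
    mask_pos = (batch["input_ids"] == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        probs = model(**batch).logits[0, mask_pos].softmax(-1)
    # Sum the probability mass assigned to each class's label tokens.
    return {
        label: sum(probs[tok.convert_tokens_to_ids(t)].item() for t in toks)
        for label, toks in label_tokens.items()
    }

print(classify("this movie was a joy to watch. overall the movie felt [MASK]."))
```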
Funding
  • This material is based upon work sponsored by the DARPA MCS program under Contract No. N660011924033 with the United States Office of Naval Research.
Study subjects and analysis
pairs: 10000
NLI is crucial in many tasks such as reading comprehension and commonsense reasoning (Bowman et al., 2015), and it is used as a common benchmark for language understanding. Setup: We use the entailment task from the SICK dataset (Marelli et al., 2014, SICK-E), which consists of around 10,000 pairs of human-annotated sentences labeled as entailment, contradiction, and neutral. The standard dataset is biased toward the neutral class, which represents 56.7% of instances.

samples: 1000
To collect training data for AUTOPROMPT, we gather at most 1000 facts for each of the 41 relations in LAMA from the T-REx dataset (ElSahar et al., 2018). For the relations that still have fewer than 1000 samples, we gather extra facts directly from Wikidata. We ensure that none of the T-REx triples are present in the test set, and we split the data 80-20 into train and development sets (a minimal data-preparation sketch follows).
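A minimal sketch of the data preparation just described: cap the facts per relation, drop any triple that appears in the test set, and split 80-20 into train and development sets. The triple representation and helper name are hypothetical.

```python
# Hypothetical helper illustrating the per-relation cap, test-set filtering,
# and 80-20 train/dev split described above.
import random

def prepare_relation(facts, test_triples, cap=1000, train_frac=0.8, seed=0):
    facts = [f for f in facts if f not in test_triples][:cap]
    random.Random(seed).shuffle(facts)
    cut = int(train_frac * len(facts))
    return facts[:cut], facts[cut:]

train, dev = prepare_relation(
    facts=[("Paris", "capital-of", "France"),
           ("Lyon", "located-in", "France"),
           ("Berlin", "capital-of", "Germany")],
    test_triples={("Paris", "capital-of", "France")},
)
print(len(train), len(dev))
```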

Reference
  • Zied Bouraoui, Jose Camacho-Collados, and Steven Schockaert. 2019. Inducing relational knowledge from BERT. In AAAI.
  • Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
  • Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report.
  • Timo Schick and Hinrich Schutze. 2020. Exploiting cloze questions for few-shot text classification and natural language inference. arXiv preprint arXiv:2001.07676.
  • Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Unsupervised commonsense question answering with self-talk. arXiv preprint arXiv:2004.05483.
  • Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
  • Daniil Sorokin and Iryna Gurevych. 2017. Context-aware representations for knowledge base relation extraction. In EMNLP.
  • Trieu H Trinh and Quoc V Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
  • Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In EMNLP.
  • Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In EMNLP.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR.
  • Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In EMNLP.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.