GPT Understands, Too

In Brief

We show that our P-tuning method can recover 64% (P@1) of world knowledge from a pre-trained language model without any additional text provided during test time.

Abstract

While GPTs with traditional fine-tuning fail to achieve strong results on natural language understanding (NLU), we show that GPTs can be better than or comparable to similar-sized BERTs on NLU tasks with a novel method P-tuning -- which employs trainable continuous prompt embeddings. On the knowledge probing (LAMA) benchmark, the best GPT recovers 64% (P@1) of world knowledge without any additional text provided during test time.

Introduction
  • Language model pre-training has been a successful approach for many natural language processing tasks (Brown et al, 2020).
  • Researchers have observed that GPT-style models perform poorly on NLU tasks with fine-tuning, and have assumed that they are inherently unsuitable for language understanding.
  • The success of GPT-3 (Brown et al, 2020) with handcrafted prompts suggests that giant unidirectional language models, together with appropriate manual prompts, may work for natural language understanding.
  • However, manual prompts are brittle: it is easy to create adversarial prompts that result in a substantial performance decrease.
  • In light of these problems, recent works have focused on automatically searching discrete prompts (Jiang et al, 2020b; Shin et al, 2020; Reynolds & McDonell, 2021; Gao et al, 2020) and demonstrated their effectiveness.
  • Since neural networks are inherently continuous, discrete prompts can be sub-optimal; this motivates searching for prompts in the continuous space instead.
Highlights
  • Language model pre-training has been a successful approach for many natural language processing tasks (Brown et al, 2020)
  • According to the training objectives, pre-trained language models can be divided into three categories: unidirectional language models (e.g., GPT (Radford et al, 2019)) for natural language generation (NLG), bidirectional language models (e.g., BERT (Devlin et al, 2018)) for natural language understanding (NLU), and hybrid language models that combine both paradigms
  • The success of GPT-3 with handcrafted prompts suggests that giant unidirectional language models, together with appropriate manual prompts, may work for natural language understanding
  • We show that GPTs can be as competitive as BERTs in natural language understanding with P-tuning, which can boost pre-trained language models’ performance
  • We show that our P-tuning method can recover 64% (P@1) of world knowledge from a pre-trained language model without any additional text provided during test time
  • With P-tuning, our method outperforms state-of-the-art methods on LAMA knowledge probing and few-shot SuperGLUE, which indicates that language models have grasped more world knowledge and prior-task knowledge during pre-training than we previously thought
  • On the SuperGLUE benchmark, P-tuning enables GPT-style models to achieve performance competitive with similar-sized BERTs in natural language understanding, which was previously assumed impossible
Methods
  • P-tuning applies only a noninvasive modification to the input.
  • It replaces part of the input embeddings of the pre-trained language model with differentiable, trainable prompt embeddings (a minimal sketch follows this list).
  • Given a pre-trained language model M, a sequence of discrete input tokens x_0, x_1, ..., x_n is mapped to input embeddings e(x_0), e(x_1), ..., e(x_n) by the pre-trained embedding layer e of M.
  • Conditioned on the context x, the authors use the output embeddings of a set of target tokens y for downstream processing.
  • In pre-training, x refers to the unmasked tokens while y refers to the [MASK] tokens; in sentence classification, x refers to the sentence tokens while y often refers to [CLS].
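Below is a minimal PyTorch sketch of the continuous-prompt idea described above; it is illustrative, not the authors' reference implementation. A small set of trainable prompt vectors, passed through a lightweight LSTM+MLP prompt encoder as the paper describes, is concatenated with the frozen token embeddings before they enter the language model, and only the prompt parameters receive gradients. Names such as ContinuousPrompt and build_inputs_embeds are invented for this sketch.

    import torch
    import torch.nn as nn

    class ContinuousPrompt(nn.Module):
        """Trainable pseudo-token embeddings with a small LSTM+MLP encoder."""
        def __init__(self, num_prompt_tokens: int, hidden_size: int):
            super().__init__()
            self.prompt_embeddings = nn.Embedding(num_prompt_tokens, hidden_size)
            self.lstm = nn.LSTM(hidden_size, hidden_size // 2,
                                batch_first=True, bidirectional=True)
            self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size),
                                     nn.ReLU(),
                                     nn.Linear(hidden_size, hidden_size))

        def forward(self, batch_size: int) -> torch.Tensor:
            idx = torch.arange(self.prompt_embeddings.num_embeddings)
            prompts = self.prompt_embeddings(idx).unsqueeze(0)   # (1, P, H)
            prompts = self.mlp(self.lstm(prompts)[0])            # (1, P, H)
            return prompts.expand(batch_size, -1, -1)            # (B, P, H)

    def build_inputs_embeds(frozen_word_embeddings: nn.Embedding,
                            input_ids: torch.Tensor,
                            prompt: ContinuousPrompt) -> torch.Tensor:
        """Prepend trained prompt vectors to the frozen token embeddings."""
        token_embeds = frozen_word_embeddings(input_ids)         # (B, T, H)
        prompt_embeds = prompt(input_ids.size(0))                # (B, P, H)
        return torch.cat([prompt_embeds, token_embeds], dim=1)   # (B, P+T, H)

    if __name__ == "__main__":
        hidden, vocab = 768, 30522
        frozen = nn.Embedding(vocab, hidden)        # stand-in for the layer e of M
        frozen.weight.requires_grad_(False)         # pre-trained weights stay fixed
        prompt = ContinuousPrompt(num_prompt_tokens=8, hidden_size=hidden)
        optimizer = torch.optim.Adam(prompt.parameters(), lr=1e-3)  # prompt params only
        ids = torch.randint(0, vocab, (2, 16))      # toy batch of token ids
        embeds = build_inputs_embeds(frozen, ids, prompt)
        print(embeds.shape)                         # torch.Size([2, 24, 768])

In practice the concatenated embeddings would be passed to the language model through an inputs_embeds-style interface and the loss computed from the logits at the [MASK] position; only prompt.parameters() are updated while the pre-trained model stays frozen.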
Results
  • P-tuning significantly pushes the boundary of knowledge probing, from 43.3% to 50.6% on LAMA-34k and from 45.2% to a maximum of 64.2% on LAMA-29k (Precision@1; see the sketch after this list).
  • This result strongly suggests that, merely by finding a better prompt and without any fine-tuning, language models capture far more knowledge than people previously believed.
  • Traditional knowledge probing does not allow changing the pre-trained model's parameters through fine-tuning.
  • The authors seek to evaluate how much knowledge language models have learned during pre-training.
  • An essential aspect of this work is to compare P-tuning and fine-tuning on unidirectional language models like GPT.
  • The authors are especially interested in the following question: are unidirectional and bidirectional language models gaining similar improvements from P-tuning?
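As a concrete reading of the Precision@1 numbers above, the sketch below shows how P@1 is typically computed for LAMA-style cloze probing: a query counts as correct only if the model's single most-probable token at the [MASK] position equals the gold answer. The function and tensor names are illustrative assumptions; obtaining the mask-position logits from a particular model is out of scope here.

    import torch

    def precision_at_1(mask_logits: torch.Tensor, gold_ids: torch.Tensor) -> float:
        """mask_logits: (N, vocab) scores at the [MASK] position for N queries;
        gold_ids: (N,) token ids of the gold answers."""
        top1 = mask_logits.argmax(dim=-1)          # highest-scoring token per query
        return (top1 == gold_ids).float().mean().item()

    # Toy example: 3 queries over a 5-token vocabulary; 2 of 3 are correct.
    logits = torch.tensor([[0.1, 0.9, 0.0, 0.0, 0.0],
                           [0.0, 0.0, 0.8, 0.1, 0.1],
                           [0.5, 0.2, 0.1, 0.1, 0.1]])
    gold = torch.tensor([1, 2, 3])
    print(round(precision_at_1(logits, gold), 3))  # 0.667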
Conclusion
  • The authors present P-tuning, which augments pre-trained models' ability in natural language understanding by automatically searching for better prompts in the continuous space.
  • The authors' P-tuning method relies less on a large validation set, suffers less from adversarial prompts, and alleviates over-fitting.
  • The authors show that the P-tuning method can recover 64% (P@1) of world knowledge from a pre-trained language model without any additional text provided during test time.
  • On the SuperGLUE benchmark, P-tuning enables GPT-style models to achieve performance competitive with similar-sized BERTs in natural language understanding, which was previously assumed impossible.
  • P-tuning also helps bidirectional models and outperforms state-of-the-art methods on the few-shot SuperGLUE benchmark.
  • This shows that language models capture more world knowledge and prior-task knowledge during pre-training than the authors previously thought.
Tables
  • Table 1: Case study on LAMA-TREx P17 with bert-base-cased. A single-word change in prompts can yield a drastic difference
  • Table 2: Knowledge probing Precision@1 on LAMA-34k (left) and LAMA-29k (right). P-tuning outperforms all the discrete prompt searching baselines. Interestingly, despite fixed pre-trained model parameters, P-tuning even outperforms the fine-tuned GPTs on LAMA-29k. (MP: manual prompt; FT: fine-tuning; MP+FT: manual-prompt-augmented fine-tuning; PT: P-tuning)
  • Table 3: Fully-supervised learning on the SuperGLUE dev set with base-scale models. MP refers to manual prompt. For a fair comparison, MP zero-shot and MP fine-tuning report results of a single pattern, while anchors for P-tuning are selected from the same prompt. Subscripts in red represent the advantages of GPT with P-tuning over the best results of BERT
  • Table 4: Fully-supervised learning on the SuperGLUE dev set with large-scale models. MP refers to manual prompt. For a fair comparison, MP zero-shot and MP fine-tuning report results of a single pattern, while anchors for P-tuning are selected from the same prompt. Subscripts in red represent the improvements of GPT with P-tuning over the best results of BERT
  • Table 5: Few-shot learning (32 training samples) on the SuperGLUE dev set. Previous few-shot learning approaches use the original full dev set (Ddev) for validation, which does not reflect a true few-shot setting. We construct a new dev set (Ddev32) with 32 unused samples from the original training set. Under this fair comparison, P-tuning significantly outperforms PET (Ddev32) and PET best (Ddev32) on all tasks. More interestingly, P-tuning even outperforms GPT-3, PET (Ddev), and iPET (Ddev) on 4 out of 7 tasks. Subscripts in red represent the improvements of P-tuning over PET (Ddev32)
  • Table 6: Few-shot performance comparison of different manual prompts and tuned prompts on the RTE task using albert-xxlarge-v2. Experiments use Ddev32 for model selection and hyper-parameter tuning and evaluate on Ddev. There is no obvious correlation between manual prompts and performance. Moreover, Ddev32 is not able to select the best manual prompts
Related Work
  • 5.1. Pre-trained Language Models

    The recent breakthrough in self-supervised pre-training (Liu et al, 2020) has boosted the development of natural language processing. GPT (Radford et al, 2019) first leverages the transformer architecture to pre-train on large-scale web texts. BERT (Devlin et al, 2018) proposes masked language modeling and creates the pre-train/fine-tuning paradigm. Later on, various kinds of language models have emerged, including XLNet (Yang et al, 2019), which introduces permutation language modeling; RoBERTa (Liu et al, 2019), which conducts detailed experiments to demonstrate useful techniques related to pre-training; and BART (Lewis et al, 2019), T5 (Raffel et al, 2019), and UniLM (Dong et al, 2019), which try to unify language understanding and generation.

References
  • Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019.
  • Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2924–2936, 2019a.
  • Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341, 2019b.
  • Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
  • Davison, J., Feldman, J., and Rush, A. M. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), pp. 1173–1178, 2019.
  • De Marneffe, M.-C., Simons, M., and Tonhauser, J. The commitmentbank: Investigating projection in naturally occurring discourse. In proceedings of Sinn und Bedeutung, volume 23, pp. 107–124, 2019.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H.-W. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197, 2019.
  • Hewitt, J. and Manning, C. D. A structural probe for finding syntax in word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). Association for Computational Linguistics, 2019.
  • Jiang, Z., Anastasopoulos, A., Araki, J., Ding, H., and Neubig, G. X-factr: Multilingual factual knowledge retrieval from pretrained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5943–5959, 2020a.
  • Jiang, Z., Xu, F. F., Araki, J., and Neubig, G. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438, 2020b.
  • Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., and Roth, D. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252–262, 2018.
  • Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
  • Levesque, H., Davis, E., and Morgenstern, L. The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. Citeseer, 2012.
  • Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
  • Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
  • Liu, J., Shen, D., Zhang, Y., Dolan, B., Carin, L., and Chen, W. What makes good in-context examples for gpt-3? arXiv preprint arXiv:2101.06804, 2021.
  • Liu, X., Zhang, F., Hou, Z., Wang, Z., Mian, L., Zhang, J., and Tang, J. Self-supervised learning: Generative or contrastive. arXiv preprint arXiv:2006.08218, 1(2), 2020.
  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., and Riedel, S. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.
  • Petroni, F., Lewis, P., Piktus, A., Rocktäschel, T., Wu, Y., Miller, A. H., and Riedel, S. How context affects language models’ factual predictions. arXiv preprint arXiv:2005.04611, 2020.
  • Pilehvar, M. T. and Camacho-Collados, J. Wic: 10, 000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121, 2018.
  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
  • Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot textto-image generation. arXiv preprint arXiv:2102.12092, 2021.
  • Reynolds, L. and McDonell, K. Prompt programming for large language models: Beyond the few-shot paradigm. arXiv preprint arXiv:2102.07350, 2021.
  • Roemmele, M., Bejan, C. A., and Gordon, A. S. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, pp. 90–95, 2011.
  • Schick, T. and Schütze, H. It’s not just size that matters: Small language models are also few-shot learners. arXiv preprint arXiv:2009.07118, 2020.
  • Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., and Singh, S. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
  • Vig, J. A multiscale visualization of attention in the transformer model. arXiv preprint arXiv:1906.05714, 2019.
  • Language Understanding Systems. In NeurIPS 2019, pp. 3261–3275, 2019a.
  • Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537, 2019b.
  • Wang, C., Liu, X., and Song, D. Language models are open knowledge graphs. arXiv preprint arXiv:2010.11967, 2020.
  • Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q. V. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
  • Yue, Z., Zhang, H., Sun, Q., and Hua, X.-S. Interventional few-shot learning. arXiv preprint arXiv:2009.13000, 2020.
  • Zhang, S., Liu, X., Liu, J., Gao, J., Duh, K., and Van Durme, B. Record: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885, 2018.
  • Zhao, T. Z., Wallace, E., Feng, S., Klein, D., and Singh, S. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690, 2021.