All NLP Tasks Are Generation Tasks: A General Pretraining Framework
There have been various types of pretraining architectures including autoregressive models (e.g., GPT), autoencoding models (e.g., BERT), and encoder-decoder models (e.g., T5). On the other hand, NLP tasks are different in nature, with three main categories being classification, unconditional generation, and conditional generation.
- Large-scale language models pre-trained on web texts have substantially advanced the state of the art in various NLP tasks, such as natural language understanding and text generation (Radford et al., 2018a; Devlin et al., 2019; Yang et al., 2019; Radford et al., 2018b; Raffel et al., 2020; Lewis et al., 2019; Brown et al., 2020).
- Existing pretraining frameworks can be categorized into three families: autoregressive models, autoencoding models, and encoder-decoder models
- Autoregressive models, such as GPT (Radford et al., 2018a), learn left-to-right language models
- While they have succeeded in long-text generation and shown strong few-shot learning ability when scaled to billions of parameters (Radford et al., 2018b; Brown et al., 2020), the inherent disadvantage is that the unidirectional attention mechanism cannot fully capture the interactions among context tokens
- Autoencoding models, such as BERT (Devlin et al., 2019), learn bidirectional Transformers as context encoders via denoising objectives
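The unidirectional-vs-bidirectional distinction drawn in the bullets above can be made concrete with attention masks. The sketch below (plain NumPy, not code from the paper) shows the two patterns: a causal mask lets each position see only its left context, while a bidirectional mask lets every position see the whole sequence.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Unidirectional (GPT-style) mask: position i attends to positions 0..i."""
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n: int) -> np.ndarray:
    """Bidirectional (BERT-style) mask: every position attends everywhere."""
    return np.ones((n, n), dtype=bool)

# For 4 tokens, the causal mask hides all future context from each position:
print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

The zeros above the diagonal are exactly the "interactions among context tokens" that a purely left-to-right model cannot capture.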
- To make our pre-training method better suited for text generation tasks, we study a multi-task pre-training setup, where the model is jointly trained to reconstruct masked spans and generate longer text
- We describe how the proposed model is finetuned for downstream natural language understanding and generation tasks
- We use the AdamW optimizer with peak learning rate 1e−5, warm-up over the first 6% training steps and a linear decay
- We show that NLU tasks can be formulated as conditional generation tasks and solved by autoregressive models
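The reformulation of NLU as conditional generation can be illustrated with a cloze-style template. The template and label words below are hypothetical examples for sentiment classification (the paper's actual patterns for the SuperGLUE tasks are listed in its Table 7):

```python
def to_cloze(sentence: str, label_words: dict) -> dict:
    """Recast a classification example as blank filling: the model generates
    the blank, and the generated label word determines the predicted class."""
    return {
        "input": f"{sentence} It was really [MASK].",
        "candidates": label_words,  # maps a verbalizer word to a class label
    }

ex = to_cloze("The movie kept me on the edge of my seat.",
              {"good": "positive", "bad": "negative"})
# A generative model scores each candidate word in the blank; the highest-
# probability word is mapped back to its class label.
```

This is why finetuning can reuse the pre-training objective directly: predicting the blank is the same autoregressive blank-filling task the model was trained on.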
- The authors describe the experiments in two different settings.
- The authors pre-train GLM with a single BERT-style objective and compare it with BERT-like models on NLU tasks.
- The authors show that the autoregressive blank filling pre-training, combined with the new formulation of classification tasks, can outperform finetuning bidirectional encoders with linear classifiers.
- The authors pre-train GLM with both the BERT-style objective and the generation objective.
- The authors show that GLM can effectively share model parameters for different tasks.
- The authors repeatedly sample new spans until more than 15% of the original tokens are masked.
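The span-masking rule quoted above ("repeatedly sample new spans until more than 15% of the original tokens are masked") can be sketched as a loop. The Poisson span-length distribution below is an illustrative choice of this sketch, not a detail stated in the bullets:

```python
import numpy as np

def sample_spans(seq_len: int, mask_ratio: float = 0.15,
                 lam: float = 3.0, seed: int = 0):
    """Repeatedly sample new spans until more than `mask_ratio` of the
    original tokens are masked (the stopping rule quoted above)."""
    rng = np.random.default_rng(seed)
    masked, spans = set(), []
    while len(masked) <= mask_ratio * seq_len:
        length = max(1, int(rng.poisson(lam)))              # span length
        start = int(rng.integers(0, max(1, seq_len - length)))
        spans.append((start, length))
        masked.update(range(start, min(start + length, seq_len)))
    return spans, masked

spans, masked = sample_spans(seq_len=512)
```

Because spans may overlap, the loop tracks the set of masked positions rather than summing span lengths, so the 15% threshold is measured on distinct tokens.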
- GLM Base scores 4.6% higher than BERT Base, and GLM Large scores 5.0% higher than BERT Large.
- The learning rate has a peak value of 3e-5, with warm-up over the first 6% of training steps followed by a linear decay.
- The authors observe that GLM Large achieves performance matching or better than seq2seq and unified pre-training models.
- With classifier finetuning, the base model fails to converge on the ReCoRD dataset, but the large model can achieve performance close to that of cloze-style finetuning
- GLM is a general pre-training framework for natural language understanding, generation and seq2seq.
- The authors show that NLU tasks can be formulated as conditional generation tasks and solved by autoregressive models.
- GLM unifies the pre-training objectives for different tasks as autoregressive blank filling, with a mixed attention mask and novel 2D position encodings.
- The authors show that GLM outperforms previous methods for NLU tasks and can effectively share parameters for different tasks.
- The authors hope to scale GLM to larger transformer models and more pre-training data, and examine its performance in more settings such as knowledge probing and few-shot learning
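The "2D position encodings" mentioned above can be sketched from the description in these bullets: each token carries two position ids, the first locating it in the corrupted input (all tokens generated for a span reuse the position of that span's mask slot), the second counting the position within the span. Special-token details in this sketch are assumptions, not quotes from the paper:

```python
def two_d_positions(part_a_len: int, mask_slots: list[int], span_lens: list[int]):
    """Sketch of 2D position ids for autoregressive blank filling.
    First dimension: position in the corrupted input; tokens of a generated
    span (plus its start token) share the position of that span's mask slot.
    Second dimension: 0 for the corrupted input, 1..len+1 inside a span."""
    pos1 = list(range(part_a_len))  # Part A: the corrupted input
    pos2 = [0] * part_a_len
    for slot, length in zip(mask_slots, span_lens):
        pos1 += [slot] * (length + 1)       # start token + span tokens
        pos2 += list(range(1, length + 2))  # intra-span positions
    return pos1, pos2

# One masked span of length 2 whose mask slot sits at position 2 of Part A:
p1, p2 = two_d_positions(part_a_len=5, mask_slots=[2], span_lens=[2])
print(p1)  # [0, 1, 2, 3, 4, 2, 2, 2]
print(p2)  # [0, 0, 0, 0, 0, 1, 2, 3]
```

Keeping the first dimension fixed inside a span means the model never learns the length of a masked span from its positions, which is what lets one architecture serve both understanding and variable-length generation.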
- Table 1: Summary of the pre-training frameworks. “Cond. Gen.” and “Uncond. Gen.” refer to conditional and unconditional text generation, respectively. “✓” means “is good at”, “—” means “could be adapted to”, and “×” means “cannot be directly applied to”. We define unconditional generation as the task of generating text without further training as in a standard language model, while conditional generation refers to seq2seq tasks such as text summarization
- Table 2: Results on the SuperGLUE dev set. Models with * are pre-trained for two times the number of steps of other methods
- Table 3: Results on Gigaword abstractive summarization
- Table 4: Zero-shot language modeling results
- Table 5: Ablation study on the SuperGLUE dev set
- Table 6: Hyperparameters for pretraining
- Table 7: Cloze questions and verbalizers for the 8 SuperGLUE tasks used in our experiments
- Pre-trained Language Models. In NLP, self-supervised learning has long been used to learn word vectors as inputs to neural networks (Mikolov et al., 2013; Pennington et al., 2014). Recently, pre-training large-scale language models with self-supervised learning on abundant web texts has significantly improved performance on downstream tasks.
There are three types of pre-trained language models. The first type is the autoencoding model, which learns a bidirectional contextualized encoder for natural language understanding via denoising objectives. BERT (Devlin et al., 2019) pre-trains a large Transformer model (Vaswani et al., 2017) via masked language modeling to obtain contextualized word representations. SpanBERT (Joshi et al., 2020) masks continuous spans of tokens for improved span representations. The second type is the autoregressive model, which learns a left-to-right language model for text generation. GPT (Radford et al., 2018a) shows that the representations learned by generative pre-training can also improve language understanding. XLNet (Yang et al., 2019) generalizes the autoregressive model with permutation language modeling to learn bidirectional attention for language understanding tasks. The third type is the encoder-decoder model pre-trained for seq2seq tasks. MASS (Song et al., 2019) maps an input text with continuous spans masked to the masked tokens. BART (Lewis et al., 2019) applies various transformations, including masking, deletion, and permutation, and recovers the original text with the decoder. PALM (Bi et al., 2020) is pre-trained to generate coherent text from given context and adds a BERT-based autoencoding objective to the encoder.
- Athiwaratkun, B., dos Santos, C., Krone, J., and Xiang, B. Augmented natural language for generative sequence labeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 375–385, 2020.
- Bao, H., Dong, L., Wei, F., Wang, W., Yang, N., Liu, X., Wang, Y., Gao, J., Piao, S., Zhou, M., and Hon, H. Unilmv2: Pseudo-masked language models for unified language model pre-training. In ICML 2020, volume 119, pp. 642–652, 2020.
- Bi, B., Li, C., Wu, C., Yan, M., Wang, W., Huang, S., Huang, F., and Si, L. PALM: Pre-training an Autoencoding&Autoregressive Language Model for Contextconditioned Generation. In EMNLP 2020, pp. 8681–8691, 2020.
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language Models are Few-Shot Learners. In NeurIPS 2020, 2020.
- Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2924–2936, 2019.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL 2019, pp. 4171–4186, 2019.
- Donahue, C., Lee, M., and Liang, P. Enabling language models to fill in the blanks. arXiv preprint arXiv:2005.05339, 2020.
- Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H. Unified language model pre-training for natural language understanding and generation. In NeurIPS 2019, pp. 13042–13054, 2019.
- Gokaslan, A. and Cohen, V. OpenWebTextCorpus, 2019.
- Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., and Levy, O. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Trans. Assoc. Comput. Linguistics, 8:64–77, 2020.
- Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., and Roth, D. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252–262, 2018.
- Levesque, H., Davis, E., and Morgenstern, L. The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. Citeseer, 2012.
- Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL 2020, pp. 7871–7880, 2019.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019.
- Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In ICLR 2019, 2019.
- Mackenzie, J., Benham, R., Petri, M., Trippas, J. R., Culpepper, J. S., and Moffat, A. CC-News-En: A Large English News Corpus. In CIKM 2020, pp. 3077–3084, 2020.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pp. 3111–3119, 2013.
- Paolini, G., Athiwaratkun, B., Krone, J., Ma, J., Achille, A., Anubhai, R., Santos, C. N. d., Xiang, B., and Soatto, S. Structured prediction as translation between augmented natural languages. arXiv preprint arXiv:2101.05779, 2021.
- Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernandez, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In ACL 2016, 2016.
- Pennington, J., Socher, R., and Manning, C. Glove: Global Vectors for Word Representation. In EMNLP 2014, pp. 1532–1543, 2014.
- Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L., and Hinton, G. E. Regularizing neural networks by penalizing confident output distributions. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings, 2017.
- Pilehvar, M. T. and Camacho-Collados, J. Wic: the word-incontext dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1267–1273, 2019.
- Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018a.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2018b.
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020.
- Roemmele, M., Bejan, C. A., and Gordon, A. S. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, pp. 90–95, 2011.
- Rush, A. M., Chopra, S., and Weston, J. A neural attention model for abstractive sentence summarization. In EMNLP 2015, pp. 379–389, 2015.
- Schick, T. and Schütze, H. It’s not just size that matters: Small language models are also few-shot learners. CoRR, abs/2009.07118, 2020a.
- Schick, T. and Schütze, H. Exploiting cloze questions for few-shot text classification and natural language inference. CoRR, abs/2001.07676, 2020b.
- Shen, T., Quach, V., Barzilay, R., and Jaakkola, T. Blank language models. arXiv preprint arXiv:2002.03079, 2020.
- Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multibillion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019.
- Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. MASS: Masked Sequence to Sequence Pre-training for Language Generation. In ICML 2019, volume 97, pp. 5926–5936, 2019.
- Trinh, T. H. and Le, Q. V. A Simple Method for Commonsense Reasoning. arXiv:1806.02847 [cs], 2019.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In NIPS 2017, pp. 5999–6009, 2017.
- Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In NeurIPS 2019, pp. 3261–3275, 2019.
- Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q. V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS 2019, pp. 5754–5764, 2019.
- Zhang, S., Liu, X., Liu, J., Gao, J., Duh, K., and Van Durme, B. Record: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885, 2018.
- Zhu, Y., Kiros, R., Zemel, R. S., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV 2015, pp. 19–27, 2015.
- To train GLM Base and GLM Large, we use the BookCorpus (Zhu et al., 2015) and Wikipedia data used by BERT (Devlin et al., 2019).
- To train GLM RoBERTa, we follow the pre-training datasets of RoBERTa (Liu et al., 2019), which consist of BookCorpus (Zhu et al., 2015) + Wikipedia (16GB), CC-News (the English portion of the CommonCrawl News dataset, 76GB), OpenWebText (web content extracted from URLs shared on Reddit with at least three upvotes (Gokaslan & Cohen, 2019), 38GB), and Stories (a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas (Trinh & Le, 2019), 31GB). The Stories dataset is no longer available.
- Therefore, we remove the Stories dataset and replace OpenWebText with OpenWebText2 (66GB). The CC-News dataset is not publicly available, so we use CC-News-En published by Mackenzie et al. (2020). All the datasets used total 158GB of uncompressed text, close in size to RoBERTa’s 160GB of data.
- (2) We use cosine decay instead of linear decay for learning rate scheduling. (3) We additionally apply gradient clipping with a value of 1.0.
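The schedule change in item (2) and the clipping in item (3) can be sketched as below. The peak learning rate and 6% warm-up fraction are carried over from the settings quoted earlier in this summary; the function names are placeholders of this sketch:

```python
import math

def lr_at(step: int, total_steps: int, peak: float = 1e-5,
          warmup_frac: float = 0.06) -> float:
    """Linear warm-up over the first 6% of steps, then cosine decay to zero
    (the scheduling change in item (2))."""
    warmup = max(1, int(warmup_frac * total_steps))
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads: list[float], max_norm: float = 1.0) -> list[float]:
    """Gradient clipping with value 1.0, as in item (3): rescale the whole
    gradient vector when its global norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / max(norm, 1e-12))
    return [g * scale for g in grads]
```

Compared with linear decay, the cosine curve keeps the learning rate near its peak longer after warm-up and flattens out near zero at the end of training.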