On Losses for Modern Language Models

EMNLP 2020, pp. 4970–4981


Abstract

BERT set many state-of-the-art results over varied NLU benchmarks by pre-training over two tasks: masked language modelling (MLM) and next sentence prediction (NSP), the latter of which has been highly criticized. In this paper, we 1) clarify NSP’s effect on BERT pre-training, 2) explore fourteen possible auxiliary pre-training tasks, of …

Introduction
Highlights
  • When Devlin et al (2018) released Bidirectional Encoder Representations from Transformers (BERT), a transformer network (Vaswani et al, 2017) trained using a ‘masked language model’ (MLM) task and a ‘next sentence prediction’ (NSP) task, it redefined the NLP landscape, establishing itself as the state-of-the-art (SoTA) on many natural language understanding (NLU) benchmarks, including GLUE (Wang et al, 2018), SQuAD (Rajpurkar et al, 2016), and SWAG (Zellers et al, 2018).

    Many models inspired by BERT have since surpassed its performance
  • We find σMLM = 0.198, σNSP = 0.222, and σCMTL+ = 0.273, and use the highest, σ = 0.273, as an estimate of the standard deviation across all experiments (see the sketch after this list).
  • The results show that the CMTL+ model – trained on masked language modelling (MLM), a Quick Thoughts variant (QT), sentence ordering (SO), and term frequency-inverse document frequency (TF-IDF) prediction in a continual multi-task learning framework – vastly outperforms the MLM baseline in every task
  • Our model trained on 32 billion tokens outperforms the original BERTBase, which required 137 billion tokens
  • We investigate and support several reasons why next-sentence prediction is ill-suited for BERT pretraining, we provide better inference-based alternatives, and we develop other novel auxiliary tasks based on word importance and soft clustering that provide substantial benefits to BERT pre-training
  • We demonstrate the benefit of multi-task learning in BERT pre-training, and identify key factors on how to best combine tasks
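
To make the bolding criterion concrete, here is a minimal sketch of the two-σ check, assuming the σ values come from repeated trainings as in Table 7; the per-run scores below are placeholders, not the paper's numbers:

```python
import statistics

# Placeholder values only: the paper derives its sigmas from the five trainings
# reported in Table 7; these lists are NOT the actual per-run GLUE averages.
runs = {
    "MLM":   [82.1, 82.4, 82.0, 82.5, 82.3],
    "NSP":   [81.0, 81.5, 81.2, 80.9, 81.4],
    "CMTL+": [83.0, 83.6, 83.2, 82.9, 83.5],
}

# Sample standard deviation for each configuration; the largest one is used
# as a conservative estimate for every experiment.
sigmas = {name: statistics.stdev(scores) for name, scores in runs.items()}
sigma = max(sigmas.values())

# An average GLUE score is boldfaced when it beats the MLM baseline by more
# than two estimated standard deviations.
baseline = statistics.mean(runs["MLM"])
threshold = baseline + 2 * sigma
print(f"sigma estimate = {sigma:.3f}; bold if average > {threshold:.2f}")
```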
Methods
  • The authors' primary motivation in this paper is to study and survey auxiliary pre-training tasks for multi-task learning in modern language understanding models.
  • Here, ‘modern’ refers to a transformer-based model pre-trained on a large unlabelled corpus using a form of masked language modelling (a rough masking sketch follows this list).
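
The page does not restate the masking procedure; as a rough sketch, assuming the standard BERT-style scheme of Devlin et al (2018), where roughly 15% of tokens are selected and, of those, 80% become [MASK], 10% a random token, and 10% stay unchanged:

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """BERT-style MLM corruption: returns (corrupted tokens, prediction targets)."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:                 # select ~15% of positions
            targets.append(tok)                      # model must recover the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)         # 80% of selected: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            targets.append(None)                     # position is not predicted
            corrupted.append(tok)
    return corrupted, targets
```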
Results
  • 4.1 Understanding NSP

    The authors present the results from an array of different tests.
  • The authors boldface all average GLUE scores that are more than two estimated standard deviations above the MLM baseline.
  • The authors train the baseline MLM model and CMTL+ model on 32 billion tokens and present the results using the GLUE and SuperGLUE evaluation servers in Tables 3 and 4 respectively.
  • While the authors include larger models – BERTLarge, RoBERTa, and T5 – in the tables for context, they remind readers that those results are not directly comparable to their own.
  • While the results are not comparable, the authors hope that the tasks used in their model can be adopted by newer and larger models to improve their understanding of language
Conclusion
  • The authors' results support several recent papers: they support the claim of Liu et al (2019), Yang et al (2019), and Joshi et al (2019) that NSP hinders BERT pre-training, especially for non-inference tasks, because it cuts the available context half of the time; they reinforce the proposal of Cheng et al (2019) and Wang et al (2020) that NSP is semantically shallow and often solvable through lexical overlap, and that a task requiring an understanding of the ordering of contiguous text provides a stronger semantic signal; and they uphold the idea of Sun et al (2019a,b) that a language model should be trained in a multi-task setting.
  • Providing a signal that relays word importance, such as TF-IDF or TF prediction, likewise provides a substantial benefit to BERT pre-training (a sketch of such a signal follows this list).
  • The authors demonstrate the value of multi-task learning for language model pre-training; combining multiple beneficial tasks leads to better results than using any of the individual tasks alone. They investigate and support several reasons why next sentence prediction is ill-suited for BERT pre-training, provide better inference-based alternatives, and develop other novel auxiliary tasks based on word importance and soft clustering that provide substantial benefits to BERT pre-training.
  • The authors hope the insights provided here will help guide the development of better language models in the future
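
As an illustration of such a word-importance signal, here is a minimal sketch of per-token TF-IDF weights; the paper's exact target formulation (for example, how the weights are discretized or regressed) is not restated on this page, so the function below is only a hypothetical example:

```python
import math
from collections import Counter

def tfidf_weights(segment_tokens, corpus_docs):
    """Per-token TF-IDF weights for one text segment.

    segment_tokens: list of tokens in the segment.
    corpus_docs: list of token lists used to estimate document frequencies.
    Uses one common smoothed IDF variant; the paper's formulation may differ.
    """
    tf = Counter(segment_tokens)
    n_docs = len(corpus_docs)
    doc_sets = [set(doc) for doc in corpus_docs]
    weights = {}
    for tok, count in tf.items():
        df = sum(tok in doc for doc in doc_sets)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[tok] = (count / len(segment_tokens)) * idf
    # Tokens that are frequent in the segment but rare across the corpus get
    # the highest weights, i.e. they are treated as the most important words.
    return weights
```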
Summary
  • Introduction:

    When Devlin et al (2018) released BERT, a transformer network (Vaswani et al, 2017) trained using a ‘masked language model’ (MLM) task and a ‘next sentence prediction’ (NSP) task, it redefined the NLP landscape, establishing itself as the state-of-the-art (SoTA) on many natural language understanding (NLU) benchmarks, including GLUE (Wang et al, 2018), SQuAD (Rajpurkar et al, 2016), and SWAG (Zellers et al, 2018).

    Many models inspired by BERT have since surpassed its performance.
  • In contrast to the original BERT paper, many obtained better results by excluding the NSP task
  • Some, such as XLNET (Yang et al, 2019) and RoBERTa (Liu et al, 2019), rely solely on an MLM variant, while others (Wang et al, 2020; Joshi et al, 2019; Cheng et al, 2019; Sun et al, 2019b) incorporate one or more different auxiliary loss functions.
  • Methods:

    The authors' primary motivation in this paper is to study and survey auxiliary pre-training tasks for multi-task learning in modern language understanding models.
  • Here, ‘modern’ refers to a transformer-based model pre-trained on a large unlabelled corpus using a form of masked language modelling.
  • Results:

    4.1 Understanding NSP

    The authors present the results from an array of different tests.
  • The authors boldface all average GLUE scores that are more than two estimated standard deviations above the MLM baseline.
  • The authors train the baseline MLM model and CMTL+ model on 32 billion tokens and present the results using the GLUE and SuperGLUE evaluation servers in Tables 3 and 4 respectively.
  • While the authors include larger models – BERTLarge, RoBERTa, and T5 – in the tables for context, they remind readers that those results are not directly comparable to their own.
  • While the results are not comparable, the authors hope that the tasks used in their model can be adopted by newer and larger models to improve their understanding of language
  • Conclusion:

    The authors' results support several recent papers: they support the claim of Liu et al (2019), Yang et al (2019), and Joshi et al (2019) that NSP hinders BERT pre-training, especially for non-inference tasks, because it cuts the available context half of the time; they reinforce the proposal of Cheng et al (2019) and Wang et al (2020) that NSP is semantically shallow and often solvable through lexical overlap, and that a task requiring an understanding of the ordering of contiguous text provides a stronger semantic signal; and they uphold the idea of Sun et al (2019a,b) that a language model should be trained in a multi-task setting.
  • Providing a signal that relays word importance, such as TF-IDF or TF prediction, likewise provides a substantial benefit to BERT pre-training.
  • The authors demonstrate the value of multi-task learning for language model pre-training; combining multiple beneficial tasks leads to better results than using any of the individual tasks alone. They investigate and support several reasons why next sentence prediction is ill-suited for BERT pre-training, provide better inference-based alternatives, and develop other novel auxiliary tasks based on word importance and soft clustering that provide substantial benefits to BERT pre-training.
  • The authors hope the insights provided here will help guide the development of better language models in the future
Tables
  • Table1: Test results on GLUE development set for models pre-trained on MLM (No Aux.) and MLM + auxiliary tasks trained over 10 billion tokens. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. Refer to section 3.2 for a description of each task. Best results in each column are underlined. Averages above two estimated σs of the MLM baseline are bolded
  • Table2: Results on GLUE development set for models pre-trained on MLM (our baseline), MLM + QT (best single auxiliary task model) and different combinations of the best performing tasks. Refer to section 3.3 for more detail. Best results in each column are underlined. Averages above two estimated σs of the MLM baseline are bolded
  • Table3: GLUE test results, scored by the evaluation server excluding the problematic WNLI task. Matched/mismatched accuracy are reported for MNLI, F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. The BERTBase results are from the original BERT paper (Devlin et al, 2018). The MLM baseline and CMTL+ models are our implementations. We include the performance of our models on the development set for reproducibility. Best results in each column for models of comparable size are underlined. For context, we additionally include results from the GLUE leaderboard for BERTLarge, RoBERTa, and T5, and their respective size measured by number of parameters. BERTBase, MLM baseline, and CMTL+ all have a size of 110M parameters
  • Table4: SuperGLUE test results, scored by the evaluation server. Both models use most common class prediction for ReCoRD and WSC. The MLM baseline also uses most common class prediction for MultiRC. Best results in each column for models of comparable size are underlined. For context, we additionally include results from the SuperGLUE leaderboard for BERTLarge and T5, and their respective size measured by number of parameters. CMTL+ and MLM baseline both have sizes of 110M parameters
  • Table5: Training using CMTL with 4 tasks over 200k total iterations. Example from Sun et al (2019b)
  • Table6: Training using CMTL with 3 tasks over 10B total iterations
  • Table7: Average GLUE score results on 5 different trainings
Related Work
  • As with most deep learning, language representations require large datasets. While corpora of labelled text exist, the vast majority of language data exists as raw, unlabelled text. Accordingly, many language embedding methods, and all those described below, rely solely on unsupervised or self-supervised tasks.

    2.1 Pre-transformer sentence embeddings

    Skip-Thoughts (Kiros et al, 2015) was the first deep learning sentence embedding model. Its training objective, inspired by word2vec (Mikolov et al, 2013), used RNNs to reconstruct the previous and next sentence from a given sentence. Like word2vec, similar sentences shared similar embeddings, and while it exhibited promising results, it was slow to train due to its encoding and double decoding of sentences through RNNs. Hill et al (2016)'s FastSent tried to follow the same sequential-sentence paradigm at a reduced training cost by encoding a sentence using a bag-of-words approach and maximizing the probability of words in adjacent sentences. Later, Quick Thoughts (Logeswaran and Lee, 2018) managed to maintain the sequential-sentences objective while still making use of word order. Using two RNN models, f(s) and g(s), they embedded a first set of sentences using f(s) and a second set consisting of the subsequent sentences using g(s). They jointly trained the two models to predict the consecutive sentence from a set of candidates by comparing inner products. This resembles a referential game (David, 1969) where f(s) and g(s) are the sender and receiver respectively; a sketch of this objective follows.
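
As a rough sketch of that scoring scheme (not the authors' implementation; the encoders f and g are omitted and simply assumed to produce fixed-size sentence embeddings):

```python
import torch
import torch.nn.functional as F

def quick_thoughts_loss(f_emb, g_emb):
    """Contrastive Quick Thoughts-style objective over a batch of sentence pairs.

    f_emb: (B, D) tensor of sentence embeddings from encoder f
    g_emb: (B, D) tensor of embeddings of each sentence's true successor, from encoder g
    The other B - 1 successors in the batch serve as negative candidates.
    """
    scores = f_emb @ g_emb.t()               # (B, B) inner products between all pairs
    targets = torch.arange(f_emb.size(0))    # the true successor lies on the diagonal
    return F.cross_entropy(scores, targets)
```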
Funding
  • Rudzicz is supported by a CIFAR Chair in Artificial Intelligence.
References
  • Siddhartha Brahma. 2018. Unsupervised learning of sentence representations using sequence consistency. arXiv preprint arXiv:1808.04217.
  • Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.
  • Xingyi Cheng, Weidi Xu, Kunlong Chen, Wei Wang, Bin Bi, Ming Yan, Chen Wu, Luo Si, Wei Chu, and Taifeng Wang. 2019. Symmetric Regularization based BERT for Pair-wise Semantic Reasoning. arXiv preprint arXiv:1909.03405.
  • Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
  • Lewis David. 1969. Convention: a philosophical study. Cambridge, Harvard University Press.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805.
  • Dan Hendrycks and Kevin Gimpel. 2016. Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. CoRR, abs/1606.08415.
  • Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483.
  • Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529.
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
  • Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. 2015. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv preprint arXiv:1606.05250.
  • Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. 2018. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.
  • Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019a. ERNIE: Enhanced Representation through Knowledge Integration. arXiv preprint arXiv:1904.09223.
  • Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2019b. ERNIE 2.0: A continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. arXiv preprint arXiv:1905.00537.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv preprint arXiv:1804.07461.
  • Alex Wang, Ian F. Tenney, Yada Pruksachatkun, Katherin Yu, Jan Hula, Patrick Xia, Raghu Pappagari, Shuning Jin, R. Thomas McCoy, Roma Patel, Yinghui Huang, Jason Phang, Edouard Grave, Haokun Liu, Najoung Kim, Phu Mon Htut, Thibault Févry, Berlin Chen, Nikita Nangia, Anhad Mohananey, Katharina Kann, Shikha Bordia, Nicolas Patry, David Benton, Ellie Pavlick, and Samuel R. Bowman. 2019b. jiant 1.2: A software toolkit for research on general-purpose text understanding models. http://jiant.info/.
  • Wei Wang, Bin Bi, Ming Yan, Chen Wu, Jiangnan Xia, Zuyi Bao, Liwei Peng, and Luo Si. 2020. StructBERT: Incorporating Language Structures into Pretraining for Deep Language Understanding. In International Conference on Learning Representations.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237.
  • Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. arXiv preprint arXiv:1808.05326.
  • Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In The IEEE International Conference on Computer Vision (ICCV).
  • 2. Calculate the token chunk size, C = T / (N × (N + 1)), where T is the total number of training tokens and N is the number of tasks.
  • 3. At each stage S_i, a new task is introduced. During that stage the new task is trained on C × (i + 1) tokens, each previously introduced task is trained on C tokens, and tasks not yet introduced are trained on 0 tokens. The method can count either iterations or tokens. This schedule trains each task on the same total number of tokens/iterations, gradually incorporating more tasks while still training on the previously introduced ones. Two examples are given (Tables 5 and 6): the first, from Sun et al (2019b), uses four tasks and 200k iterations; the second, from our final combined model, uses three tasks (MLM not included) and 10 billion tokens. A minimal sketch of this allocation follows.
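
To make the allocation concrete, a minimal sketch of the schedule under the definitions above (function and variable names are illustrative, not from the paper):

```python
def cmtl_schedule(total_tokens, n_tasks):
    """Per-stage token allocation for the continual multi-task learning schedule.

    Returns one dict per stage mapping task index (1-based) -> training tokens.
    """
    # Chunk size C = T / (N * (N + 1)); every task ends up with C * (N + 1) tokens.
    chunk = total_tokens / (n_tasks * (n_tasks + 1))
    stages = []
    for stage in range(1, n_tasks + 1):
        alloc = {}
        for task in range(1, n_tasks + 1):
            if task == stage:
                alloc[task] = chunk * (stage + 1)   # newly introduced task
            elif task < stage:
                alloc[task] = chunk                 # previously introduced tasks
            else:
                alloc[task] = 0                     # not yet introduced
        stages.append(alloc)
    return stages

# Four tasks over 200k iterations (the Sun et al (2019b) example, Table 5):
# C = 10k, so stage 1 trains task 1 on 20k; stage 2 trains task 2 on 30k and
# task 1 on 10k; ...; each task receives 50k iterations in total.
for i, alloc in enumerate(cmtl_schedule(200_000, 4), start=1):
    print(f"stage {i}: {alloc}")
```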
Authors
Stephane Aroca-Ouellette
Frank Rudzicz