RoBERTa: A Robustly Optimized BERT Pretraining Approach

Abstract:

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

Introduction
  • Self-training methods such as ELMo (Peters et al, 2018), GPT (Radford et al, 2018), BERT (Devlin et al, 2019), XLM (Lample & Conneau, 2019), and XLNet (Yang et al, 2019) have brought significant performance gains, but it can be challenging to determine which aspects of the methods contribute the most.
  • The authors present a replication study of BERT pretraining (Devlin et al, 2019), which includes a careful evaluation of the effects of hyperparameter tuning and training set size.
  • Setup: BERT (Devlin et al., 2019) takes as input a concatenation of two segments (sequences of tokens), x1, ..., xN and y1, ..., yM.
  • The two segments are presented as a single input sequence to BERT with special tokens delimiting them: [CLS], x1, ..., xN, [SEP], y1, ..., yM, [EOS] (a minimal construction sketch follows this list).
  • Each transformer block has A self-attention heads and hidden dimension H.
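To make the input layout concrete, here is a minimal sketch of how two segments are concatenated into a single delimited sequence. The `tokenize` helper is a hypothetical placeholder (the real models use WordPiece or byte-level BPE subword tokenization); only the special-token layout follows the description above.

```python
# Minimal sketch of the BERT-style input layout described above.
# `tokenize` is a hypothetical stand-in for a real subword tokenizer.
from typing import List

CLS, SEP, EOS = "[CLS]", "[SEP]", "[EOS]"

def tokenize(text: str) -> List[str]:
    # Placeholder: whitespace split instead of WordPiece / byte-level BPE.
    return text.split()

def build_input(segment_x: str, segment_y: str, max_len: int = 512) -> List[str]:
    """Concatenate two segments into one sequence:
    [CLS], x1..xN, [SEP], y1..yM, [EOS], truncated to max_len tokens."""
    x, y = tokenize(segment_x), tokenize(segment_y)
    return ([CLS] + x + [SEP] + y + [EOS])[:max_len]

print(build_input("the cat sat", "on the mat"))
# ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'on', 'the', 'mat', '[EOS]']
```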
Highlights
  • Self-training methods such as ELMo (Peters et al, 2018), GPT (Radford et al, 2018), BERT (Devlin et al, 2019), XLM (Lample & Conneau, 2019), and XLNet (Yang et al, 2019) have brought significant performance gains, but it can be challenging to determine which aspects of the methods contribute the most
  • We find that BERT was significantly undertrained and propose an improved training recipe, which we call RoBERTa, that can match or exceed the performance of all of the post-BERT methods. The modifications are: (1) training the model longer, with bigger batches, over more data; (2) removing the next sentence prediction objective; (3) training on longer sequences; and (4) dynamically changing the masking pattern applied to the training data.
  • The contributions of this paper are: (1) We present a set of important BERT design choices and training strategies and introduce alternatives that lead to better downstream task performance; (2) We use a novel dataset, CC-NEWS, and confirm that using more data for pretraining further improves performance on downstream tasks; (3) Our training improvements show that masked language model pretraining, under the right design choices, is competitive with all other recently published methods.
  • In the second setting, we submit RoBERTa to the General Language Understanding Evaluation leaderboard and achieve state-of-the-art results on 4 out of 9 tasks and the highest average score to date
  • On the Stanford Question Answering Dataset v1.1 development set, RoBERTa matches the state-of-the-art set by XLNet
  • We evaluate a number of design decisions when pretraining BERT models, demonstrating that performance can be substantially improved by training the model longer, with bigger batches, over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. The sketch below contrasts the two recipes.
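As a rough summary of the recipe changes listed above, the following sketch contrasts the original BERT pretraining settings with RoBERTa's, using figures reported in the paper and in the tables below. It is a plain description for orientation, not a runnable training configuration, and the exact values should be checked against the paper.

```python
# Illustrative contrast of the two pretraining recipes, drawn from the paper
# and the tables on this page; values are descriptive strings, not a config
# consumed by any particular training framework.
BERT_LARGE_RECIPE = {
    "data": "BOOKCORPUS + English WIKIPEDIA (~16GB of text)",
    "batch_size": "256 sequences",
    "train_steps": "1M",
    "masking": "static (applied once during preprocessing)",
    "nsp_loss": "yes",
    "input_format": "segment pairs for next sentence prediction",
}

ROBERTA_RECIPE = {
    "data": "BOOKS + WIKI + CC-NEWS + OPENWEBTEXT + STORIES (~160GB of text)",
    "batch_size": "8K sequences (learning rate tuned accordingly)",
    "train_steps": "up to 500K (100K -> 300K -> 500K in Table 3)",
    "masking": "dynamic (regenerated every time a sequence is seen)",
    "nsp_loss": "no",
    "input_format": "FULL-SENTENCES / DOC-SENTENCES packed to 512 tokens",
}

for key in BERT_LARGE_RECIPE:
    print(f"{key:13} | BERT:    {BERT_LARGE_RECIPE[key]}")
    print(f"{'':13} | RoBERTa: {ROBERTA_RECIPE[key]}")
```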
Methods
  • 3.1 IMPLEMENTATION

    The authors reimplement BERT in FAIRSEQ (Ott et al., 2019), primarily following the original BERT optimization hyperparameters given in Section 2, except for the peak learning rate and number of warmup steps, which are tuned separately for each setting.
  • The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask.
  • The authors instead train with dynamic masking, generating the masking pattern on-the-fly each time a sequence is fed to the model.
  • This becomes crucial when pretraining for more steps or with larger datasets, and performs marginally better than static masking on some downstream tasks
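Below is a minimal sketch of the distinction between static and dynamic masking, under simplifying assumptions (a fixed 15% masking probability, no 80/10/10 replacement split, whole tokens only). It only illustrates when the mask pattern is chosen, not the full masked-LM objective.

```python
# Static vs. dynamic masking, heavily simplified.
import random

MASK = "[MASK]"

def mask_tokens(tokens, prob=0.15):
    """Independently replace each token with [MASK] with probability `prob`."""
    return [MASK if random.random() < prob else tok for tok in tokens]

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["roberta", "is", "a", "robustly", "optimized", "bert"]]

# Static masking (original BERT): choose the mask pattern once, during
# preprocessing, and reuse the same masked copies in every epoch.
static_corpus = [mask_tokens(seq) for seq in corpus]
for epoch in range(3):
    for masked in static_corpus:
        pass  # the model sees the identical pattern every epoch

# Dynamic masking (RoBERTa): choose a fresh pattern each time a sequence is
# fed to the model, so repeated epochs see different masked positions.
for epoch in range(3):
    for seq in corpus:
        masked = mask_tokens(seq)  # new pattern on-the-fly
```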
Results
  • The authors compare training without the NSP loss and training with blocks of text drawn from a single document (DOC-SENTENCES); a minimal packing sketch follows this list.
  • The authors find that this setting outperforms the originally published BERTBASE results and that removing the NSP loss matches or slightly improves downstream task performance, in contrast to Devlin et al. (2019).
  • RoBERTa achieves state-of-the-art results on the development and test sets for BoolQ, CB, COPA, MultiRC and ReCoRD and the highest average score to date on the SuperGLUE leaderboard.
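The packing sketch referenced above: a hedged illustration of DOC-SENTENCES-style inputs, where contiguous sentences from a single document are greedily packed into inputs of at most 512 tokens and no next-sentence-prediction pair is built. Tokenization, special tokens, and truncation of over-long sentences are omitted.

```python
# DOC-SENTENCES-style packing, simplified: inputs never cross document
# boundaries and no sentence-pair (NSP) examples are constructed.
from typing import List

def pack_doc_sentences(doc_sentences: List[List[str]],
                       max_len: int = 512) -> List[List[str]]:
    """Greedily pack consecutive sentences of one document into inputs
    of at most max_len tokens (over-long sentences are not truncated)."""
    inputs, current = [], []
    for sent in doc_sentences:
        if current and len(current) + len(sent) > max_len:
            inputs.append(current)
            current = []
        current = current + sent
    if current:
        inputs.append(current)
    return inputs

doc = [["sentence", "one", "."], ["sentence", "two", "."], ["sentence", "three", "."]]
print(pack_doc_sentences(doc, max_len=7))
# [['sentence', 'one', '.', 'sentence', 'two', '.'], ['sentence', 'three', '.']]
```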
Conclusion
  • The authors evaluate a number of design decisions when pretraining BERT models, demonstrating that performance can be substantially improved by training the model longer, with bigger batches, over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data.
  • The authors use a novel dataset, CC-NEWS, and release the models and code for pretraining and finetuning at: anonymous URL.
  • The authors' improved pretraining procedure, which the authors call RoBERTa, achieves state-of-the-art results on GLUE, RACE, SQuAD, SuperGLUE and XNLI.
  • These results illustrate the importance of these previously overlooked design decisions and suggest that BERT’s pretraining objective remains competitive with recently proposed alternatives
Tables
  • Table 1: Development set results for base models pretrained over BOOKCORPUS and WIKIPEDIA. All models are trained for 1M steps with a batch size of 256 sequences. We report F1 for SQuAD and accuracy for MNLI-m, SST-2 and RACE. Reported results are medians over five random initializations (seeds). Results for BERTBASE and XLNetBASE are from Yang et al. (2019).
  • Table 2: Perplexity on held-out validation data and dev set accuracy on MNLI-m and SST-2 for various batch sizes (# sequences) as we vary the number of passes (epochs) through the BOOKS + WIKI data. Reported results are medians over five random initializations (seeds). The learning rate is tuned for each batch size. All results are for BERTBASE with FULL-SENTENCES inputs.
  • Table 3: Development set results for RoBERTa as we pretrain over more data (16GB → 160GB of text) and pretrain for longer (100K → 300K → 500K steps). Each row accumulates improvements from the rows above. RoBERTa matches the architecture and training objective of BERTLARGE. Results for BERTLARGE and XLNetLARGE are from Devlin et al. (2019) and Yang et al. (2019), respectively. Complete results on all GLUE tasks can be found in Appendix C.
  • Table 4: Results on GLUE. All results are based on a 24-layer architecture. BERTLARGE and XLNetLARGE results are from Devlin et al. (2019) and Yang et al. (2019), respectively. RoBERTa results on the dev set are a median over five runs. RoBERTa results on the test set are ensembles of single-task models. For RTE, STS and MRPC we finetune starting from the MNLI model.
  • Table 5: Results on SQuAD. † indicates results that depend on additional external training data. RoBERTa uses only the provided SQuAD data in both dev and test settings. BERTLARGE and XLNetLARGE results are from Devlin et al. (2019) and Yang et al. (2019), respectively.
  • Table 6: Results on the RACE test set. BERTLARGE and XLNetLARGE results are from Yang et al. (2019).
  • Table 7: Comparison of the published BERTBASE results from Devlin et al. (2019) to our reimplementation with either static or dynamic masking. We report F1 for SQuAD and accuracy for MNLI-m and SST-2. Reported results are medians over 5 random initializations (seeds). Reference results are from Yang et al. (2019). We find that our reimplementation with static masking performs similarly to the original BERT model, and that dynamic masking is comparable to or slightly better than static masking.
  • Table 8: Development set results on GLUE tasks for various configurations of RoBERTa. All results are a median over five runs.
  • Table 9: Hyperparameters for pretraining RoBERTaLARGE and RoBERTaBASE.
  • Table 10: Hyperparameters for finetuning RoBERTaLARGE on RACE, SQuAD and GLUE. We select the best hyperparameter values based on the median of 5 random seeds for each task.
  • Table 11: Results on SuperGLUE. All results are based on a 24-layer architecture. RoBERTa results on the development set are a median over five runs. RoBERTa results on the test set are ensembles of single-task models. Averages are obtained from the SuperGLUE leaderboard.
  • Table 12: Results on XNLI (Conneau et al., 2018) for RoBERTaLARGE in the TRANSLATE-TEST setting. We report macro-averaged accuracy (∆) using the provided English translations of the XNLI test sets. RoBERTa achieves state-of-the-art results on all 15 languages.
References
  • Eneko Agirre, Lluís Màrquez, and Richard Wicentowski (eds.). Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). 2007.
  • Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785, 2019.
  • Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The second PASCAL recognising textual entailment challenge. 2006.
  • Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. The fifth PASCAL recognizing textual entailment challenge. 2009.
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Empirical Methods in Natural Language Processing (EMNLP), 2015.
  • William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit. KERMIT: Generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604, 2019.
  • Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT 2019, 2019.
  • Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018.
  • Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. To appear in Proceedings of Sinn und Bedeutung 23, 2019. Data can be found at https://github.com/mcdm/CommitmentBank/.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL), 2019.
  • William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing, 2005.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197, 2019.
  • Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pp. 1–9. Association for Computational Linguistics, 2007.
  • Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://web.archive.org/save/http://Skylion007.github.io/OpenWebTextCorpus, 2019.
  • Felix Hamborg, Norman Meuschke, Corinna Breitinger, and Bela Gipp. news-please: A generic news crawler and extractor. In Proceedings of the 15th International Symposium of Information Science, 2017.
  • Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
  • Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.
  • Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.
  • Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. First Quora dataset release: Question pairs. https://data.quora.com/First-Quora-Dataset-Release-QuestionPairs, 2016.
  • Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529, 2019.
  • Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252–262, 2018.
  • Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. A surprisingly robust trick for Winograd schema challenge. arXiv preprint arXiv:1905.06290, 2019.
  • Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
  • Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.
  • Hector J. Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, pp. 47, 2011.
  • Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482, 2019a.
  • Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019b.
  • Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
  • Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems (NIPS), pp. 6297–6308, 2017.
  • Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018.
  • Sebastian Nagel. CC-News. http://web.archive.org/save/http://commoncrawl.org/2016/10/news-dataset-available, 2016.
  • Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation (WMT), 2018.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. FAIRSEQ: A fast, extensible toolkit for sequence modeling. In North American Association for Computational Linguistics (NAACL): System Demonstrations, 2019.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL), 2018.
  • Jason Phang, Thibault Févry, and Samuel R. Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088, 2018.
  • Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of NAACL-HLT, 2019.
  • Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of EMNLP, 2018.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP), 2016.
  • Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for SQuAD. In Association for Computational Linguistics (ACL), 2018.
  • Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.
  • Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in coreference resolution. In Proceedings of NAACL-HLT, 2018.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Association for Computational Linguistics (ACL), pp. 1715–1725, 2016.
  • Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP), 2013.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning (ICML), 2019.
  • Yu Stephanie Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xinlun Tian, Danxiang Zhu, Hao Tian, and Hua Wu. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.
  • Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847, 2018.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537, 2019a.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR), 2019b.
  • Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471, 2018.
  • Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In North American Association for Computational Linguistics (NAACL), 2018.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
  • Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Reducing BERT pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
  • Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. arXiv preprint arXiv:1905.12616, 2019.
  • Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885, 2018.
  • Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724, 2015.