
ERNIE 2.0: A Continual Pre-training Framework for Language Understanding

National Conference on Artificial Intelligence (AAAI), 2020


Abstract

Recently, pre-trained models have achieved state-of-the-art results in various language understanding tasks, which indicates that pre-training on large-scale corpora may play a crucial role in natural language processing. Current pre-training procedures usually focus on training the model with several simple tasks to grasp the co-occurrence of words or sentences. […]

Introduction
  • Pre-trained language representations such as ELMo [1], OpenAI GPT [2], BERT [3], ERNIE 1.0 [4] and XLNet [5] have proven effective for improving the performance of various natural language understanding tasks, including sentiment classification [6], natural language inference [7] and named entity recognition [8].
  • In order to discover all valuable information in training corpora, be it lexical, syntactic or semantic representations, the authors propose a continual pre-training framework named ERNIE 2.0, which can incrementally build and train a large variety of pre-training tasks through constant multi-task learning.
  • The authors' ERNIE framework supports the introduction of various customized tasks at any time.
  • These tasks share the same encoding networks and are trained through multi-task learning (see the sketch below).
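The continual multi-task scheme described in these bullets can be made concrete with a short sketch. The PyTorch code below is not the authors' implementation (their released ERNIE code is based on PaddlePaddle): it only illustrates one shared Transformer encoder, a growing dictionary of task heads, and training each newly introduced task jointly with all tasks seen so far. The task names are loosely modeled on pre-training tasks mentioned in the paper; the class counts, toy batches and hyper-parameters are invented for the example.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, MAX_LEN = 1000, 128, 32

class SharedEncoder(nn.Module):
    """Transformer encoder shared by every pre-training task."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))      # (batch, seq, hidden)

def make_toy_batch(num_classes, batch_size=8):
    """Random stand-in for a real pre-training batch with a sentence-level label."""
    tokens = torch.randint(0, VOCAB, (batch_size, MAX_LEN))
    labels = torch.randint(0, num_classes, (batch_size,))
    return tokens, labels

encoder = SharedEncoder()
heads = nn.ModuleDict()              # one lightweight classification head per task
loss_fn = nn.CrossEntropyLoss()

# Tasks arrive one after another; names are loosely modeled on the paper's
# structure- and semantic-aware tasks, class counts are invented.
task_stream = [("sentence_reordering", 24), ("sentence_distance", 3), ("ir_relevance", 3)]

seen = []                            # tasks introduced so far
for name, num_classes in task_stream:
    seen.append((name, num_classes))
    heads[name] = nn.Linear(HIDDEN, num_classes)        # head for the new task
    optim = torch.optim.Adam(list(encoder.parameters()) + list(heads.parameters()), lr=1e-4)
    # Continual multi-task learning: the new task is trained *together with*
    # all previously introduced tasks, so earlier knowledge is not forgotten.
    for step in range(5):
        for task, n_cls in seen:
            tokens, labels = make_toy_batch(n_cls)
            pooled = encoder(tokens).mean(dim=1)        # crude pooling over tokens
            loss = loss_fn(heads[task](pooled), labels)
            optim.zero_grad()
            loss.backward()
            optim.step()
    print(f"stage done, tasks trained jointly: {[t for t, _ in seen]}")
```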
Highlights
  • Pre-trained language representations such as ELMo [1], OpenAI GPT [2], BERT [3], ERNIE 1.0 [4] and XLNet [5] have proven effective for improving the performance of various natural language understanding tasks, including sentiment classification [6], natural language inference [7] and named entity recognition [8].
  • In order to discover all valuable information in training corpora, be it lexical, syntactic or semantic representations, we propose a continual pre-training framework named ERNIE 2.0 which could incrementally build and train a large variety of pre-training tasks through constant multi-task learning.
  • In order to ensure the integrity of the experiments, we evaluate the performance of the base models and the large models of each comparison method on the General Language Understanding Evaluation (GLUE) benchmark.
  • ERNIE 1.0 BASE outperforms BERT BASE on the XNLI, MSRA-NER, ChnSentiCorp, LCQMC and NLPCC-DBQA tasks, yet its performance is less ideal on the rest, which is caused by the difference in pre-training between the two methods.
  • We constructed several pre-training tasks covering different aspects of language and trained a new model called ERNIE 2.0, which is more competent in language representation.
  • ERNIE 2.0 was tested on the GLUE benchmarks and various Chinese tasks.
Methods
  • The authors compare the performance of ERNIE 2.0 with state-of-the-art pre-training models.
  • For English tasks, the authors compare the results with BERT [3] and XLNet [5] on GLUE.
  • For Chinese tasks, the authors compare the results with those of BERT [3] and the previous ERNIE 1.0 [4] model on several Chinese datasets.
  • For the Chinese corpus, the authors collect a variety of data, such as encyclopedia, news, dialogue, information retrieval and discourse relation data, from Baidu Search Engine.
Results
  • In order to ensure the integrity of the experiments, the authors evaluate the performance of the base models and the large models of each comparison method on GLUE.
  • As shown in the BASE model columns of Table 6, ERNIE 2.0 BASE outperforms BERT BASE on all of the 10 tasks and obtains a score of 80.6.
  • ERNIE 1.0 BASE outperforms BERT BASE on the XNLI, MSRA-NER, ChnSentiCorp, LCQMC and NLPCC-DBQA tasks, yet its performance is less ideal on the rest, which is caused by the difference in pre-training between the two methods.
  • ERNIE 2.0 LARGE yields improvements of more than 2 points over BERT BASE on the CMRC 2018, DRCD, DuReader, XNLI, MSRA-NER and LCQMC tasks, and improvements of more than 2 points over ERNIE BASE on the CMRC 2018, DRCD, DuReader and XNLI tasks.
Conclusion
  • The authors proposed a continual pre-training framework named ERNIE 2.0, in which pre-training tasks can be incrementally built and learned through multi-task learning in a continual way.
  • The authors constructed several pre-training tasks covering different aspects of language and trained a new model called ERNIE 2.0, which is more competent in language representation.
  • ERNIE 2.0 was tested on the GLUE benchmarks and various Chinese tasks.
  • The authors will introduce more pre-training tasks to the ERNIE 2.0 framework to further improve the performance of the model.
Tables
  • Table 1: The size of the pre-training datasets
  • Table 2: The details of the GLUE benchmark. #Train, #Dev and #Test denote the sizes of the training, development and test sets of the corresponding corpus; #Label denotes the size of its label set
  • Table 3: The details of the Chinese NLP datasets. #Train, #Dev and #Test denote the sizes of the training, development and test sets of the corresponding corpus; #Label denotes the size of its label set
  • Table 4: The experiment settings for the GLUE datasets
  • Table 5: The experiment settings for the Chinese datasets
  • Table 6: The results on the GLUE benchmark, where the dev-set results are the median of five runs and the test-set results are scored by the GLUE evaluation server (https://gluebenchmark.com/leaderboard). The state-of-the-art results are in bold. All fine-tuned models for AX are trained on the MNLI data
  • Table 7: The results on 9 common Chinese NLP tasks. ERNIE 1.0 indicates our previous model ERNIE [4]. The reported results are the average of five runs, and the state-of-the-art results are in bold
Related Work
  • 2.1 Unsupervised Transfer Learning for Language Representation

    It is effective to learn general language representations by pre-training a language model on a large amount of unannotated data. Traditional methods usually focus on context-independent word embeddings: methods such as Word2Vec [9] and GloVe [10] learn fixed word embeddings from word co-occurrence on large corpora. Recently, several studies centered on contextualized language representations have been proposed, and context-dependent language representations have shown state-of-the-art results in various natural language processing tasks. ELMo [1] proposes to extract context-sensitive features from a language model. OpenAI GPT [2] enhances context-sensitive embeddings by adapting the Transformer [11]. BERT [3] adopts a masked language model while adding a next-sentence prediction task to pre-training. XLM [12] integrates two methods to learn cross-lingual language models, namely an unsupervised method that relies only on monolingual data and a supervised method that leverages parallel bilingual data. MT-DNN [13] achieves better results by jointly learning several supervised tasks in GLUE [14] on top of the pre-trained model, which eventually leads to improvements on other supervised tasks that are not learned in the stage of multi-task supervised fine-tuning. XLNet [5] uses Transformer-XL [15] and proposes a generalized autoregressive pre-training method that learns bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order.
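As a toy illustration of the masked-language-model objective attributed to BERT in the paragraph above, the sketch below masks a fraction of tokens and trains a model to recover them. It is not taken from any of the cited implementations: the vocabulary, the special token ids and the linear layer standing in for a Transformer encoder are assumptions made for brevity; only the 15% masking ratio follows BERT.

```python
import torch
import torch.nn as nn

VOCAB, MASK_ID, PAD_ID = 1000, 1, 0

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted inputs, labels); label -100 marks positions with no prediction."""
    labels = token_ids.clone()
    maskable = token_ids != PAD_ID
    chosen = (torch.rand(token_ids.shape) < mask_prob) & maskable
    labels[~chosen] = -100                 # ignored by CrossEntropyLoss
    corrupted = token_ids.clone()
    corrupted[chosen] = MASK_ID            # (real BERT also keeps/replaces some tokens)
    return corrupted, labels

tokens = torch.randint(2, VOCAB, (4, 16))  # toy batch; ids 0 and 1 are reserved
inputs, labels = mask_tokens(tokens)

embed = nn.Embedding(VOCAB, 64)
lm_head = nn.Linear(64, VOCAB)             # stand-in for a real Transformer encoder
logits = lm_head(embed(inputs))            # (batch, seq, vocab)
loss = nn.CrossEntropyLoss(ignore_index=-100)(logits.view(-1, VOCAB), labels.view(-1))
loss.backward()
print(float(loss))
```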
Funding
  • ERNIE 2.0 LARGE achieves a score of 83.6 on the GLUE test set, a 3.1% improvement over the previous state-of-the-art pre-trained model, BERT LARGE.
Study Subjects and Analysis
  • WNLI: Winograd Natural Language Inference (WNLI) [22] is a corpus that captures the coreference information between two paragraphs.
  • QQP: Quora Question Pairs (QQP) consists of over 400,000 sentence pairs extracted from the Quora QA community and is commonly used for judging whether two questions are duplicates.
  • MRPC: Microsoft Research Paraphrase Corpus (MRPC) [23] contains 5,800 pairs of sentences extracted from news on the Internet, annotated to capture the paraphrase or semantic-equivalence relationship between a pair of sentences. MRPC is commonly used in similar tasks as QQP. An input-formatting sketch for such sentence-pair tasks follows below.
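To make the sentence-pair format of QQP and MRPC concrete, the sketch below packs two sentences into a single sequence with segment ids, in the spirit of the "[CLS] A [SEP] B [SEP]" convention used by BERT-style models. The whitespace tokenizer, growing toy vocabulary and padding scheme are illustrative assumptions, not part of any released tokenizer.

```python
# Illustrative sketch of packing a QQP/MRPC-style sentence pair into one
# sequence for fine-tuning: "[CLS] A [SEP] B [SEP]" plus segment ids.
def encode_pair(sent_a: str, sent_b: str, vocab: dict, max_len: int = 32):
    tokens = ["[CLS]", *sent_a.lower().split(), "[SEP]", *sent_b.lower().split(), "[SEP]"]
    tokens = tokens[:max_len]
    segment_ids, seg = [], 0
    for tok in tokens:
        segment_ids.append(seg)
        if tok == "[SEP]":
            seg = 1                     # tokens after the first [SEP] belong to sentence B
    ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    pad = max_len - len(ids)
    return ids + [0] * pad, segment_ids + [0] * pad

vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}
ids, segs = encode_pair("How do I learn Python?", "What is the best way to learn Python?", vocab)
print(ids[:16], segs[:16])              # a duplicate-question pair would carry label 1
```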

References
  • [1] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
  • [2] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf, 2018.
  • [3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [4] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.
  • [5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
  • [6] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.
  • [7] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
  • [8] Erik F Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050, 2003.
  • [9] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • [10] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [12] Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.
  • [13] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019.
  • [14] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
  • [15] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
  • [16] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
  • [17] Zhiyuan Chen and Bing Liu. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 12(3):1–207, 2018.
  • [18] Damien Sileo, Tim Van-De-Cruys, Camille Pradel, and Philippe Muller. Mining discourse markers for unsupervised sentence representation learning. arXiv preprint arXiv:1903.11850, 2019.
  • [19] Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471, 2018.
  • [20] Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017.
  • [21] Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The fifth PASCAL recognizing textual entailment challenge. In TAC, 2009.
  • [22] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012.
  • [23] William B Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.
  • [24] Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.
  • [25] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
  • [26] Wei Wang, Ming Yan, and Chen Wu. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. arXiv preprint arXiv:1811.11934, 2018.
  • [27] Yiming Cui, Ting Liu, Li Xiao, Zhipeng Chen, Wentao Ma, Wanxiang Che, Shijin Wang, and Guoping Hu. A span-extraction dataset for Chinese machine reading comprehension. CoRR, abs/1810.07366, 2018.
  • [28] Chih Chieh Shao, Trois Liu, Yuting Lai, Yiying Tseng, and Sam Tsai. DRCD: A Chinese machine reading comprehension dataset. arXiv preprint arXiv:1806.00920, 2018.
  • [29] Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, et al. DuReader: A Chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073, 2017.
  • [30] Gina-Anne Levow. The third international Chinese language processing bakeoff: Word segmentation and named entity recognition. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 108–117, 2006.
  • [31] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053, 2018.
  • [32] Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, and Buzhou Tang. LCQMC: A large-scale Chinese question matching corpus. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1952–1962, 2018.
  • [33] Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, and Buzhou Tang. The BQ corpus: A large-scale domain-specific Chinese corpus for sentence semantic equivalence identification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4946–4951, 2018.
Authors
Yu Sun
Shuohuan Wang
Yukun Li
Shikun Feng
Hao Tian