ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation

Xiao Dongling
Zhang Han
Li Yukun
Sun Yu
Tian Hao

IJCAI, pp. 3997-4003, 2020.


Abstract:

Current pre-training works in natural language generation pay little attention to the problem of exposure bias on downstream tasks. To address this issue, we propose an enhanced multi-flow sequence to sequence pre-training and fine-tuning framework named ERNIE-GEN, which bridges the discrepancy between training and inference with an infilling generation mechanism and a noise-aware generation method.

Introduction
  • Pre-trained on large-scale unlabeled text corpora and finetuned on downstream tasks, self-supervised representation models such as GPT [Radford et al, 2018], BERT [Devlin et al, 2019] and XLNet [Yang et al, 2019b] have achieved remarkable improvements in natural language understanding (NLU).
  • Different from encoder-only pre-training like BERT or decoder-only pre-training like GPT, natural language generation (NLG) relies on the sequence to sequence generation framework which consists of a bidirectional encoder and a unidirectional decoder.
  • Current pre-training works in NLG such as MASS [Song et al, 2019] and UNILM [Dong et al, 2019] mainly focus on jointly pre-training encoder and decoder on different self-supervised tasks.
  • [Figure: (a) typical generation mechanism]
Highlights
  • Pre-trained on large-scale unlabeled text corpora and finetuned on downstream tasks, self-supervised representation models such as GPT [Radford et al, 2018], BERT [Devlin et al, 2019] and XLNet [Yang et al, 2019b] have achieved remarkable improvements in natural language understanding (NLU)
  • ERNIE-GEN is effective and achieves state-of-the-art results on a range of natural language generation tasks including abstractive summarization (Gigaword and CNN/DailyMail), question generation (SQuAD), dialogue generation (Persona-Chat) and generative question answering (CoQA), utilizing a much smaller amount of pre-training data and parameters
  • We present an enhanced multi-flow seq2seq pre-training and fine-tuning framework (ERNIE-GEN) for language generation, which incorporates an infilling generation mechanism and a noise-aware generation method to alleviate exposure bias (a minimal noising sketch follows this list)
  • ERNIE-GEN integrates a new span-by-span generation task to train the model to generate texts like human writing, which further improves the performance on downstream tasks
  • ERNIE-GEN achieves state-of-the-art results on a range of natural language generation tasks
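The noise-aware generation method highlighted above can be pictured with a minimal sketch: during training, a small fraction ρ of the ground-truth target tokens fed to the decoder is replaced with random vocabulary tokens, so the model learns to keep generating sensibly even when earlier tokens are wrong. The function below is a hypothetical illustration of this idea only, not ERNIE-GEN's actual implementation; the noising rate of 0.05 used in the example is the pre-training value ρp reported in the Methods section.

```python
import random

def noise_target(target_ids, vocab_size, rho, rng=None):
    """Replace a fraction `rho` of ground-truth target token ids with random
    vocabulary ids. Training on such corrupted prefixes exposes the model to
    mistakes it may make at inference time, which is the intuition behind
    noise-aware generation (ERNIE-GEN's exact recipe may differ)."""
    rng = rng or random.Random(0)
    noised = list(target_ids)
    for i in range(len(noised)):
        if rng.random() < rho:
            noised[i] = rng.randrange(vocab_size)  # swap in an arbitrary token id
    return noised

# Pre-training uses a small noising rate (rho_p = 0.05 per the paper).
print(noise_target([11, 42, 7, 99, 3], vocab_size=30522, rho=0.05))
```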
Methods
  • The authors compare ERNIE-GEN with previous works and conduct several ablation experiments to assess the performance of the methods proposed in §3.

    4.1 Pre-training and Implementation. Analogous to BERT and UNILM, ERNIE-GEN is trained on English Wikipedia and BookCorpus [Zhu et al, 2015], totaling 16GB.
  • The authors train a base model ERNIE-GENBASE (L=12, H=768, A=12, total parameters=110M) and a large model ERNIE-GENLARGE (L=24, H=1024, A=16, total parameters=340M), with parameters initialized by BERTBASE and BERTLARGE respectively.
  • The peak learning rate is 5e-5 with warmup over the first 4,000 steps and linear decay scheduling (a minimal schedule sketch follows this list).
  • The noising rate ρp for pre-training is 0.05.
  • Pre-training experiments are carried out on the PaddlePaddle platform and Nvidia Tesla V100 GPUs.
  • With float16 mixed-precision training, it takes almost 4 days over 400,000 steps to train ERNIE-GENBASE and almost 7 days over 450,000 steps to train ERNIE-GENLARGE.
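The learning-rate schedule described in this section (peak 5e-5, linear warmup over the first 4,000 steps, then linear decay) can be written as a plain function; the sketch below is framework-agnostic and only assumes decay to zero at the final step, with the 400,000/450,000 total steps taken from the text.

```python
def lr_at_step(step, peak_lr=5e-5, warmup_steps=4000, total_steps=400000):
    """Linear warmup to `peak_lr`, then linear decay to zero at `total_steps`
    (400k steps for ERNIE-GENBASE, 450k for ERNIE-GENLARGE)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

print(lr_at_step(2_000))    # mid-warmup: 2.5e-05
print(lr_at_step(4_000))    # peak: 5e-05
print(lr_at_step(202_000))  # halfway through decay: 2.5e-05
```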
Results
  • Evaluation metrics by task (a minimal Distinct-n sketch follows these results):

    Question generation (SQuAD): BLEU-4, METEOR (MTR), ROUGE-L (RG-L)
    Abstractive summarization (Gigaword, CNN/DailyMail): ROUGE F1 scores: ROUGE-1 (RG-1), ROUGE-2 (RG-2), ROUGE-L (RG-L)
    Dialogue generation (Persona-Chat): BLEU-1, BLEU-2, Distinct-1, Distinct-2
    Generative question answering (CoQA): F1-score

    Comparison with state-of-the-art results on CNN/DailyMail, ROUGE-1 / ROUGE-2 / ROUGE-L (Table 3):

    MASS [Song et al, 2019]: 42.12 / 19.50 / 39.01
    BERTSUMABS [Liu and Lapata, 2019] (16G): 41.72 / 19.39 / 38.76
    UNILMLARGE [Dong et al, 2019]: 43.33 / 20.21 / 40.51
    T5LARGE [Raffel et al, 2019]: 42.50 / 20.68 / 39.75
    T5XLARGE [Raffel et al, 2019]: 43.52 / 21.55 / 40.69
    BARTLARGE [Lewis et al, 2019]: 44.16 / 21.28 / 40.90
    PEGASUS(C4) [Zhang et al, 2019]: 43.90 / 21.20 / 40.76
    PEGASUS(HugeNews) [Zhang et al, 2019]: 44.17 / 21.47 / 41.11
    ERNIE-GENBASE (16G data, 110M params): 42.30 / 19.92 / 39.68
    ERNIE-GENLARGE (16G data, 340M params): 44.02 / 21.17 / 41.26

    The results on Gigaword with two scales (10k and 3.8M) are presented in Table 2, and the fine-tuning settings are shown in Table 1.
  • On the low-resource task (Gigaword 10k), ERNIE-GENBASE outperforms UNILMLARGE by +0.79 points in ROUGE-L, while ERNIE-GENLARGE yields a gain of +1.94 ROUGE-L over UNILMLARGE.
  • On the full Gigaword dataset, ERNIE-GENLARGE achieves state-of-the-art results, outperforming various previous methods.
  • It is interesting to see that as model size scales up, the gains on low-resource tasks become more remarkable.
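For reference, the Distinct-n scores reported on Persona-Chat are commonly computed as the number of distinct n-grams divided by the total number of generated n-grams. The snippet below is a minimal implementation of that common definition, not the paper's evaluation script.

```python
def distinct_n(generations, n):
    """Ratio of unique n-grams to total n-grams over tokenized generations
    (the usual definition of Distinct-1 / Distinct-2)."""
    total, unique = 0, set()
    for tokens in generations:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

outputs = [["i", "like", "music"], ["i", "like", "dogs", "too"]]
print(distinct_n(outputs, 1))  # 5 unique unigrams / 7 total = 0.714...
print(distinct_n(outputs, 2))  # 4 unique bigrams / 5 total = 0.8
```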
Conclusion
  • The authors present an enhanced multi-flow seq2seq pre-training and fine-tuning framework (ERNIE-GEN) for language generation, which incorporates an infilling generation mechanism and a noise-aware generation method to alleviate exposure bias.
  • ERNIE-GEN integrates a new span-by-span generation task to train the model to generate texts like human writing, which further improves the performance on downstream tasks (a toy span-sampling sketch follows this list).
  • ERNIE-GEN achieves state-of-the-art results on a range of NLG tasks
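The span-by-span generation task trains the model to predict a semantically complete span at each step rather than a single token. The toy sketch below only shows how a target sequence might be segmented into consecutive short spans to serve as span-level prediction targets; the 1-to-3-token span lengths are an illustrative assumption, not the paper's actual sampling scheme.

```python
import random

def sample_spans(target_ids, max_span_len=3, rng=None):
    """Segment a target sequence into consecutive short spans; in span-by-span
    generation the model predicts one whole span per step instead of one
    token (span lengths here are illustrative only)."""
    rng = rng or random.Random(0)
    spans, i = [], 0
    while i < len(target_ids):
        length = rng.randint(1, max_span_len)  # hypothetical length sampling
        spans.append(target_ids[i:i + length])
        i += length
    return spans

print(sample_spans(list(range(10))))  # e.g. [[0, 1], [2, 3, 4], ...]
```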
Tables
  • Table1: Hyperparameters of fine-tuning for ERNIE-GENBASE and ERNIE-GENLARGE
  • Table2: Comparison on the Gigaword dataset with state-of-the-art results. Models in the upper block use 10k samples for fine-tuning. We also report the size of pre-training data and parameters utilized for each listed model (columns 2-3). RG is short for ROUGE
  • Table3: Evaluation results on CNN/DailyMail. C4 and HugeNews are two massive datasets of 750G and 3.8T respectively
  • Table4: SQuAD QG results. Models in the upper block and lower block use different test ↔ dev split method
  • Table5: Comparison with state-of-the-art results on Persona-Chat
  • Table6: Generative question answering results on the development set of CoQA
  • Table7: Ablation study for ERNIE-GENBASE and its variants. In particular, ERNIE-GEN sets ρp = 0.05 in pre-training (row 1), while when the span-by-span generation task is removed (row 3), ρp is set to 0.2 because training becomes easier
  • Table8: Results of models pre-trained with typical generation and infilling generation. Tasks in the upper block are fine-tuned without noising, while the others are fine-tuned with noise-aware generation
Related work
  • Pre-Training for NLP Tasks. Recently, pre-training methods have achieved state-of-the-art results on multiple NLU tasks. ELMo [Peters et al, 2018] pre-trains two unidirectional language models (LMs), one forward and one backward, to provide contextual features for downstream tasks. GPT utilizes an adjusted Transformer [Vaswani et al, 2017] to learn a forward LM and then fine-tunes it on supervised datasets. BERT proposes a masked language modeling (MLM) task to learn deep bidirectional representations. Nevertheless, the above methods employ only an encoder or only a decoder, which is less effective for encoder-decoder based generation tasks; thus several works have preliminarily explored pre-training for NLG by incorporating BERT's MLM into the seq2seq framework and have shown excellent performance on a range of generation tasks. MASS masks a consecutive fragment (50%) of the input sentence with artificial [MASK] symbols and predicts it. UNILM masks several words in the input sequence, which is a pair of segments for the encoder and decoder, and then predicts the masked words in accordance with BERT's MLM. BART [Lewis et al, 2019] corrupts the input sequence and trains the model to generate the original sequence as a denoising autoencoder.
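The masking strategies contrasted in this paragraph differ mainly in what gets masked. The toy sketch below juxtaposes BERT-style masking of random individual tokens with MASS-style masking of a single consecutive fragment covering roughly half the sentence; the masking rate and whitespace tokenization are illustrative assumptions rather than the original recipes.

```python
import random

MASK = "[MASK]"

def bert_style_mask(tokens, rate=0.15, rng=None):
    """Mask random individual tokens, in the spirit of BERT's MLM."""
    rng = rng or random.Random(0)
    return [MASK if rng.random() < rate else t for t in tokens]

def mass_style_mask(tokens, fraction=0.5, rng=None):
    """Mask one consecutive fragment (about `fraction` of the sentence),
    in the spirit of MASS; the decoder then reconstructs that fragment."""
    rng = rng or random.Random(0)
    length = max(1, int(len(tokens) * fraction))
    start = rng.randrange(0, len(tokens) - length + 1)
    return tokens[:start] + [MASK] * length + tokens[start + length:]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(bert_style_mask(sentence))
print(mass_style_mask(sentence))
```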
Reference
  • [Bao et al., 2019] Siqi Bao, Huang He, Fan Wang, and Hua Wu. Plato: Pre-trained dialogue generation model with discrete latent variable. arXiv preprint arXiv:1910.07931, 2019.
  • [Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186, 2019.
  • [Dong et al., 2019] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. In NIPS, pages 13042–13054, 2019.
  • [Hermann et al., 2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In NIPS, pages 1693–1701, 2015.
  • [Joshi et al., 2019] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. Spanbert: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529, 2019.
  • [Lewis et al., 2019] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
  • [Liu and Lapata, 2019] Yang Liu and Mirella Lapata. Text summarization with pretrained encoders. In EMNLP-IJCNLP, pages 3721–3731, 2019.
  • [Peters et al., 2018] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, pages 2227–2237, 2018.
  • [Radford et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
  • [Raffel et al., 2019] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
  • [Rajpurkar et al., 2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383–2392, 2016.
  • [Ranzato et al., 2016] Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
  • [Reddy et al., 2019] Siva Reddy, Danqi Chen, and Christopher D Manning. Coqa: A conversational question answering challenge. In ACL, pages 249–266, 2019.
  • [Rothe et al., 2019] Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. Leveraging pre-trained checkpoints for sequence generation tasks. arXiv preprint arXiv:1907.12461, 2019.
  • [Rush et al., 2015] Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In EMNLP, 2015.
  • [Song et al., 2019] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mass: Masked sequence to sequence pre-training for language generation. In ICML, 2019.
  • [Sun et al., 2019] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.
  • [Sun et al., 2020] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. Ernie 2.0: A continual pre-training framework for language understanding. In AAAI, 2020.
  • [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
  • [Wang et al., 2018] Li Wang, Junlin Yao, Yunzhe Tao, Li Zhong, Wei Liu, and Qiang Du. A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization. In IJCAI, 2018.
  • [Wang et al., 2019] Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. Topic-guided variational autoencoder for text generation. In NAACL-HLT, 2019.
  • [Yang et al., 2019a] Qian Yang, Dinghan Shen, Yong Cheng, Wenlin Wang, Guoyin Wang, Lawrence Carin, et al. An end-to-end generative architecture for paraphrase generation. In EMNLP-IJCNLP, pages 3123–3133, 2019.
  • [Yang et al., 2019b] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. In NIPS, pages 5754–5764, 2019.
  • [Zhang and Bansal, 2019] Shiyue Zhang and Mohit Bansal. Addressing semantic drift in question generation for semi-supervised question answering. In EMNLP-IJCNLP, pages 2495–2509, 2019.
  • [Zhang et al., 2018] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? In ACL, pages 2204–2213, 2018.
  • [Zhang et al., 2019] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J Liu. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777, 2019.
  • [Zhao et al., 2018] Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In EMNLP, pages 3901–3910, 2018.
  • [Zhu et al., 2015] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, pages 19–27, 2015.