LAMOL: LAnguage MOdeling for Lifelong Language Learning

ICLR, 2020.

Keywords: NLP, Deep Learning, Lifelong Learning

Abstract:

Most research on lifelong learning applies to images or games, but not language. We present LAMOL, a simple yet effective method for lifelong language learning (LLL) based on language modeling. LAMOL replays pseudo-samples of previous tasks while requiring no extra memory or model capacity. Specifically, LAMOL is a language model that simultaneously learns to solve the tasks and generate training samples.
Introduction
  • The current dominant paradigm for machine learning is to run an algorithm on a given dataset to produce a trained model for a particular purpose; this is isolated learning (Chen & Liu, 2016, p. 150).
  • The model is unable to retain and accumulate the knowledge it has learned before.
  • Lifelong learning is designed to address a stream of tasks by accumulating interconnected knowledge between learned tasks and retaining the performance of those tasks.
  • We focus on lifelong language learning, where a machine achieves lifelong learning on a stream of natural language processing (NLP) tasks.
  • To achieve lifelong language learning on fundamentally different tasks, we propose LAMOL — LAnguage MOdeling for Lifelong Language Learning.
Highlights
  • The current dominant paradigm for machine learning is to run an algorithm on a given dataset to produce a trained model for a particular purpose; this is isolated learning (Chen & Liu, 2016, p. 150).
  • We focus on lifelong language learning, where a machine achieves lifelong learning on a stream of natural language processing (NLP) tasks
  • To achieve lifelong language learning on fundamentally different tasks, we propose LAMOL — LAnguage MOdeling for Lifelong Language Learning.
  • The GPT-2 model has the potential for superior lifelong language learning performance, as long as we can prevent catastrophic forgetting
  • We propose LAMOL, a simple yet effective method for lifelong language learning based on language modeling (see the sketch below).
  • Any pre-trained language model can be used to leverage a large amount of unlabeled text to improve lifelong language learning
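  • To make this language-modeling formulation concrete, the following minimal Python sketch shows how a decaNLP-style (context, question, answer) example can be serialized for the two objectives a single LM is trained on: answering the question (QA format) and regenerating the whole sample from a generation token (LM format). The special-token strings and helper names are illustrative assumptions, not the paper's exact implementation.

    GEN, ANS, EOS = "[GEN]", "[ANS]", "[EOS]"  # illustrative special tokens

    def qa_format(context, question, answer):
        # QA objective: the model reads the context and question and is
        # trained to produce the answer.
        return f"{context} {question} {ANS}", f"{answer} {EOS}"

    def lm_format(context, question, answer, task_token=GEN):
        # LM objective: model the full sample; generating from `task_token`
        # later yields pseudo-samples of this task for replay.
        return f"{task_token} {context} {question} {ANS} {answer} {EOS}"

    prompt, target = qa_format("the movie was heartfelt and funny .",
                               "is this review positive or negative ?", "positive")
    print(lm_format("the movie was heartfelt and funny .",
                    "is this review positive or negative ?", "positive"))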
Methods
  • METHODS TO BE COMPARED

    All methods use the smallest pre-trained GPT-2 model (Radford et al., 2019) as the LM.
  • LAMOL^γ_GEN denotes LAMOL with a sampling ratio of γ, where the same GEN token is used for all tasks.
  • If the task-specific tokens are used, GEN is replaced by TASK.
  • Instead of generated pseudo-samples, a quota of real samples from previous tasks can be replayed; this quantity of real samples is split among the previous tasks (see the sketch below).
  • This approach can be considered the upper bound of LAMOL.
  • We denote it as LAMOL^γ_REAL.
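  • A minimal Python sketch of this sampling-ratio bookkeeping follows; the helper lm_generate, the task_tokens list, and the even split across previous tasks are illustrative assumptions rather than the paper's exact implementation.

    import random

    def pseudo_replay(lm_generate, gamma, new_data, task_tokens):
        # LAMOL^γ_GEN / LAMOL^γ_TASK: before training on the new task, generate
        # gamma * |new task| pseudo-samples by prompting the LM with the GEN
        # token (or with one task-specific token per previous task).
        budget = int(gamma * len(new_data))
        return [lm_generate(random.choice(task_tokens)) for _ in range(budget)]

    def real_replay(gamma, new_data, previous_datasets):
        # LAMOL^γ_REAL (upper bound): the same budget, but drawn from the real
        # training sets of previous tasks, split evenly among them (assumed).
        budget = int(gamma * len(new_data))
        per_task = budget // max(len(previous_datasets), 1)
        return [ex for old in previous_datasets
                for ex in random.sample(old, min(per_task, len(old)))]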
Results
  • 5.1 SINGLE TASK

    To establish a reference on the capability of the GPT-2 model on every dataset, we trained the model on each dataset independently.
  • We observe that the performance of the GPT-2 model is quite good, even beating the BERT-based model (d’Autume et al., 2019) on text classification datasets by a large margin.
  • 5.2 SST, QA-SRL, AND WOZ TASKS
  • For an initial understanding of the performance of all the methods and the effect of task order, we first conducted a small-scale experiment on three small datasets: SST, QA-SRL, and WOZ.
  • The results are shown in Table 3; we make several observations.
Conclusion
  • We propose LAMOL, a simple yet effective method for LLL based on language modeling.
  • A single LM achieves LLL without additional model components and without keeping old examples.
  • Any pre-trained LM can be used to leverage a large amount of unlabeled text to improve LLL.
  • More tasks can be added whenever needed.
Tables
  • Table1: Summary of tasks, datasets, dataset sizes, and their corresponding metrics. As this work uses no development set, only the training and test datasets are shown. nF1 is the normalized version of the F1 score; EM represents an exact match between texts: for text classification, this amounts to accuracy; for WOZ, it is equivalent to dfEM (turn-based dialogue state exact match); for WikiSQL, it is equivalent to lfEM (exact match of logical forms). A sketch of the EM and nF1 computations follows this table list
  • Table2: Comparison of GPT-2 and other methods on single task scores. Other scores are retrieved from Bryan McCann & Socher (2018) or d’Autume et al (2019). Better performance in boldface
  • Table3: Summary of averaged metric scores for different methods under permuted task orders using models at last epoch of last task. The Average and Std columns respectively are the average and standard deviation of the averaged scores for each row of the methods. Multitasked learning as an upper bound is shown at the bottom
  • Table4: Summary of averaged score on five tasks. The scores are reported as the averaged score over all tasks of the models after training on every task. The rightmost three columns – LAMOL with γ = 0.05 and γ = 0.2 of real samples from previous tasks and Multitasked – are upper bounds for comparison. Best performance in boldface
  • Table5: Summary of results on text classification tasks using averaged EM score (equivalent to averaged accuracy in d’Autume et al. (2019)) of models at the last epoch of the last task. The four orders mirror those in d’Autume et al. (2019). For MBPA++ (our impl.) and LAMOL^0.2_TASK, the results are averaged over two runs. The p-value of a paired t-test between the eight numbers of MBPA++ (our impl.) and LAMOL^0.2_TASK is smaller than 1%, which shows a significant difference. Our implementation of MBPA++ is available at https://github.com/Daikon-Sun/EM-in-LLL
  • Table6: Summary of averaged score on the reversed five tasks. The scores are reported as the averaged score over all tasks of the models after training on every task. The rightmost three columns – LAMOL with γ = 0.05 and γ = 0.2 of real samples from previous tasks – are upper bounds for comparison. Best performance in boldface
  • Table7: Examples generated by LAMOL with task-specific tokens. The annotations squad, wikisql, sst, and srl correspond to the task-specific tokens of SQuAD, WikiSQL, SST, and QA-SRL, respectively
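  • As a reference for the metrics in Table 1, here is a Python sketch of exact match (EM) and token-level F1 with the usual SQuAD/decaNLP-style normalization; this is the standard recipe rather than the paper's exact evaluation scripts, and for WOZ and WikiSQL the comparison is over dialogue states and logical forms rather than raw strings.

    import re
    import string
    from collections import Counter

    def normalize(text):
        # Lowercase, drop punctuation and articles, collapse whitespace.
        text = text.lower()
        text = "".join(ch for ch in text if ch not in set(string.punctuation))
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def exact_match(prediction, gold):
        return float(normalize(prediction) == normalize(gold))

    def token_f1(prediction, gold):
        # nF1 applies token-level F1 to the normalized texts.
        pred, ref = normalize(prediction).split(), normalize(gold).split()
        overlap = sum((Counter(pred) & Counter(ref)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(ref)
        return 2 * precision * recall / (precision + recall)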
Related work
  • Lifelong learning research generally falls into three categories: regularization-based, architecture-based, and data-based methods. Here is a brief survey of works in these three categories.

    2.1 REGULARIZATION-BASED METHODS

    In this approach, a constraint, i.e., a regularization term, is added to minimize deviation from trained weights while updating the weights on a new task. Most regularization-based methods estimate the importance of each parameter and add the importance as a constraint to the loss function. Elastic weight consolidation (EWC) (Kirkpatrick et al., 2017) calculates a Fisher information matrix to estimate the sensitivity of parameters as their importance. Online EWC (Schwarz et al., 2018) is an online variant of EWC: instead of tracking the importance of parameters for each task, it simply accumulates the importance over the stream of tasks. Synaptic intelligence (SI) (Zenke et al., 2017) assigns importance to each parameter according to its contribution to the change in the total loss. Memory aware synapses (MAS) (Aljundi et al., 2018) estimate importance via the gradients of the model outputs. In contrast to estimating the importance of weights, incremental moment matching (IMM) (Lee et al., 2017) matches the moments of the weights between different tasks.
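    For concreteness, the EWC objective when learning a new task B after a task A adds a quadratic penalty weighted by the diagonal Fisher information (notation follows Kirkpatrick et al., 2017):

        \mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \sum_i \frac{\lambda}{2} F_i \left( \theta_i - \theta_{A,i}^{*} \right)^2

    where \mathcal{L}_B is the loss on the new task B, \theta_{A}^{*} are the parameters learned on task A, F_i is the estimated importance of parameter i, and \lambda controls the strength of the regularization. The other regularization-based methods above differ mainly in how they estimate the importance term F_i.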
Funding
  • This work was supported by the Ministry of Science and Technology of Taiwan
Study subjects and analysis
small datasets: 3
For an initial understanding of the performance of all the methods and the effect of task order, we first conducted a small-scale experiment on three small datasets: SST, QA-SRL, and WOZ. We trained all but the multitasked method on all six permutations of the task order.

Reference
  • Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 139–154, 2018.
  • Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
  • Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420, 2018.
  • Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2Net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015a.
  • Zhiyuan Chen and Bing Liu. Lifelong Machine Learning. Morgan & Claypool Publishers, 2016.
  • Zhiyuan Chen, Nianzu Ma, and Bing Liu. Lifelong learning for sentiment classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2015b.
  • Cyprien de Masson d’Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. Episodic memory in lifelong language learning. arXiv preprint arXiv:1906.01076, 2019.
  • Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
  • Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. arXiv preprint arXiv:1705.08690, 2017.
  • Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. Deep semantic role labeling: What works and what's next. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 473–483, 2017.
  • Ronald Kemker and Christopher Kanan. FearNet: Brain-inspired model for incremental learning. arXiv preprint arXiv:1711.10563, 2017.
  • James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
  • Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pp. 4652–4662, 2017.
  • Sungjin Lee. Toward continual learning for conversational agents. arXiv preprint, 2017.
  • Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.
  • Tianlin Liu, Lyle Ungar, and Joao Sedoc. Continual learning for sentence representations using conceptors. In NAACL-HLT, 2019.
  • David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476, 2017.
  • Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7765–7773, 2018.
  • Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
  • Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
  • Jonathan Schwarz, Jelena Luketina, Wojciech M. Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370, 2018.
  • Shagun Sodhani, Sarath Chandar, and Yoshua Bengio. On training recurrent neural networks for lifelong learning. arXiv preprint arXiv:1811.07017, 2018.
  • Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562, 2016.
  • R. Xia, J. Jiang, and H. He. Distantly supervised lifelong learning for large-scale social media sentiment analysis. IEEE Transactions on Affective Computing, 8(4):480–491, 2017.
  • Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. arXiv preprint arXiv:1509.01626, 2015.
  • Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. Lifelong domain word embedding via meta-learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018.
  • Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, pp. 3987–3995. JMLR.org, 2017.
  • Victor Zhong, Caiming Xiong, and Richard Socher. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103, 2017.
  • Five tasks and their corresponding datasets from decaNLP (Bryan McCann & Socher, 2018).
  • Four text classification tasks and five datasets from MBPA++ (d’Autume et al., 2019).
  • The dataset collection of Xiang Zhang et al. (2015) is available at http://goo.gl/JyCnZq. Given the unbalanced dataset sizes, we randomly sample 115,000 training examples and 7,600 test examples from all the datasets per d’Autume et al. (2019). All the tasks use exact match accuracy as the evaluation metric.