Improved Schemes for Episodic Memory-based Lifelong Learning

NeurIPS 2020.

We show that both Gradient Episodic Memory (GEM) and Averaged Gradient Episodic Memory (A-GEM) are degenerate cases of the proposed MEGA-I and MEGA-II, in which the same emphasis is consistently put on the current task regardless of how the loss changes over time.

Abstract:

Current deep neural networks can achieve remarkable performance on a single task. However, when a deep neural network is continually trained on a sequence of tasks, it tends to gradually forget the previously learned knowledge. This phenomenon is referred to as catastrophic forgetting and motivates the field of lifelong learning. Rece…

Introduction
  • A significant step towards artificial general intelligence (AGI) is to enable the learning agent to acquire the ability to remember past experiences while being trained on a continuum of tasks [3, 4, 5].
  • When the network is retrained on a new task, its performance drops drastically on previously trained tasks, a phenomenon which is referred to as catastrophic forgetting [7, 8, 9, 10, 11, 12, 13, 14].
  • A central dilemma in lifelong learning is how to achieve a balance between the performance on old tasks and the new task [4, 7, 18, 20].
  • In episodic memory-based methods, a small episodic memory stores examples from old tasks and is used to guide the optimization of the current task (a minimal sketch of this mechanism follows).
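The sketch below illustrates this mechanism under stated assumptions: PyTorch, a classification model, and the hypothetical names EpisodicMemory and reference_gradient (buffer size and sampling are illustrative choices, not the paper's implementation). It shows how examples stored from old tasks yield a "reference gradient" g_ref that can then guide the current-task update.

```python
# Minimal sketch (assumptions: PyTorch, a classifier model, hypothetical
# names EpisodicMemory / reference_gradient). A small buffer keeps a few
# examples per old task; the gradient of the loss on a sample drawn from
# this buffer ("g_ref") is what guides the current-task optimization.
import random
import torch
import torch.nn.functional as F

class EpisodicMemory:
    def __init__(self, per_task=256):
        self.per_task = per_task
        self.buffer = {}                      # task_id -> list of (x, y) tensors

    def add(self, task_id, xs, ys):
        store = self.buffer.setdefault(task_id, [])
        for x, y in zip(xs, ys):
            if len(store) < self.per_task:    # keep only the first few examples
                store.append((x, y))

    def sample(self, batch_size=64):
        pool = [ex for store in self.buffer.values() for ex in store]
        batch = random.sample(pool, min(batch_size, len(pool)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)

def reference_gradient(model, memory):
    """Flattened gradient of the loss on memory examples (g_ref)."""
    x_ref, y_ref = memory.sample()
    model.zero_grad()
    F.cross_entropy(model(x_ref), y_ref).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])
```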
Highlights
  • A significant step towards artificial general intelligence (AGI) is to enable the learning agent to acquire the ability to remember past experiences while being trained on a continuum of tasks [3, 4, 5].
  • Current deep neural networks are capable of achieving remarkable performance on a single task [6]
  • We show that both Gradient Episodic Memory (GEM) [1] and Averaged Gradient Episodic Memory (A-GEM) [2] are degenerate cases of Mixed Stochastic Gradient (MEGA)-I and MEGA-II, in which the same emphasis is consistently put on the current task regardless of how the loss changes over time.
  • (1) We present the first unified view of current episodic memory-based lifelong learning methods, including GEM [1] and A-GEM [2]. (2) From the presented unified view, we propose two different schemes, called MEGA-I and MEGA-II, for lifelong learning problems.
  • Based on the unified view, we propose two improved schemes called MEGA-I and MEGA-II (a sketch of the shared update form follows this list).
  • Extensive experimental results show that the proposed MEGA-I and MEGA-II achieve superior performance, significantly advancing the state-of-the-art on several standard benchmarks
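As a companion to the highlights above, here is a hedged sketch of the shared update form: a step along α1·g + α2·g_ref, where g is the current-task gradient and g_ref is the gradient on the episodic memory. Setting α1(w) = α2(w) = 1 gives the constant-emphasis degenerate case listed in Table 1; the loss-ratio weighting used when adaptive=True is only an illustrative assumption of how the emphasis could track ℓ_t and ℓ_ref, not the paper's exact MEGA-I/MEGA-II rules. The function name mixed_step and all hyperparameters are hypothetical.

```python
# Hedged sketch of the unified view: w <- w - lr * (alpha1 * g + alpha2 * g_ref).
# alpha1 = alpha2 = 1 reproduces the constant-emphasis degenerate case from
# Table 1; the adaptive branch is an illustrative assumption, not the paper's rule.
import torch

def mixed_step(params, g, g_ref, loss_t, loss_ref, lr=0.1, adaptive=True, eps=1e-8):
    if adaptive:
        # Assumption for illustration: emphasize the memory gradient more
        # when the reference loss is large relative to the current-task loss.
        alpha1, alpha2 = 1.0, float(loss_ref) / (float(loss_t) + eps)
    else:
        # Degenerate case: the same emphasis regardless of how the losses evolve.
        alpha1, alpha2 = 1.0, 1.0
    direction = alpha1 * g + alpha2 * g_ref   # g, g_ref: flattened gradients
    offset = 0
    with torch.no_grad():
        for p in params:                      # write the step back parameter by parameter
            n = p.numel()
            p -= lr * direction[offset:offset + n].view_as(p)
            offset += n
```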
Methods
  • The authors observe that MEGA-II (ℓ_t = ℓ_ref) outperforms A-GEM on all the datasets except Split CIFAR.
  • This demonstrates the benefit of the proposed approach of rotating the current gradient (an illustrative rotation sketch follows this list).
  • By considering the loss information as in MEGA-II, the authors further improve the results on all the datasets.
  • This shows that both components contribute to the improvements of the proposed schemes.
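To make "rotating the current gradient" concrete, the sketch below rotates g by an angle theta toward g_ref inside the 2-D plane spanned by the two gradients while keeping the magnitude of g. This is only a geometric illustration: how MEGA-II actually chooses the angle from the losses ℓ_t and ℓ_ref is not reproduced here, and the function name rotate_toward is hypothetical.

```python
# Geometric illustration of rotating the current gradient toward the
# reference gradient. theta is left free here; in MEGA-II the angle is
# derived from the losses, which this sketch does not attempt to reproduce.
import math
import torch

def rotate_toward(g, g_ref, theta, eps=1e-12):
    u1 = g / (g.norm() + eps)              # unit vector along the current gradient
    r = g_ref - (g_ref @ u1) * u1          # component of g_ref orthogonal to g
    if r.norm() < eps:                     # gradients (anti-)parallel: nothing to rotate toward
        return g.clone()
    u2 = r / r.norm()                      # in-plane direction pointing toward g_ref
    # Rotate within span{g, g_ref}, preserving the magnitude of g.
    return g.norm() * (math.cos(theta) * u1 + math.sin(theta) * u2)
```

With theta = 0 this recovers plain SGD on the current task; increasing theta tilts the update toward the episodic-memory (reference) direction.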
Results
  • 6.2.1 MEGA vs. Baselines

    In Fig. 1 the authors show the results across different measures on all the benchmark datasets.
  • As the authors show in the memory comparison, PROG-NN is very memory-inefficient: it allocates a new network for each task, so the number of parameters grows super-linearly with the number of tasks (a rough illustration of this growth follows this list).
  • This becomes problematic when large networks are being used.
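The back-of-the-envelope sketch below illustrates the super-linear growth: in a progressive network, each new column adds lateral connections from every previous column, so the parameters added per task grow with the number of tasks already seen. The layer sizes and the accounting in prog_nn_params are hypothetical and only meant to show the trend.

```python
# Rough illustration (hypothetical layer sizes): each new PROG-NN column
# carries its own weights plus lateral adapters from all previous columns,
# so the total parameter count grows roughly quadratically in the number of tasks.
def prog_nn_params(num_tasks, in_dim=784, hidden=256, out_dim=10, layers=2):
    total = 0
    for t in range(num_tasks):
        column = in_dim * hidden + (layers - 1) * hidden * hidden + hidden * out_dim
        lateral = t * (layers - 1) * hidden * hidden   # adapters from the t earlier columns
        total += column + lateral
    return total

if __name__ == "__main__":
    for T in (1, 5, 10, 20):
        print(T, prog_nn_params(T))   # grows much faster than linearly in T
```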
Conclusion
  • The authors cast the lifelong learning problem as an optimization problem with a composite objective, which provides a unified view covering current episodic memory-based lifelong learning algorithms.
  • Based on the unified view, the authors propose two improved schemes called MEGA-I and MEGA-II.
  • Extensive experimental results show that the proposed MEGA-I and MEGA-II achieve superior performance, significantly advancing the state-of-the-art on several standard benchmarks
Tables
  • Table 1: Comparison of MEGA-I, MEGA-I (α_1(w) = 1, α_2(w) = 1), MEGA-II, MEGA-II (ℓ_t = ℓ_ref) and A-GEM
Related work
  • Several lifelong learning methods [25, 26] and evaluation protocols [27, 28] have been proposed recently. We categorize the methods into different types based on their methodology. Regularization-based approaches: EWC [4] adopted the Fisher information matrix to prevent important weights for old tasks from changing drastically (a minimal EWC-style sketch follows this paragraph). In PI [21], the authors introduced intelligent synapses and endowed each individual synapse with a local measure of “importance” to prevent old memories from being overwritten. RWALK [22] utilized a KL-divergence-based regularization for preserving knowledge of old tasks. In MAS [29], the importance measure for each parameter of the network was computed based on how sensitive the predicted output function is to a change in that parameter. [30] extended MAS for task-free continual learning. In [31], an approximation of the Hessian was employed to approximate the posterior after every task. Uncertainty measures were also used to avoid catastrophic forgetting [32]. [33] proposed approximate Bayesian methods that recursively approximate the posterior of the given data.
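To ground the regularization-based family, here is a minimal EWC-style sketch, assuming PyTorch, a diagonal empirical Fisher estimate, and the hypothetical helpers diagonal_fisher and ewc_penalty; it is a simplified illustration of the idea in [4], not the reference implementation.

```python
# Minimal EWC-style sketch (assumptions: PyTorch; a diagonal empirical Fisher;
# hypothetical helper names). After finishing task A, store a copy of the
# parameters and a Fisher estimate; while training task B, add a quadratic
# penalty that keeps important weights close to their task-A values.
import torch
import torch.nn.functional as F

def diagonal_fisher(model, loader):
    """Average squared gradients of the log-likelihood, one entry per parameter."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in loader:
        model.zero_grad()
        F.nll_loss(F.log_softmax(model(x), dim=1), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2 / len(loader)
    return fisher

def ewc_penalty(model, old_params, fisher, lam=100.0):
    """lam/2 * sum_i F_i * (theta_i - theta_A_i)^2, added to the task-B loss."""
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty
```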
Funding
  • This work was supported in part by CRISP, one of six centers in JUMP, an SRC program sponsored by DARPA
  • This work is also supported by NSF CHASE-CI #1730158, NSF FET #1911095, NSF CC* NPEO #1826967, NSF #1933212, NSF CAREER Award #1844403
  • The paper was also funded in part by SRC AIHW grants
Figures
  • Evolution of average accuracy during the lifelong learning process.
  • LCA of the first ten mini-batches on different datasets.
  • The average accuracy and execution time when the number of examples is limited.

Reference
  • David Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
  • Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420, 2018.
  • Robert M French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
  • James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
  • Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285, 1990.
  • Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, 2016.
  • Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
  • Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. arXiv preprint arXiv:1904.00310, 2019.
  • Cuong V Nguyen, Alessandro Achille, Michael Lam, Tal Hassner, Vijay Mahadevan, and Stefano Soatto. Toward understanding catastrophic forgetting in continual learning. arXiv preprint arXiv:1908.01091, 2019.
  • Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
  • Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. arXiv preprint arXiv:1910.07104, 2019.
  • David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Advances in Neural Information Processing Systems, pages 348–358, 2019.
  • Michalis K Titsias, Jonathan Schwarz, Alexander G de G Matthews, Razvan Pascanu, and Yee Whye Teh. Functional regularisation for continual learning using Gaussian processes. arXiv preprint arXiv:1901.11356, 2019.
  • Tameem Adel, Han Zhao, and Richard E Turner. Continual learning with adaptive weights (CLAW). arXiv preprint arXiv:1911.09514, 2019.
  • German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
  • Sebastian Thrun and Tom M Mitchell. Lifelong robot learning. Robotics and Autonomous Systems, 15(1-2):25–46, 1995.
  • Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.
  • Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pages 4652–4662, 2017.
  • Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159, 2018.
  • Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
  • Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3987–3995. JMLR.org, 2017.
  • Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–547, 2018.
  • Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
  • Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.
  • Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383, 2019.
  • Ronald Kemker, Marc McClure, Angelina Abitino, Tyler L Hayes, and Christopher Kanan. Measuring catastrophic forgetting in neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Sebastian Farquhar and Yarin Gal. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018.
  • Tyler L Hayes, Ronald Kemker, Nathan D Cahill, and Christopher Kanan. New metrics and experimental paradigms for continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2031–2034, 2018.
  • Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.
  • Rahaf Aljundi, Klaas Kelchtermans, and Tinne Tuytelaars. Task-free continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11254–11263, 2019.
  • Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured Laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems, pages 3738–3748, 2018.
  • Sayna Ebrahimi, Mohamed Elhoseiny, Trevor Darrell, and Marcus Rohrbach. Uncertainty-guided continual learning with Bayesian neural networks. arXiv preprint arXiv:1906.02425, 2019.
  • Sebastian Farquhar and Yarin Gal. A unifying Bayesian view of continual learning. arXiv preprint arXiv:1902.06494, 2019.
  • Kibok Lee, Kimin Lee, Jinwoo Shin, and Honglak Lee. Overcoming catastrophic forgetting with unlabeled data in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 312–321, 2019.
  • Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8):364–372, 2019.
  • Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
  • Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. arXiv preprint arXiv:1703.01988, 2017.
  • Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065, 2016.
  • Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. In Advances in Neural Information Processing Systems, pages 11816–11825, 2019.
  • Xu He and Herbert Jaeger. Overcoming catastrophic interference using conceptor-aided backpropagation. In International Conference on Learning Representations, 2018.
  • Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pages 527–538, 2018.
  • Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491, 2018.
  • Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. arXiv preprint arXiv:1711.02257, 2017.
  • Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. arXiv preprint arXiv:2001.06782, 2020.
  • Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(Mar):551–585, 2006.
  • Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
  • Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1126–1135. JMLR.org, 2017.
  • Yunhui Guo, Noel CF Codella, Leonid Karlinsky, John R Smith, Tajana Rosing, and Rogerio Feris. A new benchmark for evaluation of cross-domain few-shot learning. arXiv preprint arXiv:1912.07200, 2019.
  • Cyprien de Masson d’Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. Episodic memory in lifelong language learning. In Advances in Neural Information Processing Systems, pages 13122–13131, 2019.
  • Yann LeCun, Corinna Cortes, and Christopher JC Burges. The MNIST database. URL http://yann.lecun.com/exdb/mnist, 1998.
  • Alex Krizhevsky et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
  • Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 951–958. IEEE, 2009.