
Continual Learning in Low-rank Orthogonal Subspaces

NeurIPS 2020


Abstract

In continual learning (CL), a learner is faced with a sequence of tasks, arriving one after the other, and the goal is to remember all the tasks once the continual learning experience is finished. The prior art in CL uses episodic memory, parameter regularization or extensible network structures to reduce interference among tasks, but …

Introduction
  • In continual learning, a learner experiences a sequence of tasks with the objective of remembering all or most of the observed tasks, so as to speed up the transfer of knowledge to future tasks.
  • This setting poses several challenges, chief among them catastrophic forgetting [McCloskey and Cohen, 1989], whereby the global update of model parameters on the present task interferes with the learned representations of past tasks.
  • Modular approaches [Rusu et al., 2016, Lee et al., 2017] add network components as new tasks arrive.
  • These approaches rely on knowledge of the correct module selection at test time.
  • The authors assume that a task descriptor, identifying the correct classification head, is given at both train and test time.
Highlights
  • In continual learning, a learner experiences a sequence of tasks with the objective of remembering all or most of the observed tasks, so as to speed up the transfer of knowledge to future tasks.
  • Our goal is to estimate a predictor f = (w ◦ Φ) : X × T → Y, composed of a feature extractor Φ_Θ : X → H, an L-layer feed-forward neural network parameterized by Θ = {W_1, …, W_L}, and a classifier w_θ : H → Y, that minimizes the multi-task error (a minimal sketch of this composition follows this list).
  • We report experiments on continual learning benchmarks in classification tasks.
  • We presented ORTHOG-SUBSPACE, a continual learning method that learns different tasks in orthogonal subspaces.
  • Continual learning methods like the one we propose allow machine learning models to efficiently learn on new data without requiring constant retraining on previous data
  • When the orthogonality is ensured by learning on a Stiefel manifold, the model achieves the best performance both in terms of accuracy and forgetting
  • A machine learning practitioner should be aware of this fact and use continual learning approaches only when suitable
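
To make the composition above concrete, here is a minimal sketch of a model that routes each task through its own low-rank subspace of the shared feature space. The class and parameter names, the layer sizes, and the choice of a fixed random orthonormal basis are illustrative assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class OrthogonalSubspaceNet(nn.Module):
    """f = (w o Phi): a shared feature extractor Phi followed by a task-specific
    projection onto a low-rank subspace and a per-task classification head w."""

    def __init__(self, in_dim, feat_dim, subspace_dim, num_tasks, num_classes):
        super().__init__()
        assert num_tasks * subspace_dim <= feat_dim, "subspaces must fit inside the feature space"
        self.subspace_dim = subspace_dim
        self.phi = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim), nn.ReLU())
        # One random orthonormal basis, split into disjoint (hence mutually orthogonal)
        # blocks of `subspace_dim` columns: one block per task.
        basis, _ = torch.linalg.qr(torch.randn(feat_dim, num_tasks * subspace_dim))
        self.register_buffer("basis", basis)            # fixed, not trained
        self.heads = nn.ModuleList(nn.Linear(subspace_dim, num_classes)
                                   for _ in range(num_tasks))

    def forward(self, x, task_id):
        h = self.phi(x)                                          # shared features
        B = self.basis[:, task_id * self.subspace_dim:
                          (task_id + 1) * self.subspace_dim]     # this task's subspace
        z = h @ B                                                # project into the subspace
        return self.heads[task_id](z)                            # task-specific classifier
```

The task descriptor mentioned above corresponds to task_id here: it selects both the projection block and the classification head.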
Methods
  • Table 1 lists, for each method, whether an episodic memory is used (MEMORY) and the resulting Accuracy and Forgetting; the baselines include FINETUNE, EWC [Kirkpatrick et al., 2016], VCL [Nguyen et al., 2018], and VCL-RANDOM [Nguyen et al., 2018].
Results
  • Tab. 1 shows the overall results on all benchmarks. First, the authors observe that on relatively shallower networks (the MNIST benchmarks), even without memory and without preserving orthogonality during network training, ORTHOG-SUBSPACE outperforms the strong memory-based baselines by a large margin.
  • Tab. 2 systematically evaluates the three components of ORTHOG-SUBSPACE (PROJECTION, ER, i.e. the episodic memory, and STIEFEL orthogonalization); each row corresponds to a different combination of these components, with two Accuracy/Forgetting pairs reported per row:

    ACCURACY        FORGETTING      ACCURACY        FORGETTING
    50.3 (±2.21)    0.21 (±0.02)    40.1 (±2.16)    0.20 (±0.02)
    59.6 (±1.19)    0.14 (±0.01)    49.8 (±2.92)    0.12 (±0.01)
    61.2 (±1.84)    0.10 (±0.01)    49.5 (±2.21)    0.11 (±0.01)
    64.3 (±0.59)    0.07 (±0.01)    51.4 (±1.44)    0.10 (±0.01)

  • [Figure: histograms of the inner product between task and memory gradients (x-axis from 0.00 to 0.20), comparing "No Orthogonality" with "Stiefel (Orthogonality)".]
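
The histogram described above can be reproduced, in spirit, with a short diagnostic that measures how much the current task's gradient interferes with the gradient computed on the episodic memory. This is a generic sketch (model, loss_fn, and the batches are placeholders; any PyTorch model and loss work), not the authors' evaluation code:

```python
import torch

def grad_inner_product(model, loss_fn, task_batch, memory_batch):
    """Inner product <g_task, g_memory> over all trainable parameters.

    Values near zero mean that the update for the current task barely moves the
    parameters in directions that matter for the memory (i.e., for past tasks).
    """
    params = [p for p in model.parameters() if p.requires_grad]

    x_t, y_t = task_batch
    g_task = torch.autograd.grad(loss_fn(model(x_t), y_t), params)

    x_m, y_m = memory_batch
    g_mem = torch.autograd.grad(loss_fn(model(x_m), y_m), params)

    return sum((gt * gm).sum() for gt, gm in zip(g_task, g_mem)).item()
```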
Conclusion
  • The authors presented ORTHOG-SUBSPACE, a continual learning method that learns different tasks in orthogonal subspaces.
  • Continual learning methods like the one the authors propose allow machine learning models to efficiently learn on new data without requiring constant retraining on previous data.
  • This type of learning can be useful when the model is expected to perform in multiple environments and a simultaneous retraining on all the environments is not feasible.
  • A machine learning practitioner should be aware of this fact and use continual learning approaches only when suitable
Tables
  • Table1: Accuracy (2) and Forgetting (3) results of continual learning experiments. When used, episodic memories contain up to one example per class per task. Last row is a multi-task oracle baseline
  • Table2: Systematic evaluation of Projection, Memory and Orthogonalization in ORTHOG-SUBSPACE
  • Table3: Accuracy (2) and Forgetting (3) results of continual learning experiments for larger episodic memory sizes. 2, 3 and 5 samples per class per task are stored, respectively. Top table is for Split CIFAR. Bottom table is for Split miniImageNet
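
For reference, the Accuracy and Forgetting metrics cited in the captions above are, in the form standard in this line of work (e.g., Chaudhry et al. [2018, 2019a]), defined as follows; these are the usual definitions and are assumed here to correspond to Eqs. (2) and (3) of the paper:

```latex
% a_{j,i}: test accuracy on task i after the model has finished training on task j.
% Average accuracy after all T tasks (assumed Eq. 2):
A_T = \frac{1}{T} \sum_{i=1}^{T} a_{T,i}
% Average forgetting (assumed Eq. 3): drop from the best accuracy ever reached on
% task i to its accuracy at the end of training, averaged over the first T-1 tasks:
F_T = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( \max_{j \in \{1,\dots,T-1\}} a_{j,i} - a_{T,i} \right)
```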
Related work
  • In continual learning [Ring, 1997], also called lifelong learning [Thrun, 1998], a learner faces a sequence of tasks without storing the complete datasets of these tasks. This is in contrast to multitask learning [Caruana, 1997], where the learner can simultaneously access data from all tasks. The main challenge in continual learning is to avoid catastrophic forgetting [McCloskey and Cohen, 1989, McClelland et al., 1995, Goodfellow et al., 2013] on already seen tasks so that the learner is able to learn new tasks quickly. The existing literature on continual learning can be broadly divided into three categories.

    First, regularization approaches reduce the drift in parameters important for past tasks [Kirkpatrick et al., 2016, Aljundi et al., 2018, Nguyen et al., 2018, Zenke et al., 2017]. For a large number of tasks, the parameter-importance measures suffer from brittleness, as the locality assumption embedded in regularization-based approaches is violated [Titsias et al., 2019]. Furthermore, Chaudhry et al. [2019a] showed that these approaches can only be effective when the learner can perform multiple passes over the datasets of each task – a scenario not assumed in this work. Second, modular approaches use different network modules that can be extended for each new task [Fernando et al., 2017, Aljundi et al., 2017, Rosenbaum et al., 2018, Chang et al., 2018, Xu and Zhu, 2018, Alet et al., 2018]. By construction, modular approaches have zero forgetting, but their memory requirements increase with the number of tasks [Rusu et al., 2016, Lee et al., 2017]. Third, memory approaches maintain and replay a small episodic memory of data from past tasks. In some of these methods [Li and Hoiem, 2016, Rebuffi et al., 2017], examples in the episodic memory are replayed and predictions are kept invariant by means of distillation [Hinton et al., 2014]. In other approaches [Lopez-Paz and Ranzato, 2017, Chaudhry et al., 2019a, Aljundi et al., 2019], the episodic memory is used as an optimization constraint that discourages increases in the loss on past tasks. Some works [Hayes et al., 2018, Riemer et al., 2019, Rolnick et al., 2018, Chaudhry et al., 2019b, 2020] have shown that directly optimizing the loss on the episodic memory, also known as experience replay, is cheaper than constraint-based approaches and improves prediction performance. Recently, Prabhu et al. [2020] showed that training at test time, using a greedily stored, class-balanced episodic memory, improved performance on a variety of benchmarks. Similarly, Javed and White [2019] and Beaulieu et al. [2020] showed that learning transferable representations via meta-learning reduces forgetting when the model is trained on sequential tasks.
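
As a concrete illustration of experience replay as described above, the following is a minimal sketch: a small reservoir-sampled buffer plus one optimization step that adds the memory loss to the current-task loss. The buffer size, sampling scheme, and loss are illustrative assumptions, not the specific algorithm of any cited paper:

```python
import random
import torch
import torch.nn.functional as F

class EpisodicMemory:
    """Tiny reservoir-sampled buffer holding a few examples from past tasks."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, x, y):
        for xi, yi in zip(x, y):
            self.seen += 1
            if len(self.data) < self.capacity:
                self.data.append((xi, yi))
            else:                               # reservoir sampling keeps an unbiased sample
                j = random.randrange(self.seen)
                if j < self.capacity:
                    self.data[j] = (xi, yi)

    def sample(self, batch_size):
        batch = random.sample(self.data, min(batch_size, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)

def replay_step(model, optimizer, x, y, memory, batch_size=10):
    """One experience-replay update: loss on the current batch plus loss on a memory batch."""
    loss = F.cross_entropy(model(x), y)
    if memory.data:
        x_m, y_m = memory.sample(batch_size)
        loss = loss + F.cross_entropy(model(x_m), y_m)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    memory.add(x, y)                            # store the current batch for future replay
```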
Funding
  • This work was supported by EPSRC/MURI grant EP/N019474/1, Facebook (DeepFakes grant), Five AI UK, and the Royal Academy of Engineering under the Research Chair and Senior Research Fellowships scheme
  • AC is funded by the Amazon Research Award (ARA) program
Study subjects and analysis
samples: 10000
Rotated MNIST is another variant of MNIST, where each task applies a fixed random image rotation (between 0 and 180 degrees) to the original dataset. Both MNIST benchmarks contain 23 tasks, each with 10000 samples from 10 different classes. Split CIFAR is a variant of the CIFAR-100 dataset [Krizhevsky and Hinton, 2009, Zenke et al., 2017], where each task contains the data pertaining to 5 random classes (without replacement) out of the total 100 classes.
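
To make the benchmark construction concrete, here is a minimal sketch using torchvision. The exact rotation sampling, class partitioning, and the per-task subsampling needed to match the stated sample counts are assumptions for illustration, not the authors' preprocessing code:

```python
import random
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import Subset

def rotated_mnist_tasks(num_tasks=23, data_root="./data"):
    """Each task applies one fixed random rotation (0-180 degrees) to all of MNIST.
    (Per-task subsampling to 10000 examples is omitted for brevity.)"""
    tasks = []
    for _ in range(num_tasks):
        angle = random.uniform(0.0, 180.0)
        transform = T.Compose([
            T.RandomRotation(degrees=(angle, angle)),   # fixed angle for the whole task
            T.ToTensor(),
        ])
        tasks.append(torchvision.datasets.MNIST(data_root, train=True,
                                                download=True, transform=transform))
    return tasks

def split_cifar_tasks(num_tasks=20, classes_per_task=5, data_root="./data"):
    """Each task holds the data of 5 classes sampled without replacement from CIFAR-100."""
    dataset = torchvision.datasets.CIFAR100(data_root, train=True,
                                            download=True, transform=T.ToTensor())
    classes = list(range(100))
    random.shuffle(classes)
    tasks = []
    for t in range(num_tasks):
        task_classes = set(classes[t * classes_per_task:(t + 1) * classes_per_task])
        indices = [i for i, y in enumerate(dataset.targets) if y in task_classes]
        tasks.append(Subset(dataset, indices))
    return tasks
```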

samples: 250
Similar to Split CIFAR, in Split miniImageNet each task contains the data from 5 random classes (without replacement) out of the total 100 classes. Both CIFAR-100 and miniImageNet contain 20 tasks, each with 250 samples from each of the 5 classes. Similar to Chaudhry et al [2019a], for each benchmark, the first 3 tasks are used for hyper-parameter tuning (grids are available in Appendix D)

samples: 5
Accuracy and Forgetting results for larger episodic memory sizes are given in Table 3, where 2, 3 and 5 samples per class per task are stored, respectively (top table: Split CIFAR; bottom table: Split miniImageNet). The method figure (ORTHOG-SUBSPACE) shows each blob, with its three ellipses, representing a vector space and its subspaces at a certain layer: the projection operator in layer L keeps the subspaces orthogonal (no overlap), and the overlap in the intermediate layers is minimized when the weight matrices are learned on the Stiefel manifold.
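
The Stiefel-manifold constraint mentioned in the figure description (each weight matrix W kept orthonormal, W^T W = I) can be maintained with a Cayley-transform retraction in the style of Li et al. [2020]. The snippet below is a self-contained sketch of one such update step, not the authors' implementation:

```python
import torch

def cayley_sgd_step(W, grad, lr=0.1):
    """One SGD step that keeps W (n x p with orthonormal columns, W^T W = I)
    on the Stiefel manifold.

    A = grad @ W^T - W @ grad^T is skew-symmetric, so the Cayley transform
    Q = (I + lr/2 * A)^{-1} (I - lr/2 * A) is orthogonal, and Q @ W keeps
    orthonormal columns while moving roughly along the projected negative gradient.
    """
    n = W.shape[0]
    A = grad @ W.t() - W @ grad.t()                      # skew-symmetric direction
    eye = torch.eye(n, dtype=W.dtype, device=W.device)
    Q = torch.linalg.solve(eye + 0.5 * lr * A, eye - 0.5 * lr * A)
    return Q @ W

# Usage sketch: orthonormality is preserved after the update.
W, _ = torch.linalg.qr(torch.randn(8, 4, dtype=torch.float64))  # 8 x 4, W^T W = I
grad = torch.randn_like(W)                                      # stand-in for a loss gradient
W_new = cayley_sgd_step(W, grad, lr=0.1)
print(torch.allclose(W_new.t() @ W_new,
                     torch.eye(4, dtype=torch.float64), atol=1e-10))  # True
```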

Reference
  • P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
  • R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. In CVPR, pages 7120–7129, 2017.
  • R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In ECCV, 2018.
  • R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio. Online continual learning with no task boundaries. arXiv preprint arXiv:1903.08671, 2019.
  • M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.
  • S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
  • S. Beaulieu, L. Frati, T. Miconi, J. Lehman, K. O. Stanley, J. Clune, and N. Cheney. Learning to continually learn. arXiv preprint arXiv:2002.09571, 2020.
  • S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
  • R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
  • M. Chang, A. Gupta, S. Levine, and T. L. Griffiths. Automatically composing representation transformations as a means for generalization. In ICML workshop Neural Abstract Machines and Program Induction v2, 2018.
  • A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In ECCV, 2018.
  • A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with A-GEM. In ICLR, 2019a.
  • A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, and M. Ranzato. Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486, 2019b.
  • A. Chaudhry, A. Gordo, P. K. Dokania, P. Torr, and D. Lopez-Paz. Using hindsight to anchor past knowledge in continual learning. arXiv preprint arXiv:2002.08165, 2020.
  • M. Farajtabar, N. Azizan, A. Mott, and A. Li. Orthogonal gradient descent for continual learning. arXiv preprint arXiv:1910.07104, 2019.
  • C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
  • F. Alet, T. Lozano-Perez, and L. P. Kaelbling. Modular meta-learning. arXiv preprint arXiv:1806.10166v1, 2018.
  • I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
  • T. L. Hayes, N. D. Cahill, and C. Kanan. Memory efficient experience replay for streaming learning. arXiv preprint arXiv:1809.05922, 2018.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS, 2014.
  • L. Huang, X. Liu, B. Lang, A. W. Yu, Y. Wang, and B. Li. Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • D. Isele and A. Cosgun. Selective experience replay for lifelong learning. arXiv preprint arXiv:1802.10269, 2018.
  • K. Javed and M. White. Meta-learning representations for continual learning. In Advances in Neural Information Processing Systems, pages 1820–1830, 2019.
  • K. Jia, S. Li, Y. Wen, T. Liu, and D. Tao. Orthogonal deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • L. Jing, Y. Shen, T. Dubcek, J. Peurifoy, S. Skirlo, Y. LeCun, M. Tegmark, and M. Soljacic. Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In Proceedings of the 34th International Conference on Machine Learning, pages 1733–1741, 2017.
  • J. Kirkpatrick, R. Pascanu, N. C. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming catastrophic forgetting in neural networks. PNAS, 2016.
  • A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/kriz/cifar.html, 2009.
  • Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • J. Lee, J. Yun, S. Hwang, and E. Yang. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.
  • J. Li, F. Li, and S. Todorovic. Efficient Riemannian optimization on the Stiefel manifold via the Cayley transform. In International Conference on Learning Representations, 2020.
  • Z. Li and D. Hoiem. Learning without forgetting. In ECCV, pages 614–629, 2016.
  • D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continuum learning. In NIPS, 2017.
  • J. L. McClelland, B. L. McNaughton, and R. C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419, 1995.
  • M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989.
  • C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational continual learning. In ICLR, 2018.
  • A. Nichol and J. Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.
  • Y. Nishimori and S. Akaho. Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing, 67:106–135, 2005.
  • A. Prabhu, P. H. S. Torr, and P. K. Dokania. GDumb: A simple approach that questions our progress in continual learning. In ECCV, 2020.
  • S.-V. Rebuffi, A. Kolesnikov, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, 2017.
  • M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In ICLR, 2019.
  • M. B. Ring. Child: A first step towards continual learning. Machine Learning, 28(1):77–104, 1997.
  • D. Rolnick, A. Ahuja, J. Schwarz, T. P. Lillicrap, and G. Wayne. Experience replay for continual learning. CoRR, abs/1811.11682, 2018. URL http://arxiv.org/abs/1811.11682.
  • C. Rosenbaum, T. Klinger, and M. Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. In ICLR, 2018.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
  • T. Serra, C. Tjandraatmadja, and S. Ramalingam. Bounding and counting linear regions of deep neural networks. arXiv preprint arXiv:1711.02114, 2017.
  • H. D. Tagare. Notes on optimization on Stiefel manifolds. Technical report, Yale University, 2011.
  • S. Thrun. Lifelong learning algorithms. In Learning to Learn, pages 181–209. Springer, 1998.
  • M. K. Titsias, J. Schwarz, A. G. d. G. Matthews, R. Pascanu, and Y. W. Teh. Functional regularisation for continual learning using Gaussian processes. arXiv preprint arXiv:1901.11356, 2019.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.
  • S. Wisdom, T. Powers, J. Hershey, J. Le Roux, and L. Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4880–4888, 2016.
  • J. Xu and Z. Zhu. Reinforced continual learning. arXiv preprint arXiv:1805.12369v1, 2018.
  • F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In ICML, 2017.
Author
Arslan Chaudhry
Naeemullah Khan
Puneet Dokania