Continual Learning with Adaptive Weights (CLAW)

ICLR 2020

Abstract:

Approaches to continual learning aim to successfully learn a set of related tasks that arrive in an online manner. Recently, several frameworks have been developed which enable deep learning to be deployed in this learning scenario. A key modelling decision is to what extent the architecture should be shared across tasks. On the one hand, ...

Highlights
  • Continual learning (CL), sometimes called lifelong or incremental learning, refers to an online framework where the knowledge acquired from learning tasks in the past is kept and accumulated so that it can be reused in the present and future
  • We introduce our framework as an expansion of the variational continual learning (VCL) algorithm (Nguyen et al., 2018), whose variational and sequential Bayesian nature makes it convenient for our modelling and architecture adaptation procedure (a standard form of the VCL update is recalled after this list)
  • From Figure 3, we can see that Continual Learning with Adaptive Weights achieves state-of-the-art results in 4 out of the 5 experiments in terms of avoiding negative transfer
  • We provide more details of related work in Section A in the Appendix. (a) A complementary approach to Continual Learning with Adaptive Weights is the regularisation-based approach to balancing adaptability with catastrophic forgetting: a level of stability is kept by protecting parameters that greatly influence the prediction from radical changes, while allowing the rest of the parameters to change without restriction (Li & Hoiem, 2016; Lee et al., 2017; Zenke et al., 2017; Chaudhry et al., 2018; Kim et al., 2018; Nguyen et al., 2018; Srivastava et al., 2013; Schwarz et al., 2018; Vuorio et al., 2018; Aljundi et al., 2019c)
  • We introduced a continual learning framework which learns how to adapt its architecture from the tasks and data at hand, based on variational inference
Summary
  • Continual learning (CL), sometimes called lifelong or incremental learning, refers to an online framework where the knowledge acquired from learning tasks in the past is kept and accumulated so that it can be reused in the present and future.
  • We propose a framework where the architecture, whose parameters are θ, is flexibly adapted based on the available tasks, via a learning procedure that will be described below.
  • We illustrate how the proposed model performs this adaptation by learning the probabilistic contributions of the different neurons within the network architecture on a task-by-task basis.
  • By minimising (11) with respect to p_{i,j} and using samples from the respective distributions to assign values to α_{i,j}, adapted contributions of each neuron j at each layer i of the network are learnt per task (an illustrative sketch of such per-neuron gating is given after this list).
  • The algorithmic complexity of a single joint update of the parameters θ based on the additive terms in (12) is O(MELD^2), where L is the number of layers in the network, D is the number of neurons within a single layer, E is the number of samples taken from the random noise variable ε, and M is the minibatch size.
  • Our experiments mainly aim at evaluating the following: (i) the overall performance of the introduced CLAW, depicted by the average classification accuracy over all the tasks; (ii) the extent to which catastrophic forgetting can be mitigated when deploying CLAW; and (iii) the achieved degree of positive forward transfer.
  • The experiments demonstrate the effectiveness of CLAW in achieving state-of-the-art continual learning results measured by classification accuracy and by the achieved reduction in catastrophic forgetting.
  • An empirical conclusion that can be drawn from this and the previous experiment is that CLAW achieves better overall continual learning results, partially thanks to the way it addresses catastrophic forgetting.
  • The idea of adapting the architecture by adapting the contributions of the neurons of each layer works well with datasets like Omniglot and CIFAR-100, giving directions for imminent future work where CLAW can be extended to other application areas based on CNNs.
  • CLAW can also be seen as a combination of a regularisation-based approach and a modelling approach which automates the architecture-building process in a data-driven manner, avoiding the overhead resulting from either storing or generating data points from previous tasks.
  • We introduced a continual learning framework which learns how to adapt its architecture from the tasks and data at hand, based on variational inference.
  • Results of six different experiments on five datasets demonstrate the strong empirical performance of the introduced framework, in terms of the average overall continual learning accuracy and forward transfer, and in terms of effectively alleviating catastrophic forgetting.
Tables
  • Table 1: Average test classification accuracy of the last two tasks in each of the six experiments: Permuted MNIST, Split MNIST, Split notMNIST, Split Fashion-MNIST, Omniglot and CIFAR-100, followed by the corresponding standard error. A bold entry denotes that the classification accuracy of an algorithm is significantly higher than that of its competitors. Significance is assessed using a paired t-test at p = 0.05 (a minimal sketch of such a test is given after this list). The average classification accuracy resulting from CLAW is significantly higher than that of its competitors in all six experiments
  • Table 2: Wall-clock run time (in seconds) after finishing training in each of the six experiments: Permuted MNIST, Split MNIST, Split notMNIST, Split Fashion-MNIST, Omniglot and CIFAR-100. As mentioned earlier, the statistics reported are averages over ten repetitions
Related work
  • We briefly discuss three related approaches to continual learning: (a) regularisation-based, (b) architecture-based and (c) memory-based. We provide more details of related work in Section A in the Appendix.
  • (a) A complementary approach to CLAW is the regularisation-based approach to balancing adaptability with catastrophic forgetting: a level of stability is kept by protecting parameters that greatly influence the prediction from radical changes, while allowing the rest of the parameters to change without restriction (Li & Hoiem, 2016; Lee et al., 2017; Zenke et al., 2017; Chaudhry et al., 2018; Kim et al., 2018; Nguyen et al., 2018; Srivastava et al., 2013; Schwarz et al., 2018; Vuorio et al., 2018; Aljundi et al., 2019c). The elastic weight consolidation (EWC) algorithm by Kirkpatrick et al. (2017) is a seminal example, where a quadratic penalty is imposed on the difference between the parameter values of the old and new tasks (see the formula after this list). One limitation is the high level of hand tuning required.
  • (b) The architecture-based approach aims to deal with stability and adaptation issues via a fixed division of the architecture into global and local parts (Rusu et al., 2016b; Fernando et al., 2017; Shin et al., 2017; Kaplanis et al., 2018; Xu & Zhu, 2018; Yoon et al., 2018; Li et al., 2019b).
  • (c) The memory-based approach relies on episodic memory to store data (or pseudo-data) from previous tasks (Ratcliff, 1990; Robins, 1993; 1995; Thrun, 1996; Schmidhuber, 2013; Hattori, 2014; Mocanu et al., 2016; Rebuffi et al., 2017; Kamra et al., 2017; Shin et al., 2017; Rolnick et al., 2018; van de Ven & Tolias, 2018; Wu et al., 2018; Titsias et al., 2019). Limitations include the overheads of data storage, replay, and the optimisation needed to select (or generate) the stored points.
  • CLAW can also be seen as a combination of a regularisation-based approach (the variational inference mechanism) and a modelling approach which automates the architecture-building process in a data-driven manner, avoiding the overhead resulting from either storing or generating data points from previous tasks. CLAW is also orthogonal to (and simple to combine with, if needed) memory-based methods.
Funding
  • HZ acknowledges support from the DARPA XAI project, contract#FA87501720152 and an Nvidia GPU grant
  • RT acknowledges support by Google, Amazon, Improbable and EPSRC grants EP/M026957/1 and EP/L000776/1
Study subjects and analysis
Six experiments on five datasets:
The experiments demonstrate the effectiveness of CLAW in achieving state-of-the-art continual learning results measured by classification accuracy and by the achieved reduction in catastrophic forgetting. We also perform ablations in Section D in the Appendix which exhibit the relevance of each of the proposed adaptation parameters.

We perform six experiments on five datasets. The datasets in use are: MNIST (LeCun et al., 1998), notMNIST (Bulatov, 2011), Fashion-MNIST (Xiao et al., 2017), Omniglot (Lake et al., 2011) and CIFAR-100 (Krizhevsky & Hinton, 2009).

(3) The ability to combine our modelling and inference approaches without any significant augmentation of the architecture (no new neurons are needed). (4) State-of-the-art results in six experiments on five datasets, which demonstrate the effectiveness of our framework in terms of overall accuracy and reducing catastrophic forgetting.

Figures
  • Figure 1: Average test classification accuracy vs. the number of observed tasks in the six experiments. CLAW achieves significantly higher classification accuracy than the competing continual learning frameworks. Statistical significance values are presented in Section E in the Appendix. The value of λ for EWC is 10,000 in (c), and 100 in the other experiments. Best viewed in colour.
  • Figure 2: Evaluating catastrophic forgetting by measuring performance retention. Classification accuracy on the initial task is monitored as the tasks progress; results are displayed for five datasets. CLAW is the least forgetful algorithm, since the performance achieved on the initial task does not degrade as much as with the other methods after facing new tasks. The legend and λ values for EWC are the same as in Figure 1. Best viewed in colour.
  • Figure 3: Evaluating forward transfer, i.e. to what extent a continual learning framework can avoid negative transfer. The impact of learning previous tasks on a specific task (the last task) is inspected and used as a proxy for forward transfer: the performance achieved on that task is evaluated after learning a varying number of previous tasks. The value at x-axis = 1 thus refers to the accuracy on the last task after having learnt solely that one task, the value at 2 refers to the accuracy after having learnt two tasks (one additional previous task), and so on (a condensed sketch of this evaluation loop is given after this list). Overall, CLAW achieves state-of-the-art results in 4 out of the 5 experiments (and is on par in the fifth) in terms of avoiding negative transfer. Best viewed in colour.
