Coresets via Bilevel Optimization for Continual Learning and Streaming

NeurIPS 2020.

We propose a novel coreset construction via cardinality-constrained bilevel optimization.

Abstract:

Coresets are small data summaries that are sufficient for model training. They can be maintained online, enabling efficient handling of large data streams under resource constraints. However, existing constructions are limited to simple models such as k-means and logistic regression. In this work, we propose a novel coreset construction via cardinality-constrained bilevel optimization.

Introduction
  • More and more applications rely on predictive models that are learnt online. A crucial and, in general, open problem is how to reliably maintain accurate models as data arrives over time.
  • In the more challenging streaming setting, the data arrives sequentially and the notion of a task is not defined.
  • For such practically important settings, where data arrives in a non-i.i.d. manner, the performance of models can degrade arbitrarily.
  • This is especially problematic in the non-convex setting of deep learning, where this phenomenon is referred to as catastrophic forgetting [43, 22].
  • A crucial tool that allows them to apply first-order methods, by enabling the calculation of the gradient of G with respect to w, is the implicit function theorem applied to the first-order optimality condition of the inner problem (sketched below).
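
For context, the bilevel setup this refers to can be sketched as follows; the exact constraint set and inner regularization used by the authors may differ, so treat this as an illustrative reconstruction rather than the paper's precise formulation.

```latex
% Sketch of the cardinality-constrained bilevel coreset problem: the outer
% objective G evaluates the model trained on the weighted subset against the
% full dataset; the inner problem is weighted ERM with weights w.
\min_{w \ge 0,\ \|w\|_0 \le m} \; G(w) := \sum_{i=1}^{n} \ell\big(x_i, y_i;\, \theta^*(w)\big)
\quad \text{s.t.} \quad \theta^*(w) \in \arg\min_{\theta} L(\theta, w).

% The implicit function theorem, applied to the inner stationarity condition
% \nabla_\theta L(\theta^*(w), w) = 0, gives the Jacobian of \theta^*(w) and
% hence a first-order gradient of G with respect to w:
\frac{\partial \theta^*(w)}{\partial w}
  = -\left[\nabla_\theta^2 L\big(\theta^*(w), w\big)\right]^{-1}
     \nabla_w \nabla_\theta L\big(\theta^*(w), w\big),
\qquad
\nabla_w G(w)
  = \left(\frac{\partial \theta^*(w)}{\partial w}\right)^{\!\top}
    \nabla_\theta \sum_{i=1}^{n} \ell\big(x_i, y_i;\, \theta\big)\Big|_{\theta = \theta^*(w)}.
```
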
Highlights
  • More and more applications rely on predictive models that are learnt online
  • While the latter work uses a strategy similar to ours for sensor subset selection, we investigate the different setting of weighted data summarization for continual learning and streaming with neural networks
  • We demonstrate how our coreset construction can achieve significant performance gains in continual learning and in the more challenging streaming settings with neural networks
  • Whereas coresets outperform uniform sampling and active learning by a large margin, the RBF kernel is a surprisingly good proxy: it matches the performance of the convolutional neural tangent kernel (CNTK) and achieves above 97% test accuracy when trained on 250 chosen samples
  • We presented a novel framework for coreset generation based on bilevel optimization with cardinality constraints; a simplified selection sketch follows this list
  • Our method significantly outperforms reservoir sampling on average test accuracy
  • We showed that our method yields representative data summaries for neural networks and illustrated its advantages in alleviating catastrophic forgetting in continual learning and streaming with deep networks, where our coreset construction substantially outperforms existing summarization strategies
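
As referenced in the last highlight, here is a simplified, self-contained sketch of cardinality-constrained greedy coreset selection with a kernel proxy. The RBF kernel ridge regression inner solver, the candidate subsampling, and the "best marginal gain on the full data" selection rule are illustrative assumptions standing in for the bilevel machinery, not the authors' exact algorithm.

```python
"""Simplified greedy, cardinality-constrained coreset selection with a kernel proxy."""
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    # Pairwise squared distances -> RBF (Gaussian) kernel matrix.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)


def outer_loss(coreset_idx, X, Y, gamma=1.0, ridge=1e-3):
    """Fit kernel ridge regression on the coreset, evaluate squared error on all data."""
    Xs, Ys = X[coreset_idx], Y[coreset_idx]
    K = rbf_kernel(Xs, Xs, gamma) + ridge * np.eye(len(coreset_idx))
    alpha = np.linalg.solve(K, Ys)              # inner problem (closed form for this proxy)
    preds = rbf_kernel(X, Xs, gamma) @ alpha    # predictions on the full dataset
    return np.mean((preds - Y) ** 2)


def greedy_coreset(X, Y, m, n_candidates=64, seed=0):
    """Greedily grow a coreset of size m, each step adding the candidate point
    that most reduces the proxy model's loss on the full dataset."""
    rng = np.random.default_rng(seed)
    selected = []
    for _ in range(m):
        pool = rng.choice(len(X), size=min(n_candidates, len(X)), replace=False)
        pool = [i for i in pool if i not in selected]
        losses = [outer_loss(selected + [i], X, Y) for i in pool]
        selected.append(pool[int(np.argmin(losses))])
    return selected


if __name__ == "__main__":
    # Tiny synthetic binary problem (labels in {-1, +1}), purely for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    Y = np.sign(X[:, 0] * X[:, 1])
    coreset = greedy_coreset(X, Y, m=20)
    print("selected indices:", coreset)
    print("full-data MSE of proxy trained on coreset:", outer_loss(coreset, X, Y))
```
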
Methods
  • Summarization strategies compared (rows of Table 1): uniform sampling, k-means of features, k-center of embeddings, hardest samples, iCaRL’s selection, coreset (ours); streaming coreset (train at the end), reservoir sampling, streaming coreset; VCL with k-center, VCL with uniform sampling, VCL with coreset. Benchmarks (columns): PermMNIST, SplitMNIST, SplitFashionMNIST.
  • Continual learning (Sec. 6.2): The authors validate the method in the replay memory-based approach to continual learning.
  • The results in Table 6 confirm that the method not only achieves better training accuracy on the stream (by 4%) but also represents past tasks far better than reservoir sampling.
  • The imbalanced stream is created by splitting CIFAR-10 into 5 tasks, where each task consists of distinguishing between two consecutive classes of images, and retaining 200 random samples from the first four tasks and 2000 from the last task (a construction sketch follows this list).
  • The authors evaluate test accuracy on each task individually, without undersampling the test set.
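
The construction sketch referenced above, in a minimal form. Arrays `X` and `y` are assumed to be an already loaded labeled dataset (e.g., CIFAR-10), and the choice to keep 200 samples from each of the first four tasks, rather than 200 in total, is an assumption made for illustration.

```python
"""Sketch of building an imbalanced class-incremental stream (10 classes -> 5 two-class tasks)."""
import numpy as np


def build_imbalanced_stream(X, y, n_tasks=5, small=200, large=2000, seed=0):
    """Split classes into consecutive pairs and heavily subsample the early tasks."""
    rng = np.random.default_rng(seed)
    stream_X, stream_y = [], []
    for t in range(n_tasks):
        classes = (2 * t, 2 * t + 1)                      # task t: two consecutive classes
        idx = np.flatnonzero(np.isin(y, classes))
        budget = small if t < n_tasks - 1 else large      # last task is over-represented
        keep = rng.choice(idx, size=min(budget, len(idx)), replace=False)
        stream_X.append(X[keep])
        stream_y.append(y[keep])
    # Concatenate tasks in order: the learner sees them sequentially as one stream.
    return np.concatenate(stream_X), np.concatenate(stream_y)


if __name__ == "__main__":
    # Stand-in random data with CIFAR-10-like shapes, for a self-contained demo.
    X = np.random.rand(5000, 32, 32, 3).astype(np.float32)
    y = np.random.randint(0, 10, size=5000)
    Xs, ys = build_imbalanced_stream(X, y)
    print("stream size:", len(ys), "per-class counts:", np.bincount(ys, minlength=10))
```
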
Results
  • Whereas coresets outperform uniform sampling and active learning by a large margin, the RBF kernel is a surprisingly good proxy: it matches the performance of the CNTK and achieves above 97% test accuracy when trained on 250 chosen samples.
  • In each step, the method selects the sample with the greatest potential to increase accuracy: the first 10 samples are picked from different classes, after which the method diversifies within the classes.
  • The authors' method significantly outperforms reservoir sampling on average test accuracy
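
For reference, the reservoir sampling baseline compared against here is the classic algorithm of Vitter (listed in the references); a minimal sketch of that standard algorithm follows, shown only for contrast and not taken from the paper's code.

```python
"""Minimal reservoir sampling (Vitter's algorithm R): uniform sample of size k from a stream."""
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, n)       # inclusive; item n enters with probability k / (n + 1)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Every stream element ends up in the reservoir with probability k / N.
    print(reservoir_sample(range(10_000), k=10))
```
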
Conclusion
  • The authors presented a novel framework for coreset generation based on bilevel optimization with cardinality constraints.
  • The authors showed that the method yields representative data summaries for neural networks and illustrated its advantages in alleviating catastrophic forgetting in continual learning and streaming with deep networks, where the coreset construction substantially outperforms existing summarization strategies.
Summary
  • Objectives:

    The authors first present the coreset construction given a fixed dataset $X = \{(x_i, y_i)\}_{i=1}^{n}$. They consider a weighted variant of empirical risk minimization (ERM), where the goal is to minimize $L(\theta, w) = \sum_{i=1}^{n} w_i \, \ell(x_i, y_i; \theta)$ over $\theta$, with nonnegative per-sample weights $w$ (a minimal numerical sketch follows this list).
  • The authors' goal is to apply the coreset construction to complex models like deep neural networks.
  • The authors also compare the method to other data summarization strategies for managing the replay memory.
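
To make the weighted ERM objective above concrete, a minimal numpy sketch for binary logistic regression follows; the specific loss and the absence of a regularizer are illustrative choices rather than the authors' exact setup.

```python
"""Weighted empirical risk L(theta, w) = sum_i w_i * loss_i for binary logistic regression."""
import numpy as np


def weighted_logistic_risk(theta, X, y, w):
    """L(theta, w) = sum_i w_i * log(1 + exp(-y_i * <theta, x_i>)), with y_i in {-1, +1}."""
    margins = y * (X @ theta)
    losses = np.logaddexp(0.0, -margins)      # numerically stable log(1 + exp(-m))
    return float(w @ losses)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = np.sign(rng.normal(size=100))
    theta = np.zeros(5)
    w = np.zeros(100)
    w[rng.choice(100, size=10, replace=False)] = 1.0   # a 10-point "coreset" of unit weights
    print("weighted risk on the coreset:", weighted_logistic_risk(theta, X, y, w))
```
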
Tables
  • Table 1: Upper and middle: continual learning and streaming with a replay memory of size 100. Lower: VCL with 20 summary points per task. We report the average test accuracy over the tasks, with standard deviation over 5 runs with different random seeds. The methods using our coreset construction dominate.
  • Table 2: Imbalanced streaming on SplitMNIST and the split version of CIFAR-10.
  • Table 3: Continual learning and streaming with a replay memory of size 100. We report the average test accuracy over the tasks, with standard deviation over 5 runs with different random seeds. Methods using our coreset construction dominate.
  • Table 4: Replay memory size study on SplitMNIST. Our method offers bigger improvements with smaller memory sizes.
  • Table 5: RBF vs. CNTK on SplitMNIST. The two kernels offer similar performance.
  • Table 6: Imbalanced streaming on SplitMNIST. Our method outperforms reservoir sampling by a large margin, both on train accuracy on the stream and on average test accuracy on the tasks.
  • Table 7: Imbalanced streaming on the split version of CIFAR-10. Our method significantly outperforms reservoir sampling on average test accuracy on the tasks.
Related work
  • Continual Learning and Streaming. Continual learning with neural networks has received increasing interest recently. The approaches for alleviating catastrophic forgetting fall into three main categories: using weight regularization to restrict deviation from parameters learned on old tasks [31, 44]; architectural adaptations for the tasks [49]; and replay-based approaches, where samples from old tasks are either reproduced via a replay memory [39] or via generative models [51]. In this work, we focus on the replay-based approach, which provides strong empirical performance [9] despite its simplicity. In contrast, the more challenging setting of streaming with neural networks has received little attention. To the best of our knowledge, the replay-based approach to streaming has been tackled by [2, 26], which we compare against experimentally.
Funding
  • This research was supported by the SNSF grant 407540_167212 through the NRP 75 Big Data program and by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme grant agreement No 815943
References
  • R. Aljundi, K. Kelchtermans, and T. Tuytelaars. Task-free continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11254–11263, 2019.
  • R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio. Gradient based sample selection for online continual learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 11816–11825. Curran Associates, Inc., 2019.
  • S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8139–8148. Curran Associates, Inc., 2019.
  • D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, page 1027–1035, USA, 2006. Society for Industrial and Applied Mathematics.
  • J. F. Bard. Practical Bilevel Optimization: Algorithms and Applications. Springer, 1998.
  • A. A. Bian, J. M. Buhmann, A. Krause, and S. Tschiatschek. Guarantees for greedy maximization of non-submodular functions with applications. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 498–507, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • T. Campbell and T. Broderick. Automated scalable bayesian inference via hilbert coresets. The Journal of Machine Learning Research, 20(1):551–588, 2019.
  • K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statist. Sci., 10(3):273– 304, 08 1995.
  • A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, and M. Ranzato. Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486, 2019.
  • B. Chazelle and J. Matoušek. On linear-time deterministic algorithms for optimization problems in fixed dimension. Journal of Algorithms, 21(3):579–597, 1996.
  • C. Coleman, C. Yeh, S. Mussmann, B. Mirzasoleiman, P. Bailis, P. Liang, J. Leskovec, and M. Zaharia. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations, 2020.
  • R. D. Cook and S. Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495–508, 1980.
  • R. D. Cook and S. Weisberg. Residuals and influence in regression. New York: Chapman and Hall, 1982.
  • A. Das and D. Kempe. Submodular meets spectral: greedy algorithms for subset selection, sparse approximation and dictionary selection. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 1057–1064, 2011.
  • S. Fanello, C. Ciliberto, M. Santoro, L. Natale, G. Metta, L. Rosasco, and F. Odone. icub world: Friendly robots help building good vision data-sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 700–705, 2013.
  • S. Farquhar and Y. Gal. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018.
  • V. V. Fedorov. Theory of optimal experiments. Probability and mathematical statistics. Academic Press, New York, NY, USA, 1972.
  • D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 569–578. ACM, 2011.
  • C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.
  • L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910, 2018.
  • M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
  • R. M. French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
  • S. Ghadimi and W. Mengdi. Approximation methods for bilevel programming. arXiv:1802.02246, 2018.
  • I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks, 2014.
  • C. Harshaw, M. Feldman, J. Ward, and A. Karbasi. Submodular maximization beyond nonnegativity: Guarantees, fast algorithms, and applications. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2634–2643, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
  • T. L. Hayes, N. D. Cahill, and C. Kanan. Memory efficient experience replay for streaming learning. In International Conference on Robotics and Automation (ICRA). IEEE, 2019.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770– 778, 2016.
  • J. Huggins, T. Campbell, and T. Broderick. Coresets for scalable bayesian logistic regression. In Advances in Neural Information Processing Systems, pages 4080–4088, 2016.
  • A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
  • D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • A. Kirsch, J. van Amersfoort, and Y. Gal. Batchbald: Efficient and diverse batch acquisition for deep bayesian active learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 7024–7035. Curran Associates, Inc., 2019.
  • P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1885–1894. JMLR. org, 2017.
  • A. Krizhevsky et al. Learning multiple layers of features from tiny images, 2009.
  • J. Lee, J. Sohl-dickstein, J. Pennington, R. Novak, S. Schoenholz, and Y. Bahri. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
  • D. C. Liu and J. Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.
  • H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.
  • F. Locatello, M. Tschannen, G. Rätsch, and M. Jaggi. Greedy algorithms for cone constrained optimization with convergence guarantees. In Advances in Neural Information Processing Systems, pages 773–784, 2017.
  • D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
  • J. Lorraine, P. Vicol, and D. Duvenaud. Optimizing millions of hyperparameters by implicit differentiation, 2019.
  • M. Lucic, M. Faulkner, A. Krause, and D. Feldman. Training gaussian mixture models at scale via coresets. The Journal of Machine Learning Research, 18(1):5885–5909, 2017.
  • J. Luketina, M. Berglund, K. Greff, and T. Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In International conference on machine learning, pages 2952– 2960, 2016.
  • M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Academic Press, 1989.
  • C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational continual learning. In International Conference on Learning Representations, 2018.
  • R. Novak, L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz. Neural tangents: Fast and easy infinite neural networks in python. In International Conference on Learning Representations, 2020.
  • F. Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746, 2016.
  • S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
  • M. Ren, W. Zeng, B. Yang, and R. Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pages 4331–4340, 2018.
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
  • O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
  • H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
  • J. Tapia, E. Knoop, M. Mutný, M. A. Otaduy, and M. Bächer. Makesense: Automated sensor design for proprioceptive soft robots. Soft Robotics, 2019. PMID: 31891526.
  • R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • M. K. Titsias, J. Schwarz, A. G. de G. Matthews, R. Pascanu, and Y. W. Teh. Functional regularisation for continual learning with gaussian processes. In International Conference on Learning Representations, 2020.
  • L. N. Vicente and P. H. Calamai. Bilevel and multilevel programming: A bibliography review. Journal of Global optimization, 5(3):291–306, 1994.
  • J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985.
  • K. Wei, R. Iyer, and J. Bilmes. Submodularity in data subset selection and active learning. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1954–1963, Lille, France, 07–09 Jul 2015. PMLR.
  • H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
  • F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3987–3995, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.