# Coresets via Bilevel Optimization for Continual Learning and Streaming

NeurIPS 2020.

Abstract:

Coresets are small data summaries that are sufficient for model training. They can be maintained online, enabling efficient handling of large data streams under resource constraints. However, existing constructions are limited to simple models such as k-means and logistic regression. In this work, we propose a novel coreset construction…

Introduction

- More and more applications rely on predictive models that are learnt online. A crucial and, in general, open problem is to reliably maintain accurate models as data arrives over time.
- The data arrives sequentially, and the notion of a task is not defined.
- In such practically important settings, where data arrives in a non-i.i.d. manner, the performance of models can degrade arbitrarily.
- This is especially problematic in the non-convex setting of deep learning, where this phenomenon is referred to as catastrophic forgetting [43, 22].
- A crucial tool that allows them to apply first-order methods, by enabling the calculation of the gradient of G with respect to w, is the implicit function theorem applied to the stationarity condition of the inner problem, ∇_θ L(θ*(w), w) = 0.
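As a concrete illustration of this implicit-gradient computation, here is a minimal numerical sketch on a toy weighted ridge regression, where the inner problem has a closed-form solution. This is not the authors' implementation; the data, the outer objective, and all names are illustrative, and the implicit-function-theorem hypergradient is checked against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
Xv, yv = rng.normal(size=(8, d)), rng.normal(size=8)  # outer (validation) data
lam = 0.1

def inner_solution(w):
    # θ*(w) = argmin_θ Σ_i w_i (x_iᵀθ - y_i)² + λ‖θ‖²  (closed form for ridge)
    A = X.T @ (w[:, None] * X) + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ (w * y))

def outer_loss(w):
    # G(w): squared error of θ*(w) on the outer data
    r = Xv @ inner_solution(w) - yv
    return 0.5 * r @ r

def hypergradient(w):
    # dG/dw via the implicit function theorem: no unrolling of the inner solver
    theta = inner_solution(w)
    H = 2 * X.T @ (w[:, None] * X) + 2 * lam * np.eye(d)  # inner Hessian ∇²_θ L
    B = 2 * X * (X @ theta - y)[:, None]                  # rows: ∂(∇_θ L)/∂w_i
    g_theta = Xv.T @ (Xv @ theta - yv)                    # ∇_θ G
    v = np.linalg.solve(H, g_theta)                       # H⁻¹ ∇_θ G
    return -B @ v

w0 = np.ones(n)
g = hypergradient(w0)
eps = 1e-6  # central finite differences as a correctness check
fd = np.array([(outer_loss(w0 + eps * np.eye(n)[i])
                - outer_loss(w0 - eps * np.eye(n)[i])) / (2 * eps)
               for i in range(n)])
```

The key point is that the hypergradient requires only one linear solve against the inner Hessian, rather than differentiating through the inner optimization loop.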

Highlights

- More and more applications rely on predictive models that are learnt online
- While the latter work uses a strategy similar to ours for sensor subset selection, we investigate the different setting of weighted data summarization for continual learning and streaming with neural networks
- We demonstrate how our coreset construction can achieve significant performance gains in continual learning and in the more challenging streaming settings with neural networks
- Whereas coresets outperform uniform sampling and active learning by a large margin, the RBF kernel is a surprisingly good proxy, matching the performance of the convolutional neural tangent kernel (CNTK) and achieving above 97% test accuracy when trained on 250 chosen samples
- We presented a novel framework for coreset generation based on bilevel optimization with cardinality constraints
- Our method significantly outperforms reservoir sampling on average test accuracy
- We showed that our method yields representative data summaries for neural networks and illustrated its advantages in alleviating catastrophic forgetting in continual learning and streaming with deep learning, where our coreset construction substantially outperforms existing summarization strategies

Methods

- Replay-memory management baselines: uniform sampling, k-means of features, k-center of embeddings, hardest samples, iCaRL’s selection, and our coreset.
- Streaming baselines: streaming coreset with training at the end, reservoir sampling, and streaming coreset.
- VCL variants: VCL with k-center, VCL with uniform sampling, and VCL with coreset, evaluated on PermMNIST, SplitMNIST, and SplitFashionMNIST.
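Reservoir sampling (Vitter) is the replay-memory baseline the streaming experiments compare against. Below is a standard sketch of Algorithm R, which maintains a uniform random sample of fixed size over a stream; this is not the authors' code, and all names are ours.

```python
import random

def reservoir_sample(stream, m, seed=0):
    """Vitter's Algorithm R: keep a uniform sample of size m from a stream."""
    rng = random.Random(seed)
    reservoir = []
    for t, item in enumerate(stream):
        if t < m:
            reservoir.append(item)          # fill the memory first
        else:
            j = rng.randrange(t + 1)        # item t survives with prob m/(t+1)
            if j < m:
                reservoir[j] = item         # evict a uniformly chosen slot
    return reservoir

mem = reservoir_sample(range(10000), 100)
```

On an imbalanced stream this uniformity is exactly the weakness the coreset construction targets: classes that dominate the stream also dominate the memory.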

Continual Learning

The authors validate the method within the replay memory-based approach to continual learning.

- The results in Table 6 confirm that the method not only has better training performance on the stream (by 4%) but also represents past tasks better than reservoir sampling by a large margin.
- The imbalanced stream is created by splitting CIFAR-10 into 5 tasks, where each task consists of distinguishing between two consecutive classes of images, and retaining 200 random samples from each of the first four tasks and 2000 from the last task.
- The authors evaluate test accuracy on each task individually, without undersampling the test set.
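The imbalanced-stream protocol above can be reproduced in a few lines. The sketch below uses synthetic labels as a stand-in for CIFAR-10; the function and variable names are ours.

```python
import numpy as np

def make_imbalanced_stream(labels, per_task=(200, 200, 200, 200, 2000), seed=0):
    """Split a 10-class dataset into 5 two-class tasks and subsample each task.

    Tasks are consecutive class pairs (0/1, 2/3, ..., 8/9); the first four
    tasks keep 200 random samples each, the last keeps 2000.
    Returns one index array per task.
    """
    rng = np.random.default_rng(seed)
    tasks = []
    for t, n_keep in enumerate(per_task):
        idx = np.where((labels == 2 * t) | (labels == 2 * t + 1))[0]
        tasks.append(rng.choice(idx, size=n_keep, replace=False))
    return tasks

# toy stand-in for CIFAR-10 labels: 1000 samples per class
labels = np.repeat(np.arange(10), 1000)
stream = make_imbalanced_stream(labels)
```

The last task is ten times larger than the others, so a uniform memory (e.g. reservoir sampling) ends up dominated by the final class pair.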

Results

- Whereas coresets outperform uniform sampling and active learning by a large margin, the RBF kernel is a surprisingly good proxy, matching the performance of the CNTK and achieving above 97% test accuracy when trained on 250 chosen samples.
- The authors can notice that in each step, the method selects the sample that has the potential to increase the accuracy by the largest amount: the first 10 samples are picked from different classes, after which the method diversifies within the classes.
- The authors' method significantly outperforms reservoir sampling on average test accuracy
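The greedy behavior described in the bullets above can be mimicked with a small sketch: forward selection in which a cheap RBF kernel ridge proxy scores each candidate by the validation loss after adding it. This simplifies the paper's bilevel selection rule (regression in place of classification, exhaustive scoring in place of the gradient-based rule); all data and names here are illustrative.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # RBF kernel matrix between rows of A and rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def greedy_select(X, y, Xv, yv, m, lam=1e-3, gamma=1.0):
    """Greedy forward selection: at every step, add the candidate whose
    inclusion most reduces the validation MSE of a kernel ridge proxy."""
    S = []
    for _ in range(m):
        best, best_loss = None, np.inf
        for c in range(len(X)):
            if c in S:
                continue
            idx = S + [c]
            K = rbf(X[idx], X[idx], gamma) + lam * np.eye(len(idx))
            alpha = np.linalg.solve(K, y[idx])          # fit proxy on S ∪ {c}
            pred = rbf(Xv, X[idx], gamma) @ alpha
            loss = np.mean((pred - yv) ** 2)
            if loss < best_loss:
                best, best_loss = c, loss
        S.append(best)
    return S

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0])                 # toy two-class labels as ±1 targets
Xv = rng.normal(size=(20, 2))
yv = np.sign(Xv[:, 0])
S = greedy_select(X, y, Xv, yv, 5)
```

Each step picks the single sample with the largest potential gain, which is the qualitative behavior reported above.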

Conclusion

- The authors presented a novel framework for coreset generation based on bilevel optimization with cardinality constraints.
- The authors showed that the method yields representative data summaries for neural networks and illustrated its advantages in alleviating catastrophic forgetting in continual learning and streaming with deep learning, where the coreset construction substantially outperforms existing summarization strategies

Objectives

- The authors first present the coreset construction for a fixed dataset X = {(x_i, y_i)}_{i=1}^n, considering a weighted variant of empirical risk minimization (ERM) whose goal is to minimize the weighted loss L(θ, w) = Σ_{i=1}^n w_i ℓ(θ; x_i, y_i).
- The authors' goal is to apply the coreset construction to complex models such as deep neural networks.
- The authors' goal is to compare the method to other data summarization strategies for managing the replay memory.
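Written out, the cardinality-constrained bilevel program behind this construction takes the following shape (notation reconstructed from the summary; the paper's exact regularizer and constraint set may differ):

```latex
\min_{w \ge 0,\; \|w\|_0 \le m} \; G(w) := \sum_{i=1}^{n} \ell\!\left(\theta^{*}(w);\, x_i, y_i\right)
\quad \text{s.t.} \quad
\theta^{*}(w) \in \operatorname*{arg\,min}_{\theta} \; L(\theta, w)
= \operatorname*{arg\,min}_{\theta} \; \sum_{i=1}^{n} w_i\, \ell(\theta;\, x_i, y_i)
```

The inner problem trains the model on the weighted summary, the outer problem asks that this model fit the full dataset, and the gradient of G with respect to w is obtained via the implicit function theorem.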

- Table 1: Upper and middle: continual learning and streaming with a replay memory of size 100. Lower: VCL with 20 summary points per task. We report the average test accuracy over the tasks, with standard deviation over 5 runs with different random seeds. The methods using our coreset construction dominate.
- Table 2: Imbalanced streaming on SplitMNIST and
- Table 3: Continual learning and streaming with a replay memory of size 100. We report the average test accuracy over the tasks, with standard deviation calculated over 5 runs with different random seeds. Methods using our coreset construction dominate.
- Table 4: Replay memory size study on SplitMNIST. Our method offers bigger improvements with smaller memory sizes.
- Table 5: RBF vs. CNTK on SplitMNIST. The two kernels offer similar performance.
- Table 6: Imbalanced streaming on SplitMNIST. Our method outperforms reservoir sampling by a large margin, both on train accuracy on the stream and on average test accuracy on the tasks.
- Table 7: Imbalanced streaming on the split version of CIFAR-10. Our method significantly outperforms reservoir sampling on average test accuracy on the tasks.

Related work

- Continual Learning and Streaming. Continual learning with neural networks has received increasing interest recently. The approaches for alleviating catastrophic forgetting fall into three main categories: using weight regularization to restrict deviation from parameters learned on old tasks [31, 44]; architectural adaptations for the tasks [49]; and replay-based approaches, where samples from old tasks are either reproduced via a replay memory [39] or via generative models [51]. In this work, we focus on the replay-based approach, which provides strong empirical performance [9] despite its simplicity. In contrast, the more challenging setting of streaming with neural networks has received little attention. To the best of our knowledge, the replay-based approach to streaming has been tackled by [2, 26], which we compare against experimentally.

Funding

- This research was supported by the SNSF grant 407540_167212 through the NRP 75 Big Data program and by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme grant agreement No 815943

References

- R. Aljundi, K. Kelchtermans, and T. Tuytelaars. Task-free continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11254–11263, 2019.
- R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio. Gradient based sample selection for online continual learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 11816–11825. Curran Associates, Inc., 2019.
- S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8139–8148. Curran Associates, Inc., 2019.
- D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, page 1027–1035, USA, 2006. Society for Industrial and Applied Mathematics.
- J. F. Bard. Practical Bilevel Optimization: Algorithms and Applications. Springer, 1998.
- A. A. Bian, J. M. Buhmann, A. Krause, and S. Tschiatschek. Guarantees for greedy maximization of non-submodular functions with applications. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 498–507, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
- T. Campbell and T. Broderick. Automated scalable Bayesian inference via Hilbert coresets. The Journal of Machine Learning Research, 20(1):551–588, 2019.
- K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statist. Sci., 10(3):273– 304, 08 1995.
- A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, and M. Ranzato. Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486, 2019.
- B. Chazelle and J. Matoušek. On linear-time deterministic algorithms for optimization problems in fixed dimension. Journal of Algorithms, 21(3):579–597, 1996.
- C. Coleman, C. Yeh, S. Mussmann, B. Mirzasoleiman, P. Bailis, P. Liang, J. Leskovec, and M. Zaharia. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations, 2020.
- R. D. Cook and S. Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495–508, 1980.
- R. D. Cook and S. Weisberg. Residuals and influence in regression. New York: Chapman and Hall, 1982.
- A. Das and D. Kempe. Submodular meets spectral: greedy algorithms for subset selection, sparse approximation and dictionary selection. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 1057–1064, 2011.
- S. Fanello, C. Ciliberto, M. Santoro, L. Natale, G. Metta, L. Rosasco, and F. Odone. icub world: Friendly robots help building good vision data-sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 700–705, 2013.
- S. Farquhar and Y. Gal. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018.
- V. V. Fedorov. Theory of optimal experiments. Probability and mathematical statistics. Academic Press, New York, NY, USA, 1972.
- D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 569–578. ACM, 2011.
- C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.
- L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910, 2018.
- M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
- R. M. French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
- S. Ghadimi and M. Wang. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.
- I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks, 2014.
- C. Harshaw, M. Feldman, J. Ward, and A. Karbasi. Submodular maximization beyond nonnegativity: Guarantees, fast algorithms, and applications. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2634–2643, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
- T. L. Hayes, N. D. Cahill, and C. Kanan. Memory efficient experience replay for streaming learning. In International Conference on Robotics and Automation (ICRA). IEEE, 2019.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770– 778, 2016.
- J. Huggins, T. Campbell, and T. Broderick. Coresets for scalable bayesian logistic regression. In Advances in Neural Information Processing Systems, pages 4080–4088, 2016.
- A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
- J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
- A. Kirsch, J. van Amersfoort, and Y. Gal. Batchbald: Efficient and diverse batch acquisition for deep bayesian active learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 7024–7035. Curran Associates, Inc., 2019.
- P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1885–1894. JMLR. org, 2017.
- A. Krizhevsky et al. Learning multiple layers of features from tiny images, 2009.
- J. Lee, J. Sohl-dickstein, J. Pennington, R. Novak, S. Schoenholz, and Y. Bahri. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
- D. C. Liu and J. Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.
- H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.
- F. Locatello, M. Tschannen, G. Rätsch, and M. Jaggi. Greedy algorithms for cone constrained optimization with convergence guarantees. In Advances in Neural Information Processing Systems, pages 773–784, 2017.
- D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
- J. Lorraine, P. Vicol, and D. Duvenaud. Optimizing millions of hyperparameters by implicit differentiation, 2019.
- M. Lucic, M. Faulkner, A. Krause, and D. Feldman. Training gaussian mixture models at scale via coresets. The Journal of Machine Learning Research, 18(1):5885–5909, 2017.
- J. Luketina, M. Berglund, K. Greff, and T. Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In International conference on machine learning, pages 2952– 2960, 2016.
- M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Academic Press, 1989.
- C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational continual learning. In International Conference on Learning Representations, 2018.
- R. Novak, L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz. Neural tangents: Fast and easy infinite neural networks in python. In International Conference on Learning Representations, 2020.
- F. Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746, 2016.
- S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
- M. Ren, W. Zeng, B. Yang, and R. Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pages 4331–4340, 2018.
- A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
- O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
- H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
- J. Tapia, E. Knoop, M. Mutný, M. A. Otaduy, and M. Bächer. Makesense: Automated sensor design for proprioceptive soft robots. Soft Robotics, 2019. PMID: 31891526.
- R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
- M. K. Titsias, J. Schwarz, A. G. de G. Matthews, R. Pascanu, and Y. W. Teh. Functional regularisation for continual learning with gaussian processes. In International Conference on Learning Representations, 2020.
- L. N. Vicente and P. H. Calamai. Bilevel and multilevel programming: A bibliography review. Journal of Global optimization, 5(3):291–306, 1994.
- J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985.
- K. Wei, R. Iyer, and J. Bilmes. Submodularity in data subset selection and active learning. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1954–1963, Lille, France, 07–09 Jul 2015. PMLR.
- H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
- F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3987–3995, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
