Training Stronger Baselines for Learning to Optimize
NeurIPS 2020
Learning to optimize (L2O) has gained increasing attention since classical optimizers require laborious problem-specific design and hyperparameter tuning. However, there is a gap between the practical demand and the achievable performance of existing L2O models. Specifically, those learned optimizers are applicable to only a limited class of problems.
- Learning to optimize (L2O) [1,2,3,4,5,6,7,8,9,10], a rising sub-field of meta-learning, aims to replace manually designed analytical optimizers with learned optimizers, i.e., update rules as functions that can be fit from data.
- An L2O method uses a model to parameterize the target update rule.
- The L2O model acts as an algorithm itself that can be applied to train other machine learning models, called optimizees, sampled from a specific class of similar problems.
- The training of the L2O model is usually done in a meta fashion, by training it to decrease the loss values of optimizees sampled from the same class, via certain training techniques.
- The LSTM is unrolled to mimic the behavior of an iterative optimizer and trained to minimize the accumulated optimizee losses over the unrolled horizon.
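As a toy illustration of this unrolled meta-training scheme (not the paper's actual LSTM optimizer), one can shrink the learned optimizer down to a single learnable step size and meta-train it on random quadratic optimizees; `sample_optimizee`, the unroll length, and the finite-difference meta-gradient are all simplifying assumptions made here for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_optimizee():
    """Sample a random quadratic optimizee f(x) = 0.5 * x^T A x."""
    A = np.diag(rng.uniform(0.5, 2.0, size=5))
    x0 = rng.normal(size=5)
    return A, x0

def unrolled_meta_loss(theta, A, x0, steps=20):
    """Unroll the learned rule x <- x - theta * grad and accumulate the
    optimizee losses along the trajectory (the meta-loss)."""
    x, total = x0.copy(), 0.0
    for _ in range(steps):
        grad = A @ x
        x = x - theta * grad          # the "learned" update rule
        total += 0.5 * x @ A @ x      # optimizee loss after the update
    return total

# Meta-training: decrease the unrolled meta-loss over sampled optimizees.
# A finite-difference meta-gradient stands in for backprop through the unroll.
theta, eps, meta_lr = 0.01, 1e-4, 2e-3
for _ in range(500):
    A, x0 = sample_optimizee()
    g = (unrolled_meta_loss(theta + eps, A, x0)
         - unrolled_meta_loss(theta - eps, A, x0)) / (2 * eps)
    theta -= meta_lr * np.clip(g, -10.0, 10.0)
```

A real L2O model replaces the scalar `theta` with an RNN mapping gradients to updates, but the training loop has the same shape: sample an optimizee, unroll the learned rule, and descend the accumulated loss.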
- 4.1 Imitation learning: multi-task regularization by analytical optimizers. We propose another L2O training method based on imitating analytical optimizers' behaviors, in a multi-task learning form, which is found to further stabilize training, prevent overfitting, and improve the trained L2O models' generalization.
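A minimal sketch of this multi-task form, under the same toy reduction of the learned optimizer to a scalar step size (plain SGD stands in for the analytical expert; `lam` and `sgd_lr` are illustrative choices, not the paper's settings):

```python
import numpy as np

def combined_meta_loss(theta, A, x0, lam=0.1, steps=20, sgd_lr=0.5):
    """Unroll the learned rule and add an imitation term that pulls each
    learned update toward the update an analytical optimizer would take."""
    x, meta, imit = x0.copy(), 0.0, 0.0
    for _ in range(steps):
        grad = A @ x
        learned_update = -theta * grad   # the learned optimizer's proposal
        expert_update = -sgd_lr * grad   # the analytical "expert" update
        imit += np.sum((learned_update - expert_update) ** 2)
        x = x + learned_update
        meta += 0.5 * x @ A @ x          # ordinary meta-loss term
    return meta + lam * imit             # multi-task objective
```

Early in training the imitation term keeps the learned updates close to a sane analytical optimizer; one might anneal `lam` so the learned rule is eventually free to depart from its teacher.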
- Our improved training techniques can be further plugged into previous state-of-the-art L2O methods and yield extra performance boosts for them all
- We propose a set of improved training techniques to unleash the great potential of L2O models
- The contributions made in this work are of practical nature; we hope them to lay a solid and fair evaluation ground by offering strong baselines for the L2O community
- This paper proposes several improved training techniques to tackle the dilemma of training instability and poor generalization in learned optimizers
- Experiments and Analysis: the authors conduct systematic experiments to evaluate the proposed training techniques.
- The authors' results come with multiple independent runs, and the error bars are reported in the Appendix
- Results and Analysis: the results are presented in Figure A2. The authors observe that the model trained by curriculum learning outperforms the two baselines (i.e., L2O-DM and L2O-DM-AUG) with fewer training iterations.
- Learning to optimize (L2O) is a promising branch of meta-learning that has so far been held back by unstable L2O training and the poor generalization of learned optimizers.
- This work provides practical solutions to push this field forward.
- Learning to Optimize. L2O uses a data-driven learned model as the optimizer, instead of handcrafted rules (e.g., SGD, RMSprop, and Adam). Andrychowicz et al. [1] were the first to leverage an LSTM as the coordinate-wise optimizer, which is fed with the optimizee gradients and outputs the optimizee parameter updates. Li and Malik instead took the optimizee's objective value history as the input state of a reinforcement learning agent, which outputs the updates as actions. To train an L2O with better generalization and longer horizons, Lv et al. propose random scaling and convex-function regularizer tricks. [8,23] introduce a hierarchical RNN to capture the relationship across the optimizee parameters and train it via meta-learning on an ensemble of small representative problems. Besides learning the full update rule, L2O has also been customized for automatic hyperparameter tuning in specific tasks [24,25,26].
- Curriculum Learning. The idea, introduced by Bengio et al., is to first focus on learning from a subset of simple training examples, gradually expanding to include the remaining harder samples. Curriculum learning often yields faster convergence and better generalization, especially when the training set is varied or noisy. Jiang et al. unify it with self-paced learning. Graves et al. automate curriculum learning by employing a non-stationary multi-armed bandit algorithm with learning-progress indicators as rewards. [30,31,32,33] describe a number of applications where curriculum learning plays an important role.
- Imitation Learning. Imitation learning [34,35], also known as "learning from demonstration", imitates an expert demonstration instead of learning from rewards as in reinforcement learning.
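In the L2O context, a natural curriculum is over the unroll length: meta-train on short, easy horizons first, then warm-start progressively longer ones. A self-contained toy sketch (1-D quadratics, a scalar learned step size, finite-difference meta-gradients; the stage schedule is an illustrative assumption, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def meta_train(theta, unroll, iters=100, meta_lr=2e-3, eps=1e-4):
    """Meta-train a scalar step size on random 1-D quadratics
    f(x) = 0.5 * a * x^2, unrolling the rule x <- x - theta * a * x."""
    def loss(th, a, x0):
        x, total = x0, 0.0
        for _ in range(unroll):
            x = x - th * a * x           # learned update
            total += 0.5 * a * x * x     # accumulate optimizee loss
        return total
    for _ in range(iters):
        a, x0 = rng.uniform(0.5, 2.0), rng.normal()
        g = (loss(theta + eps, a, x0) - loss(theta - eps, a, x0)) / (2 * eps)
        theta -= meta_lr * np.clip(g, -10.0, 10.0)
    return theta

# Curriculum over the unroll length: short horizons first, then
# progressively harder (longer) ones, warm-starting from the last stage.
theta = 0.01
for unroll in (5, 20, 50):
    theta = meta_train(theta, unroll)
```

Starting at long horizons makes the meta-gradient through the unroll ill-behaved; the short-horizon stages give a reasonable initialization before the harder long-horizon stages are attempted.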
L2O-DM-CL-IL denotes the enhanced L2O model. Learned optimizers are evaluated on five representative optimizees, and the corresponding optimizee training losses are collected in Figure 2. From the results in Figure 2, we observe that the previously noncompetitive L2O-DM, which initially could not even stably converge on Optimizee (i) at long horizons, now consistently and substantially outperforms all previous state-of-the-art L2O methods (RNNprop, L2O-Scale, and L2O-Scale-Meta) by driving the objective loss value much lower.
- Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.
- Samy Bengio, Yoshua Bengio, and Jocelyn Cloutier. On the search for new learning rules for ANNs. Neural Processing Letters, 2(4):26–30, 1995.
- Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. Université de Montréal, Département d’informatique et de recherche..., 1990.
- A Steven Younger, Peter R Conwell, and Neil E Cotter. Fixed-weight on-line learning. IEEE Transactions on Neural Networks, 10(2):272–283, 1999.
- Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001.
- Yutian Chen, Matthew W Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P Lillicrap, Matt Botvinick, and Nando de Freitas. Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 748–756. JMLR.org, 2017.
- Kaifeng Lv, Shunhua Jiang, and Jian Li. Learning gradient descent: Better generalization and longer horizons. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2247–2255. JMLR.org, 2017.
- Olga Wichrowska, Niru Maheswaranathan, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3751–3760. JMLR.org, 2017.
- Yue Cao, Tianlong Chen, Zhangyang Wang, and Yang Shen. Learning to optimize in swarms. In Advances in Neural Information Processing Systems, pages 15018–15028, 2019.
- Zhaohui Yang, Yunhe Wang, Kai Han, Chunjing Xu, Chao Xu, Dacheng Tao, and Chang Xu. Searching for low-bit weights in quantized neural networks. arXiv preprint arXiv:2009.08695, 2020.
- David E Goldberg and John Henry Holland. Genetic algorithms and machine learning. Machine Learning, 3(2):95–99, 1988.
- Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pages 6–8. Univ. of Texas, 1992.
- James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of machine learning research, 13(Feb):281–305, 2012.
- Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V Le. Neural optimizer search with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 459–468. JMLR.org, 2017.
- Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
- Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
- Luke Metz, Niru Maheswaranathan, Jeremy Nixon, C Daniel Freeman, and Jascha Sohl-Dickstein. Understanding and correcting pathologies in the training of learned optimizers. arXiv preprint arXiv:1810.10180, 2018.
- Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209, 2017.
- Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318, 2013.
- Paavo Parmas, Carl Edward Rasmussen, Jan Peters, and Kenji Doya. Pipps: Flexible modelbased policy search robust to the curse of chaos. arXiv preprint arXiv:1902.01240, 2019.
- Xinshi Chen, Yu Li, Ramzan Umarov, Xin Gao, and Le Song. Rna secondary structure prediction by learning unrolled algorithms. In International Conference on Learning Representations, 2019.
- Xinshi Chen, Hanjun Dai, Yu Li, Xin Gao, and Le Song. Learning to stop while learning to predict. arXiv preprint arXiv:2006.05082, 2020.
- Chaojian Li, Tianlong Chen, Haoran You, Zhangyang Wang, and Yingyan Lin. HALO: Hardware-aware learning to optimize. In Proceedings of the European Conference on Computer Vision (ECCV), September 2020.
- Yuning You, Tianlong Chen, Zhangyang Wang, and Yang Shen. L2-gcn: Layer-wise and learned efficient training of graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2127–2135, 2020.
- Wuyang Chen, Zhiding Yu, Zhangyang Wang, and Anima Anandkumar. Automated synthetic-to-real generalization. International Conference on Machine Learning (ICML), 2020.
- Xuxi Chen, Wuyang Chen, Tianlong Chen, Ye Yuan, Chen Gong, Kewei Chen, and Zhangyang Wang. Self-pu: Self boosted and calibrated positive-unlabeled training. International Conference on Machine Learning (ICML), 2020.
- Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
- Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G Hauptmann. Self-paced curriculum learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1311–1320. JMLR.org, 2017.
- Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
- Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
- M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.
- Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. arXiv preprint arXiv:1707.05300, 2017.
- Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999.
- Stefan Schaal, Auke Ijspeert, and Aude Billard. Computational approaches to motor learning by imitation. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 358(1431):537–547, 2003.
- Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E Hinton. Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, pages 9593–9604, 2019.
- Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
- Binghong Chen, Bo Dai, Qinjie Lin, Guo Ye, Han Liu, and Le Song. Learning to plan in high dimensions via neural exploration-exploitation trees. In International Conference on Learning Representations, 2020.
- Melanie Coggan. Exploration and exploitation in reinforcement learning. Research supervised by Prof. Doina Precup, CRA-W DMP Project at McGill University, 2004.
- Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S Sutton. Toward off-policy learning control with function approximation. In ICML, 2010.
- Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Xuanyi Dong and Yi Yang. Nas-bench-201: Extending the scope of reproducible neural architecture search. In International Conference on Learning Representations, 2020.

arXiv:2010.09089v1 [cs.LG], 18 Oct 2020