ResNet After All: Neural ODEs and Their Numerical Solution

Katharina Ott
Prateek Katiyar
Michael Tiemann

International Conference on Learning Representations (ICLR), 2021.

We explain why some Neural ODE models do not permit a continuous-depth interpretation after training and how to fix it.

Abstract:

A key appeal of the recently proposed Neural Ordinary Differential Equation (ODE) framework is that it seems to provide a continuous-time extension of discrete residual neural networks. As we show herein, though, trained Neural ODE models actually depend on the specific numerical method used during training. If the trained model ...

Introduction
  • The choice of neural network architecture is an important consideration in the deep learning community.
  • Besides the architectural advancements inspired by the original ResNet scheme (Zagoruyko & Komodakis, 2016; Xie et al., 2017), Neural Ordinary Differential Equation (Neural ODE) models (Chen et al., 2018; E, 2017; Lu et al., 2018; Haber & Ruthotto, 2017) have recently been proposed as a continuous-depth analog of ResNets.
  • Inspired by the theoretical properties of the solution curves, prior work has proposed a regularizer that improves the robustness of Neural ODE models even further.
  • If Neural ODEs are chosen for their theoretical advantages, it is essential that the effective model (the combination of the ODE problem and its solution via a particular numerical method) is a close approximation of the true analytical, but practically inaccessible, ODE solution.
Highlights
  • The choice of neural network architecture is an important consideration in the deep learning community
  • Besides the architectural advancements inspired by the original ResNet scheme (Zagoruyko & Komodakis, 2016; Xie et al., 2017), Neural Ordinary Differential Equation (Neural ODE) models (Chen et al., 2018; E, 2017; Lu et al., 2018; Haber & Ruthotto, 2017) have recently been proposed as a continuous-depth analog of Residual Neural Networks (ResNets).
  • While Neural ODEs do not necessarily improve upon the sheer predictive performance of ResNets, they make the vast body of ODE theory available to deep learning research.
  • As a first step, we propose to check how robust the trained model is with respect to the step size/tolerance, to ensure that the resulting model is in a regime where ODE-ness is guaranteed and reasoning from ODE theory can be applied to the model (see the sketch after this list).
  • We have shown that the step size of fixed-step solvers and the tolerance of adaptive methods used for training Neural ODEs impact whether the resulting model maintains properties of ODE solutions.
  • We do not expect to achieve state-of-the-art results with this simple architecture, but we expect our results to remain valid for more complicated architectures.
  • We developed step size and tolerance adaptation algorithms that maintain a continuous ODE interpretation throughout training: if the performance of the test and train solvers agrees up to a threshold, the accuracy parameter is cautiously increased.
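
A minimal, self-contained sketch of the robustness check proposed above is given below: it evaluates one trained Neural ODE block under a halved step size and a higher-order solver and reports how often its predictions stay the same. The module and function names (ODEFunc, odeint_fixed, solver_agreement) and the toy two-dimensional architecture are illustrative assumptions, not the authors' code.

    import torch
    import torch.nn as nn

    class ODEFunc(nn.Module):
        """Illustrative dynamics f(t, x) parameterized by a small MLP."""
        def __init__(self, dim=2, hidden=32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

        def forward(self, t, x):
            return self.net(x)

    def odeint_fixed(f, x0, t0=0.0, t1=1.0, h=0.1, method="euler"):
        """Integrate dx/dt = f(t, x) from t0 to t1 with a fixed-step explicit solver."""
        x, t = x0, t0
        for _ in range(int(round((t1 - t0) / h))):
            if method == "euler":
                x = x + h * f(t, x)
            elif method == "rk4":
                k1 = f(t, x)
                k2 = f(t + h / 2, x + h / 2 * k1)
                k3 = f(t + h / 2, x + h / 2 * k2)
                k4 = f(t + h, x + h * k3)
                x = x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
            t = t + h
        return x

    @torch.no_grad()
    def solver_agreement(func, classifier, x, h_train=0.1, method_train="euler"):
        """Fraction of predictions unchanged when the step size is halved or the solver order is increased."""
        ref = classifier(odeint_fixed(func, x, h=h_train, method=method_train)).argmax(-1)
        agree = {}
        for method, h in [(method_train, h_train / 2), ("rk4", h_train)]:
            pred = classifier(odeint_fixed(func, x, h=h, method=method)).argmax(-1)
            agree[(method, h)] = (pred == ref).float().mean().item()
        return agree

    # With a trained func/classifier, values close to 1.0 indicate solver-independent behavior.
    func, classifier = ODEFunc(), nn.Linear(2, 2)   # untrained stand-ins, for illustration only
    print(solver_agreement(func, classifier, torch.randn(256, 2)))

If the reported agreement drops noticeably below 1, the trained model is still in a regime where its behavior depends on the particular discretization rather than on an underlying ODE.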
Methods
  • The authors introduce a classification task based on the concentric sphere dataset proposed by Dupont et al. (2019).
  • For additional results on MNIST, the authors refer to Supplementary Material Section B.
  • The aim of these experiments is to analyze the dynamics of Neural ODEs and show their dependence on the specific solver used during training by testing the model with a different solver configuration (a sketch of this train/test solver comparison follows the list).
  • In Supplementary Material Section C, the authors provide additional experiments using (Euler, RK4) and (Midpoint, RK4) as (train, test) solver pairs.
  • The authors observe the same behavior for the tolerance adaptation algorithm used with adaptive methods.
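
The following sketch of such a train/test solver comparison assumes the third-party torchdiffeq package and an already trained ode_func/classifier pair; the helper names, step sizes, and tolerance values are illustrative assumptions rather than the authors' experimental settings.

    import torch
    from torchdiffeq import odeint  # assumed dependency: pip install torchdiffeq

    T = torch.tensor([0.0, 1.0])  # integrate the ODE block from t=0 to t=1

    def predict(ode_func, classifier, x, method, step_size=None, tol=None):
        """Solve the Neural ODE block with the given solver configuration, then classify."""
        kwargs = {}
        if step_size is not None:            # fixed-step solvers: euler, midpoint, rk4
            kwargs["options"] = {"step_size": step_size}
        if tol is not None:                  # adaptive solvers, e.g. dopri5
            kwargs["rtol"] = kwargs["atol"] = tol
        z1 = odeint(ode_func, x, T, method=method, **kwargs)[-1]
        return classifier(z1).argmax(dim=-1)

    @torch.no_grad()
    def accuracy_per_solver(ode_func, classifier, x, y):
        """Accuracy of one trained model under several solver configurations."""
        configs = {
            "euler, h=0.5 (train)": dict(method="euler", step_size=0.5),
            "rk4, h=0.5 (test)": dict(method="rk4", step_size=0.5),
            "dopri5, tol=1e-6 (reference)": dict(method="dopri5", tol=1e-6),
        }
        return {name: (predict(ode_func, classifier, x, **cfg) == y).float().mean().item()
                for name, cfg in configs.items()}

Training uses only one of these configurations (e.g. Euler with a fixed step size); a pronounced accuracy drop under the other configurations indicates that the trained model does not permit a continuous ODE interpretation.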
Results
  • In the main text, the authors do not plot the results for all training step sizes/tolerances but only for every second training step size/tolerance, to improve the clarity of the plots.
  • The authors include the plots showing all training runs, as well as additional results for all datasets.
  • Results for fixed-step solvers:
  • In addition to the concentric sphere dataset in 2 dimensions, the authors use this dataset in the higher dimensions 3, 10, and 900 (a sketch of such a dataset generator follows the list).
  • The authors present the results for fixed-step solvers.
  • The model is trained with Euler's method or a 4th-order Runge–Kutta method with different step sizes.
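
A minimal generator for such a concentric sphere dataset might look as follows; the radii, the class balance, and the uniform radial sampling are illustrative choices in the spirit of Dupont et al. (2019), not necessarily the exact construction used in the experiments.

    import torch

    def concentric_spheres(n_per_class, dim, r_inner=1.0, r_shell=(1.5, 2.0), seed=0):
        """Inner ball -> class 0, surrounding shell -> class 1, in `dim` dimensions."""
        g = torch.Generator().manual_seed(seed)

        def sample_shell(n, r_lo, r_hi):
            # Uniform direction on the unit sphere times a radius drawn from [r_lo, r_hi].
            direction = torch.randn(n, dim, generator=g)
            direction = direction / direction.norm(dim=1, keepdim=True)
            radius = r_lo + (r_hi - r_lo) * torch.rand(n, 1, generator=g)
            return direction * radius

        x0 = sample_shell(n_per_class, 0.0, r_inner)   # inner ball -> class 0
        x1 = sample_shell(n_per_class, *r_shell)       # outer shell -> class 1
        x = torch.cat([x0, x1])
        y = torch.cat([torch.zeros(n_per_class, dtype=torch.long),
                       torch.ones(n_per_class, dtype=torch.long)])
        return x, y

    # The same generator works in all dimensions used in the experiments (2, 3, 10, 900):
    for d in (2, 3, 10, 900):
        x, y = concentric_spheres(500, d)
        print(d, x.shape, y.float().mean().item())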
Conclusion
  • Trajectory crossing is a clear indication that the model violates ODE semantics. But even if trajectory crossing is not observed for some step size h, this does not mean that it will not be observed for smaller step sizes h' < h (see Supplementary Material Section A for an example).

    The authors have just described two effects, trajectory crossing and Lady Windermere’s fan, which can lead to a drop in performance when the model is tested with a different solver (a toy illustration of trajectory crossing follows this list).
  • The authors illustrated that the model becomes dependent on a specific train solver configuration for two reasons: the classifier uses the bias in the numerical global errors as a feature, and the classifier is sensitive to changes in the numerical solution.
  • The authors have verified this behavior on CIFAR10 as well as on a synthetic dataset, using fixed-step and adaptive methods.
  • The authors plan to eliminate the oscillatory behavior of the adaptation algorithm and to improve the tolerance adaptation algorithm to guarantee robust training on many datasets.
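
To make the trajectory-crossing effect concrete, here is a toy illustration (ours, not taken from the paper): for the one-dimensional linear ODE dy/dt = -k*y, exact solutions starting from different initial values never intersect, yet a single explicit Euler step with k*h > 1 swaps their order.

    # Exact solutions ya(t) = exp(-k*t) and yb(t) = 0.5*exp(-k*t) satisfy ya(t) > yb(t) for all t,
    # so any swap in the ordering below is purely an artifact of the large step size.
    k, h = 3.0, 0.5      # k*h = 1.5 > 1, so one Euler step multiplies the state by (1 - k*h) = -0.5
    ya, yb = 1.0, 0.5    # two initial conditions with ya > yb

    for step in range(3):
        ya, yb = ya + h * (-k * ya), yb + h * (-k * yb)   # explicit Euler update
        print(f"step {step + 1}: ya = {ya:+.3f}, yb = {yb:+.3f}, ordering preserved: {ya > yb}")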
Tables
  • Table 1: Accuracy and the number of function evaluations needed to achieve time-continuous dynamics, for both a grid search and the proposed step adaptation algorithm. For the grid search, we report the accuracy of the run with the smallest step size above the critical threshold.
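
The step adaptation algorithm referenced in Table 1 is not reproduced on this page; the following is only a hedged sketch of an adaptation rule consistent with the description given here (decrease the step size when predictions under the train solver and a more accurate reference solve disagree, and cautiously increase it when they agree up to a threshold). The function name, threshold, and scaling factors are illustrative assumptions.

    def adapt_step_size(h, agreement, threshold=0.99, shrink=0.5, grow=1.1, h_max=1.0):
        """agreement: fraction of identical predictions between step sizes h and h/2."""
        if agreement < threshold:
            return shrink * h            # solver dependence detected: integrate more accurately
        return min(grow * h, h_max)      # solvers agree: cautiously increase the step size

    # Example trace with made-up agreement values observed during training:
    h = 0.5
    for agreement in (0.90, 0.97, 0.995, 0.999, 0.95):
        h = adapt_step_size(h, agreement)
        print(f"agreement = {agreement:.3f} -> new step size h = {h:.3f}")

An analogous rule acting on the solver tolerance instead of the step size would cover the adaptive methods mentioned above.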
References
  • Benny Avelin and Kaj Nyström. Neural ODEs as the deep limit of ResNets with constant weights. Analysis and Applications, 2020. doi: 10.1142/S0219530520400023.
  • David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If ResNets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning, pp. 342–350, 2017.
  • Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.
  • Martin Benning, Elena Celledoni, Matthias J. Ehrhardt, Brynjulf Owren, and Carola-Bibiane Schönlieb. Deep learning as optimal control problems: Models and numerical methods. Journal of Computational Dynamics, 6:171, 2019. doi: 10.3934/jcd.2019009.
  • Lijun Bo, Agostino Capponi, and Huafu Liao. Relaxed control and gamma-convergence of stochastic optimization problems with mean field. arXiv preprint arXiv:1906.08894, 2019.
  • Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
  • Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583, 2018.
  • Krzysztof Choromanski, Jared Quincy Davis, Valerii Likhosherstov, Xingyou Song, Jean-Jacques Slotine, Jacob Varley, Honglak Lee, Adrian Weller, and Vikas Sindhwani. An ode to an ODE. arXiv preprint arXiv:2006.11421, 2020.
  • Marco Ciccone, Marco Gallieri, Jonathan Masci, Christian Osendorfer, and Faustino Gomez. NAIS-Net: Stable deep networks from non-autonomous differential equations. In Advances in Neural Information Processing Systems 31, pp. 3025–3035, 2018.
  • Miles Cranmer, Sam Greydanus, Stephan Hoyer, Peter Battaglia, David Spergel, and Shirley Ho. Lagrangian neural networks. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2020.
  • Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. In Advances in Neural Information Processing Systems, pp. 3134–3144, 2019.
  • Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017. doi: 10.1007/s40304-017-0103-z.
  • Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
  • Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.
  • Eldad Haber, Keegan Lensink, Eran Treister, and Lars Ruthotto. IMEXnet: A forward stable deep neural network. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 2525–2534, 2019.
  • E. Hairer, S. P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I: Nonstiff Problems. Springer, 2nd edition, 1993. ISBN 978-3-540-78862-1.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  • Pashupati Hegde, Markus Heinonen, Harri Lähdesmäki, and Samuel Kaski. Deep learning with differential Gaussian process flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1812–1821, 2019.
  • Jacob Kelly, Jesse Bettencourt, Matthew James Johnson, and David Duvenaud. Learning differential equations that are easy to solve. arXiv preprint arXiv:2007.04504, 2020.
  • Qianxiao Li, Ting Lin, and Zuowei Shen. Deep learning via dynamical systems: An approximation perspective. arXiv preprint arXiv:1912.10382, 2019.
  • Hongzhou Lin and Stefanie Jegelka. ResNet with one-neuron hidden layers is a universal approximator. In Advances in Neural Information Processing Systems 31, pp. 6169–6178, 2018.
  • Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 3276–3285, 2018.
  • Yiping Lu, Chao Ma, Yulong Lu, Jianfeng Lu, and Lexing Ying. A mean-field analysis of deep ResNet and beyond: Towards provable optimization via overparameterization from depth. arXiv preprint arXiv:2003.05508, 2020.
  • Stefano Massaroli, Michael Poli, Michelangelo Bin, Jinkyoo Park, Atsushi Yamashita, and Hajime Asama. Stable neural flows. arXiv preprint arXiv:2003.08063, 2020.
  • Houman Owhadi and Gene Ryan Yoo. Kernel flows: From learning kernels from data into the abyss. Journal of Computational Physics, 389:22–47, 2019. doi: 10.1016/j.jcp.2019.03.040.
  • Alejandro F. Queiruga, N. Benjamin Erichson, Dane Taylor, and Michael W. Mahoney. Continuous-in-depth neural networks. arXiv preprint arXiv:2008.02389, 2020.
  • Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. Journal of Mathematical Imaging and Vision, pp. 1–13, 2019.
  • Sho Sonoda and Noboru Murata. Transport analysis of infinitely deep neural network. Journal of Machine Learning Research, 20(2):1–52, 2019.
  • Matthew Thorpe and Yves van Gennip. Deep limits of residual neural networks. arXiv preprint arXiv:1810.11741, 2018.
  • Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. In The European Conference on Computer Vision (ECCV), 2018.
  • Weinan E, Jiequn Han, and Qianxiao Li. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):10, 2019.
  • Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500, 2017.
  • Hanshu Yan, Jiawei Du, Vincent Tan, and Jiashi Feng. On robustness of neural ordinary differential equations. In International Conference on Learning Representations, 2020.
  • Yibo Yang, Jianlong Wu, Hongyang Li, Xia Li, Tiancheng Shen, and Zhouchen Lin. Dynamical system inspired adaptive time stepping controller for residual network families. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
  • Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Han Zhang, Xi Gao, Jacob Unterman, and Tom Arodz. Approximation capabilities of neural ordinary differential equations. arXiv preprint arXiv:1907.12998, 2019a.
  • Jingfeng Zhang, Bo Han, Laura Wynter, Bryan Kian Hsiang Low, and Mohan Kankanhalli. Towards robust ResNet: A small step but a giant leap. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 4285–4291, 2019b.
  • Tianjun Zhang, Zhewei Yao, Amir Gholami, Joseph E. Gonzalez, Kurt Keutzer, Michael W. Mahoney, and George Biros. ANODEV2: A coupled neural ODE framework. In Advances in Neural Information Processing Systems 32, pp. 5151–5161, 2019c.
  • Juntang Zhuang, Nicha Dvornek, Xiaoxiao Li, Sekhar Tatikonda, Xenophon Papademetris, and James Duncan. Adaptive checkpoint adjoint method for gradient estimation in neural ODE. arXiv preprint arXiv:2006.02493, 2020.