# ResNet After All? Neural ODEs and Their Numerical Solution

International Conference on Learning Representations (ICLR), 2021.

Abstract:

A key appeal of the recently proposed Neural Ordinary Differential Equation (ODE) framework is that it seems to provide a continuous-time extension of discrete residual neural networks. As we show herein, though, trained Neural ODE models actually depend on the specific numerical method used during training. If the trained mo…

Introduction

- The choice of neural network architecture is an important consideration in the deep learning community.
- Besides the architectural advancements inspired by the original scheme (Zagoruyko & Komodakis, 2016; Xie et al., 2017), recently Neural Ordinary Differential Equation (Neural ODE) models (Chen et al., 2018; E, 2017; Lu et al., 2018; Haber & Ruthotto, 2017) have been proposed as an analog of continuous-depth ResNets.
- Inspired by the theoretical properties of the solution curves, Yan et al. (2020) propose a regularizer which improves the robustness of Neural ODE models even further.
- If Neural ODEs are chosen for their theoretical advantages, it is essential that the effective model—the combination of ODE problem and its solution via a particular numerical method—is a close approximation of the true analytical, but practically inaccessible, ODE solution.
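The gap between the analytic ODE solution and the effective model can be made concrete with a toy example (ours, not the paper's trained network): a linear vector field whose explicit-Euler discretization behaves qualitatively differently at a coarse step size.

```python
import math

def field(y):
    # Toy stand-in for the learned vector field: y' = -4y, with
    # analytic flow y(t) = y0 * exp(-4t).
    return -4.0 * y

def euler_trajectory(y0, h, t1=1.0):
    """The 'effective model': the vector field composed with explicit
    Euler at a fixed step size h."""
    ys, y = [y0], y0
    for _ in range(round(t1 / h)):
        y = y + h * field(y)
        ys.append(y)
    return ys

exact = math.exp(-4.0)                  # true solution at t=1 for y0=1, ~0.0183
print(euler_trajectory(1.0, 0.5))       # [1.0, -1.0, 1.0]: oscillates, never decays
print(euler_trajectory(1.0, 0.01)[-1])  # ~0.0169: close to the analytic value
```

At h = 0.5 the effective model is not even qualitatively an ODE solution, which is why checking the trained model against smaller step sizes matters.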

Highlights

- The choice of neural network architecture is an important consideration in the deep learning community
- Besides the architectural advancements inspired by the original scheme (Zagoruyko & Komodakis, 2016; Xie et al., 2017), recently Neural Ordinary Differential Equation (Neural ODE) models (Chen et al., 2018; E, 2017; Lu et al., 2018; Haber & Ruthotto, 2017) have been proposed as an analog of continuous-depth Residual Neural Networks (ResNets)
- While Neural ODEs do not necessarily improve upon the sheer predictive performance of ResNets, they make the vast body of ODE theory available to deep learning research
- As a first step we propose to check how robust the model is with respect to the step size/tolerance, ensuring that the resulting model is in a regime where ODE-ness is guaranteed and reasoning from ODE theory can be applied to the model
- We have shown that the step size of fixed-step solvers and the tolerance of adaptive methods used for training Neural ODEs impact whether the resulting model maintains the properties of ODE solutions
- We do not expect to achieve state-of-the-art results with this simple architecture but we expect our results to remain valid for more complicated architectures
- We developed step size and tolerance adaptation algorithms, which maintain a continuous ODE interpretation throughout training
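The robustness check proposed above can be sketched in a few lines. Everything here is illustrative: `euler_terminal` is a toy stand-in for a trained Neural ODE block, and the function names and the 5% threshold are our choices, not the paper's API.

```python
def euler_terminal(y0, h, t1=1.0, lam=-4.0):
    # Toy stand-in for a trained Neural ODE block: y' = lam*y via explicit Euler.
    y = y0
    for _ in range(max(1, round(t1 / h))):
        y = y + h * lam * y
    return y

def solver_robust(solve, y0, h, rel_tol=5e-2):
    """Re-solve with half the step size and accept the configuration only
    if the terminal state moves by less than rel_tol (relative)."""
    y_h, y_half = solve(y0, h), solve(y0, h / 2)
    return abs(y_half - y_h) <= rel_tol * max(abs(y_half), 1e-12)

print(solver_robust(euler_terminal, 1.0, 0.5))   # False: coarse step, solver-dependent
print(solver_robust(euler_terminal, 1.0, 0.01))  # True: fine step, ODE-like regime
```

A model that fails this check sits below the critical step size, so conclusions drawn from ODE theory about it are not trustworthy.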

Methods

- The authors introduce a classification task based on the concentric sphere dataset proposed by Dupont et al. (2019).
- For additional results on MNIST the authors refer to the Supplementary Material Section B
- The aim of these experiments is to analyze the dynamics of Neural ODEs and show their dependence on the specific solver used during training, by testing the model with a different solver configuration.
- In the Supplementary Material Section C the authors provide additional experiments using (Euler, RK4) and (midpoint, RK4) as train/test solver pairs.
- The authors observe analogous behavior for the tolerance-adaptation algorithm used with adaptive methods
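The train/test solver mismatch can be mimicked without any training (the toy field and the classifier cut are ours, not the paper's model): when a decision threshold is fitted to the biased Euler solution, evaluating the same "model" with RK4 flips the prediction.

```python
def field(y):
    return -4.0 * y  # toy dynamics standing in for the trained network

def euler(y0, h, t1=1.0):
    y = y0
    for _ in range(round(t1 / h)):
        y = y + h * field(y)
    return y

def rk4(y0, h, t1=1.0):
    y = y0
    for _ in range(round(t1 / h)):
        k1 = field(y)
        k2 = field(y + h / 2 * k1)
        k3 = field(y + h / 2 * k2)
        k4 = field(y + h * k3)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

h = 0.1
y_train = euler(1.0, h)  # ~0.0060: carries the O(h) Euler bias
y_test = rk4(1.0, h)     # ~0.0183: close to the exact exp(-4)
cut = 0.012              # a classifier threshold fitted between the two values
print(y_train < cut, y_test < cut)  # True False: the prediction flips
```

This is the "global error bias used as a feature" effect in miniature: the classifier cut only makes sense relative to the train solver's systematic error.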

Results

- In the main text the authors do not plot the results for all training step sizes/tolerances but only for every second training step size/tolerance, to improve the clarity of the plots.
- The authors also include plots showing all training runs, together with additional results for all datasets.
- Results for fixed-step solvers: in addition to the concentric sphere dataset in 2 dimensions, the authors use this dataset in dimensions 3, 10, and 900.
- The model is trained with Euler’s method or a 4th-order Runge-Kutta (RK4) method with different step sizes.
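A quick way to see why the training step size matters so differently for the two solvers (toy linear problem, not the paper's setup): the global error of Euler shrinks linearly in h, that of RK4 with the fourth power.

```python
import math

def field(y):
    return -y  # y' = -y with analytic solution y(t) = exp(-t)

def euler(y0, h, t1=1.0):
    y = y0
    for _ in range(round(t1 / h)):
        y = y + h * field(y)
    return y

def rk4(y0, h, t1=1.0):
    y = y0
    for _ in range(round(t1 / h)):
        k1 = field(y)
        k2 = field(y + h / 2 * k1)
        k3 = field(y + h / 2 * k2)
        k4 = field(y + h * k3)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

exact = math.exp(-1.0)
for h in (0.1, 0.05, 0.025):
    print(f"h={h}: Euler error {abs(euler(1.0, h) - exact):.1e}, "
          f"RK4 error {abs(rk4(1.0, h) - exact):.1e}")
# Halving h roughly halves the Euler error (order 1) but shrinks the
# RK4 error by roughly 16x (order 4).
```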

Conclusion

- Trajectory crossing is a clear indication that the model violates ODE semantics. But even if the authors do not observe trajectory crossing for some step size h, this does not mean that trajectories will not cross for all smaller step sizes h′ < h (see Supplementary Material Section A for an example).

The authors have just described two effects, trajectory crossing and Lady Windermere’s fan, which can lead to a drop in performance when the model is tested with a different solver.
- The authors illustrated that the model becomes dependent on a specific train-solver configuration because the classifier uses the bias in the numerical global errors as a feature and is sensitive to changes in the numerical solution
- The authors have verified this behavior on CIFAR10 as well as a synthetic dataset using fixed step and adaptive methods.
- The authors plan to eliminate the oscillatory behavior of the adaptation algorithm and improve the tolerance adaptation algorithm to guarantee robust training on many datasets
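In the spirit of the step-adaptation idea (the constants, names, and toy Euler block below are ours, not the paper's implementation), the rule "refine when the half-step solution disagrees, cautiously relax when it agrees" can be sketched as:

```python
def adapt_step(solve, y0, h, shrink_tol=5e-2, grow_tol=5e-3,
               h_min=1e-3, h_max=0.5):
    """Compare the terminal state at h and h/2: halve h when they disagree,
    cautiously grow h when they agree well, otherwise keep h.
    Thresholds are illustrative choices."""
    y_h, y_half = solve(y0, h), solve(y0, h / 2)
    drift = abs(y_half - y_h) / max(abs(y_half), 1e-12)
    if drift > shrink_tol:
        return max(h / 2, h_min)     # solver-dependent regime: refine
    if drift < grow_tol:
        return min(h * 1.25, h_max)  # comfortably converged: relax a bit
    return h

def euler_terminal(y0, h, t1=1.0, lam=-4.0):
    # Toy stand-in for a trained Neural ODE block: y' = lam*y via explicit Euler.
    y = y0
    for _ in range(max(1, round(t1 / h))):
        y = y + h * lam * y
    return y

h = 0.5
for step in range(12):  # mimic periodic checks during training
    h = adapt_step(euler_terminal, 1.0, h)
print(h)  # settles near ~0.008, small enough for solver-robust dynamics
```

An asymmetric rule (halve aggressively, grow slowly) is one simple way to damp the oscillatory behavior the authors mention.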


- Table 1: Accuracy and number of function evaluations needed to achieve time-continuous dynamics, using a grid search and the proposed step-adaptation algorithm. For the grid search, we report the accuracy of the run with the smallest step size above the critical threshold.

Related work

- The connections between ResNets and ODEs have been discussed in E (2017); Lu et al. (2018); Haber & Ruthotto (2017); Sonoda & Murata (2019). Behrmann et al. (2018) use similar ideas to build an invertible ResNet. Likewise, additional knowledge about the ODE solvers can be exploited to create more stable and robust architectures with a ResNet backbone (Haber & Ruthotto, 2017; Haber et al., 2019; Chang et al., 2018; Ruthotto & Haber, 2019; Ciccone et al., 2018; Cranmer et al., 2020; Benning et al., 2019).

Continuous-depth deep learning was first proposed in Chen et al. (2018); E (2017). Although ResNets are universal function approximators (Lin & Jegelka, 2018), Neural ODEs require specific architectural choices to be as expressive as their discrete counterparts (Dupont et al., 2019; Zhang et al., 2019a; Li et al., 2019). In this direction, one common approach is to introduce a time-dependence for the weights of the neural network (Zhang et al., 2019c; Avelin & Nyström, 2020; Choromanski et al., 2020; Queiruga et al., 2020). Other solutions include novel Neural ODE models (Lu et al., 2020; Massaroli et al., 2020) with improved training behavior, and variants based on kernels (Owhadi & Yoo, 2019) and Gaussian processes (Hegde et al., 2019). Adaptive ResNet architectures have been proposed in Veit & Belongie (2018); Chang et al. (2017). The dynamical systems view of ResNets has led to the development of methods using time step control as part of the ResNet architecture (Yang et al., 2020; Zhang et al., 2019b). Thorpe & van Gennip (2018) show that in the deep limit the Neural ODE block and its weights converge; this supports our argument for the existence of a critical step size. Weinan et al. (2019) and Bo et al. (2019) show the theoretical implications and advantages of a continuous formulation of ResNet models.

Funding

- If the performance of the test and train solvers agrees up to a threshold, we cautiously increase the accuracy parameter

References

- Benny Avelin and Kaj Nyström. Neural odes as the deep limit of resnets with constant weights. Analysis and Applications, 2020. doi: 10.1142/S0219530520400023.
- David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning, pp. 342–350, 2017.
- Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.
- Martin Benning, Elena Celledoni, Matthias J. Ehrhardt, Brynjulf Owren, and Carola-Bibiane Schönlieb. Deep learning as optimal control problems: Models and numerical methods. Journal of Computational Dynamics, 6:171, 2019. ISSN 2158-2491. doi: 10.3934/jcd.2019009.
- Lijun Bo, Agostino Capponi, and Huafu Liao. Relaxed control and gamma-convergence of stochastic optimization problems with mean field. arXiv preprint arXiv:1906.08894, 2019.
- Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
- Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in neural information processing systems, pp. 6571–6583. 2018.
- Krzysztof Choromanski, Jared Quincy Davis, Valerii Likhosherstov, Xingyou Song, Jean-Jacques Slotine, Jacob Varley, Honglak Lee, Adrian Weller, and Vikas Sindhwani. An ode to an ode. arXiv preprint arXiv:2006.11421, 2020.
- Marco Ciccone, Marco Gallieri, Jonathan Masci, Christian Osendorfer, and Faustino Gomez. Naisnet: Stable deep networks from non-autonomous differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 3025–3035. 2018.
- Miles Cranmer, Sam Greydanus, Stephan Hoyer, Peter Battaglia, David Spergel, and Shirley Ho. Lagrangian neural networks. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2020.
- Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural odes. In Advances in Neural Information Processing Systems, pp. 3134–3144. 2019.
- Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 3 2017. doi: 10.1007/s40304-017-0103-z.
- Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016.
- Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34 (1):014004, 2017.
- Eldad Haber, Keegan Lensink, Eran Treister, and Lars Ruthotto. IMEXnet a forward stable deep neural network. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 2525–2534, 2019.
- E. Hairer, S.P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I – Nonstiff Problems. Springer, 2 edition, 1993. ISBN 978-3-540-78862-1.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Pashupati Hegde, Markus Heinonen, Harri Lähdesmäki, and Samuel Kaski. Deep learning with differential gaussian process flows. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1812–1821, 2019.
- Jacob Kelly, Jesse Bettencourt, Matthew James Johnson, and David Duvenaud. Learning differential equations that are easy to solve. arXiv preprint arXiv:2007.04504, 2020.
- Qianxiao Li, Ting Lin, and Zuowei Shen. Deep learning via dynamical systems: An approximation perspective. arXiv preprint arXiv:1912.10382, 2019.
- Hongzhou Lin and Stefanie Jegelka. Resnet with one-neuron hidden layers is a universal approximator. In Advances in Neural Information Processing Systems 31, pp. 6169–6178. 2018.
- Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 3276–3285, 2018.
- Yiping Lu, Chao Ma, Yulong Lu, Jianfeng Lu, and Lexing Ying. A mean-field analysis of deep resnet and beyond: Towards provable optimization via overparameterization from depth. arXiv preprint arXiv:2003.05508, 2020.
- Stefano Massaroli, Michael Poli, Michelangelo Bin, Jinkyoo Park, Atsushi Yamashita, and Hajime Asama. Stable neural flows. arXiv preprint arXiv:2003.08063, 2020.
- Houman Owhadi and Gene Ryan Yoo. Kernel flows: From learning kernels from data into the abyss. Journal of Computational Physics, 389:22 – 47, 2019. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2019.03.040.
- Alejandro F Queiruga, N Benjamin Erichson, Dane Taylor, and Michael W Mahoney. Continuous-in-depth neural networks. arXiv preprint arXiv:2008.02389, 2020.
- Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. Journal of Mathematical Imaging and Vision, pp. 1–13, 2019.
- Sho Sonoda and Noboru Murata. Transport analysis of infinitely deep neural network. Journal of Machine Learning Research, 20(2):1–52, 2019.
- Matthew Thorpe and Yves van Gennip. Deep limits of residual neural networks. arXiv preprint arXiv:1810.11741, 2018.
- Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. In The European Conference on Computer Vision (ECCV), September 2018.
- E Weinan, Jiequn Han, and Qianxiao Li. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):10, 2019.
- Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500, 2017.
- Hanshu Yan, Jiawei Du, Vincent Tan, and Jiashi Feng. On robustness of neural ordinary differential equations. In International Conference on Learning Representations, 2020.
- Yibo Yang, Jianlong Wu, Hongyang Li, Xia Li, Tiancheng Shen, and Zhouchen Lin. Dynamical system inspired adaptive time stepping controller for residual network families. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- Han Zhang, Xi Gao, Jacob Unterman, and Tom Arodz. Approximation capabilities of neural ordinary differential equations. arXiv preprint arXiv:1907.12998, 2019a.
- Jingfeng Zhang, Bo Han, Laura Wynter, Bryan Kian Hsiang Low, and Mohan Kankanhalli. Towards robust resnet: A small step but a giant leap. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 4285–4291, 2019b.
- Tianjun Zhang, Zhewei Yao, Amir Gholami, Joseph E Gonzalez, Kurt Keutzer, Michael W Mahoney, and George Biros. Anodev2: A coupled neural ode framework. In Advances in Neural Information Processing Systems 32, pp. 5151–5161, 2019c.
- Juntang Zhuang, Nicha Dvornek, Xiaoxiao Li, Sekhar Tatikonda, Xenophon Papademetris, and James Duncan. Adaptive checkpoint adjoint method for gradient estimation in neural ode. arXiv preprint arXiv:2006.02493, 2020.
