# Differentiable Likelihoods for Fast Inversion of 'Likelihood-Free' Dynamical Systems

ICML 2020, pp. 5198–5208.

Abstract:

Likelihood-free (a.k.a. simulation-based) inference problems are inverse problems with expensive, or intractable, forward models. ODE inverse problems are commonly treated as likelihood-free, as their forward map has to be numerically approximated by an ODE solver. This, however, is not a fundamental constraint but just a lack of functionality …

Introduction

- Inferring the parameters of dynamical systems that are defined by ordinary differential equations (ODEs) is of importance in almost all areas of science and engineering.
- If a dynamical system is accurately described by an ODE, its explicit mathematical definition should be exploited to design efficient algorithms, not ignored and treated as a black-box, likelihood-free inference problem.
- To this end, the authors construct a local Gaussian approximation of the likelihood by Gaussian ODE Filtering, a probabilistic numerical method (PNM) for ODE forward problems.
- The gradient, Hessian, etc. of the approximated log-likelihood can, via a cheap estimator J of the Jacobian of the map θ → mθ, be computed in closed form (Section 5)
- In this way, the probabilistic information from Gaussian ODE filtering yields a tractable, twice-differentiable likelihood for ‘likelihood-free’ ODE inverse problems.
- Much thought has been devoted to improving the slow run-time …
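To make the construction concrete, here is a minimal sketch (not the paper's implementation) of the Gaussian negative log-likelihood E(θ) = −log N(z; m_θ, Σ) and its closed-form gradient ∇E(θ) = Jᵀ Σ⁻¹ (m_θ − z). The mean m_θ, covariance Σ, and Jacobian estimate J are assumed to be supplied by the ODE filter; all function names are hypothetical.

```python
import numpy as np

def neg_log_likelihood(z, m, cov):
    """Gaussian negative log-likelihood E(θ) = −log N(z; m_θ, Σ)."""
    r = z - m
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (r @ np.linalg.solve(cov, r) + logdet + len(z) * np.log(2 * np.pi))

def grad_neg_log_likelihood(z, m, cov, J):
    """Closed-form gradient ∇E(θ) = Jᵀ Σ⁻¹ (m_θ − z), given a Jacobian
    estimate J of the map θ → m_θ (supplied externally)."""
    return J.T @ np.linalg.solve(cov, m - z)
```

For a forward map that is linear in θ this gradient is exact; for an ODE filter mean it holds to first order in the linearization.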

Highlights

- Inferring the parameters of dynamical systems that are defined by ordinary differential equations (ODEs) is of importance in almost all areas of science and engineering
- (See Hennig et al. (2015) or Oates & Sullivan (2019) for a broad introduction to probabilistic numerical methods.) The key insight of this work is that there is a likelihood in simulations of ordinary differential equations, and that it can be approximated cheaply and analytically: the mean estimate mθ of the forward solution computed by Gaussian ODE filters can be linearized in the parameter θ, so that the gradient, Hessian, etc. of the approximated log-likelihood can be computed in closed form via a cheap estimator J of the Jacobian of the map θ → mθ (Section 5)
- Out of the new family of gradient-based optimizers and samplers introduced in Section 6, we evaluate only the most basic ones: gradient descent (GD) and Newton’s method (NWT) for optimization, as well as PLMC and PHMC for sampling
- As optimizers and samplers differ in their purpose, we evaluate them in a slightly different way: Optimizers try to quickly find a local minimum and we plot the negative log-likelihood at each iteration
- We introduced a novel Jacobian estimator for ordinary differential equations solutions w.r.t. their parameters which implies approximate estimators of the gradient and Hessian of the log-likelihood
- We proposed new first- and second-order optimization and sampling methods for ordinary differential equation inverse problems, which outperformed standard 'likelihood-free' approaches, namely random search optimization and random-walk Metropolis MCMC, in all conducted experiments
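The two basic optimizers named above (GD and NWT) can be sketched with the gradient Jᵀ Σ⁻¹ (m_θ − z) and a Gauss–Newton-style Hessian Jᵀ Σ⁻¹ J standing in for the paper's estimators; the function names, learning rate, and damping term below are illustrative, not from the paper.

```python
import numpy as np

def gradient_descent_step(theta, z, m, cov, J, lr=1e-2):
    """One GD step on the approximated negative log-likelihood."""
    g = J.T @ np.linalg.solve(cov, m - z)   # gradient estimate
    return theta - lr * g

def newton_step(theta, z, m, cov, J, damping=1e-8):
    """One Newton step with a Gauss-Newton Hessian H ≈ Jᵀ Σ⁻¹ J,
    regularized by a small damping term for numerical safety."""
    g = J.T @ np.linalg.solve(cov, m - z)
    H = J.T @ np.linalg.solve(cov, J) + damping * np.eye(len(theta))
    return theta - np.linalg.solve(H, g)
```

If the forward map is linear in θ, the Gauss-Newton Hessian is exact and a single Newton step lands on the minimizer.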

Methods

- To test the hypothesis that the gradient and Hessian estimators [∇̂_θ E(z), ∇̂²_θ E(z)] of the log-likelihood are useful despite their approximate nature, the authors compare the new optimization and sampling methods from Section 6, which use these estimators as if exact, with the standard 'likelihood-free' approach, i.e. with random search (RS) optimization and random-walk Metropolis (RWM) sampling.

- (Section 7.1) For all optimizers, the authors picked the best step size and, for all samplers, the best proposal width within the interval [10⁻¹⁶, 10⁰], which is wide enough to contain all plausible values
- To make these experiments an ablation study for the gradient and Hessian estimators, the authors use Gaussian ODE filtering as a forward solver in all methods—which is similar to classical solvers anyway (Schober et al, 2018, Section 3).
- The same applies in the regime P ≫ σ²I_M; adaptation of their relative scale, by choosing σ²_dif as in Section 6.3, only matters when both error sources are of comparable scale
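The gradient-informed samplers evaluated against RWM (e.g. PLMC) build on Langevin proposals. As a hedged sketch, here is a generic Metropolis-adjusted Langevin step driven by a gradient estimate of E(θ); this is a standard MALA step, not the paper's exact algorithm, and `neg_log_lik` and `grad` are placeholders for the filter-based estimators.

```python
import numpy as np

def mala_step(theta, neg_log_lik, grad, step, rng):
    """One Metropolis-adjusted Langevin step on E = neg. log-likelihood.
    Proposal: θ' = θ − (step²/2)·∇E(θ) + step·ξ, with ξ ~ N(0, I)."""
    def log_q(a, b):  # log density (up to a constant) of proposing a from b
        mu = b - 0.5 * step**2 * grad(b)
        return -np.sum((a - mu)**2) / (2 * step**2)

    prop = theta - 0.5 * step**2 * grad(theta) + step * rng.standard_normal(theta.shape)
    log_alpha = (-neg_log_lik(prop) + neg_log_lik(theta)
                 + log_q(theta, prop) - log_q(prop, theta))
    return prop if np.log(rng.uniform()) < log_alpha else theta
```

The Metropolis correction makes the target distribution exact regardless of the step size, which only affects mixing speed.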

Results

- The authors evaluate the performance of these methods over the first few iterations. As optimizers and samplers differ in their purpose, they are evaluated in slightly different ways: optimizers try to quickly find a local minimum, so the negative log-likelihood is plotted at each iteration (lower is better).
- Samplers, on the other hand, try to find and cover regions of high probability.
- Successful sampling tends to go into and stay in regions of high probability; samplers that 'fall off' into regions of low likelihood have failed.
- Lotka–Volterra: first, the authors study the Lotka–Volterra (LV) ODE (Lotka, 1978)
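For reference, the Lotka–Volterra dynamics can be simulated with any classical solver. Below is a self-contained fixed-step RK4 sketch; the parameter values and the solver choice are illustrative and do not reproduce the paper's Gaussian ODE filter setup.

```python
import numpy as np

def lotka_volterra(x, theta):
    """LV vector field: prey u, predator v; theta = (a, b, c, d)."""
    a, b, c, d = theta
    u, v = x
    return np.array([a * u - b * u * v, -c * v + d * u * v])

def rk4_solve(f, x0, theta, t0, t1, h):
    """Fixed-step classical RK4, a simple stand-in for a forward solver."""
    ts = np.arange(t0, t1 + h / 2, h)
    xs = [np.asarray(x0, dtype=float)]
    for _ in ts[:-1]:
        x = xs[-1]
        k1 = f(x, theta)
        k2 = f(x + h / 2 * k1, theta)
        k3 = f(x + h / 2 * k2, theta)
        k4 = f(x + h * k3, theta)
        xs.append(x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4))
    return ts, np.array(xs)
```

A useful correctness check is the LV invariant V(u, v) = d·u − c·ln u + b·v − a·ln v, which is conserved along exact trajectories.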

Conclusion

- The authors introduced a novel Jacobian estimator for ODE solutions w.r.t. their parameters which implies approximate estimators of the gradient and Hessian of the log-likelihood.
- Using these estimators, the authors proposed new first- and second-order optimization and sampling methods for ODE inverse problems, which outperformed standard 'likelihood-free' approaches, namely random search optimization and random-walk Metropolis MCMC, in all conducted experiments.
- The employed Jacobian estimator constitutes a new method for fast, approximate sensitivity analysis.


Funding

- HK, NK and PH gratefully acknowledge financial support by the European Research Council through ERC StG Action 757275 / PANAMA; the DFG Cluster of Excellence Machine Learning - New Perspectives for Science, EXC 2064/1, project number 390727645; the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A); and funds from the Ministry of Science, Research and Arts of the State of Baden-Württemberg
- HK, NK and PH also gratefully acknowledge financial support by the German Federal Ministry of Education and Research (BMBF) through Project ADIMEM (FKZ 01IS18052B)

Reference

- Abdulle, A. and Garegnani, G. Random time step probabilistic methods for uncertainty quantification in chaotic and geometric numerical integration. arXiv:1703.03680 [math.NA], 2018.
- Betancourt, M. A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434 [stat.ME], 2017.
- Bottou, L., Curtis, F., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 2018.
- Calderhead, B., Girolami, M., and Lawrence, N. Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes. In Advances in Neural Information Processing Systems (NeurIPS), 2008.
- Chen, R., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
- Chkrebtii, O. A., Campbell, D. A., Calderhead, B., and Girolami, M. A. Bayesian solution uncertainty quantification for differential equations. Bayesian Analysis, 11 (4):1239–1267, 2016.
- Cockayne, J., Oates, C., Sullivan, T., and Girolami, M. Probabilistic numerical methods for PDE-constrained Bayesian inverse problems. AIP Conference Proceedings, 2017.
- Conrad, P. R., Girolami, M., Sarkka, S., Stuart, A., and Zygalakis, K. Statistical analysis of differential equations: introducing probability measures on numerical solutions. Statistics and Computing, 2017.
- Cranmer, K., Brehmer, J., and Louppe, G. The frontier of simulation-based inference. arXiv:1911.01429 [stat.ML], 2019.
- Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 2011.
- Golub, G. and Van Loan, C. Matrix computations. Johns Hopkins University Press, 4th edition, 1996.
- Gorbach, N. S., Bauer, S., and Buhmann, J. M. Scalable variational inference for dynamical systems. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- Hartikainen, J. and Särkkä, S. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2010.
- Hennig, P., Osborne, M. A., and Girolami, M. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 471(2179), 2015.
- Hull, T., Enright, W., Fellen, B., and Sedgwick, A. Comparing numerical methods for ordinary differential equations. SIAM Journal on Numerical Analysis, 9(4):603–637, 1972.
- John, D., Heuveline, V., and Schober, M. GOODE: A Gaussian off-the-shelf ordinary differential equation solver. In International Conference on Machine Learning (ICML), 2019.
- Kelley, W. and Peterson, A. The Theory of Differential Equations: Classical and Qualitative. Springer, 2010.
- Kersting, H., Sullivan, T. J., and Hennig, P. Convergence rates of Gaussian ODE filters. arXiv:1807.09737v2 [math.NA], 2019.
- Lie, H. C., Stuart, A. M., and Sullivan, T. J. Strong convergence rates of probabilistic integrators for ordinary differential equations. Statistics and Computing, 2019.
- Lotka, A. The growth of mixed populations: two species competing for a common food supply. The Golden Age of Theoretical Ecology: 1923–1940, 1978.
- Macdonald, B. and Husmeier, D. Gradient matching methods for computational inference in mechanistic models for systems biology: A review and comparative analysis. Frontiers in bioengineering and biotechnology, 3, 2015.
- MacKay, D. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
- Matsuda, T. and Miyatake, Y. Estimation of ordinary differential equation models with discretization error quantification. arXiv:1907.10565 [stat.ME], 2019.
- Meeds, E. and Welling, M. GPS-ABC: Gaussian process surrogate approximate Bayesian computation. In Uncertainty in Artificial Intelligence (UAI), 2014.
- Oates, C., Cockayne, J., Aykroyd, R., and Girolami, M. Bayesian probabilistic numerical methods in time-dependent state estimation for industrial hydrocyclone equipment. Journal of the American Statistical Association, 2019.
- Oates, C. J. and Sullivan, T. J. A modern retrospective on probabilistic numerics. Statistics and Computing, 2019.
- O’Hagan, A. Bayesian analysis of computer code outputs: A tutorial. Reliability Engineering & System Safety, 2006.
- Perdikaris, P. and Karniadakis, G. E. Model inversion via multi-fidelity Bayesian optimization: a new paradigm for parameter estimation in haemodynamics, and beyond. Journal of the Royal Society Interface, 13, 2016.
- Qi, Y. and Minka, T. Hessian-based Markov chain Monte Carlo algorithms. Workshop on Monte Carlo Methods, 2002.
- Rackauckas, C., Ma, Y., Dixit, V., Guo, X., Innes, M., Revels, J., Nyberg, J., and Ivaturi, V. A comparison of automatic differentiation and continuous sensitivity analysis for derivatives of differential equation solutions. arXiv:1812.01892 [math.NA], 2018.
- Rastrigin, L. A. The convergence of the random search method in the extremal control of a many parameter system. Automation and Remote Control, 24, 1963.
- Roberts, G. and Tweedie, R. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
- Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. Design and analysis of computer experiments. Statistical Science, 1989.
- Särkkä, S. Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
- Särkkä, S. and Solin, A. Applied Stochastic Differential Equations. Cambridge University Press, 2019.
- Schiesser, W. E. and Griffiths, G. W. A Compendium of Partial Differential Equation Models: Method of Lines Analysis with Matlab. Cambridge University Press, 1st edition, 2009.
- Schillings, C., Sunnåker, M., Stelling, J., and Schwab, C. Efficient characterization of parametric uncertainty of complex (bio)chemical networks. PLOS Computational Biology, 11, 2015.
- Schober, M., Duvenaud, D., and Hennig, P. Probabilistic ODE solvers with Runge–Kutta means. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
- Schober, M., Särkkä, S., and Hennig, P. A probabilistic model for the numerical solution of initial value problems. Statistics and Computing, 2018.
- Solak, E., Murray-Smith, R., Leithead, W. E., Leith, D. J., and Rasmussen, C. E. Derivative observations in Gaussian process models of dynamic systems. In Advances in Neural Information Processing Systems (NeurIPS), 2003.
- Tarantola, A. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, 2005.
- Teymur, O., Lie, H. C., Sullivan, T. J., and Calderhead, B. Implicit probabilistic integrators for ODEs. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
- Tronarp, F., Kersting, H., Särkkä, S., and Hennig, P. Probabilistic solutions to ordinary differential equations as nonlinear Bayesian filtering: a new perspective. Statistics and Computing, 2019.
- Vyshemirsky, V. and Girolami, M. A. Bayesian ranking of biochemical system models. Bioinformatics, 24(6): 833–839, 2008.
- Wenk, P., Abbati, G., Bauer, S., Osborne, M. A., Krause, A., and Schölkopf, B. ODIN: ODE-informed regression for parameter and state inference in time-continuous dynamical systems. In AAAI Conference on Artificial Intelligence, 2019.
- The probabilistic state space model is thereby completely defined. Gaussian ODE filtering is equivalent to running a Gaussian filter on this probabilistic state space model. For more details on Gaussian ODE filters, see Kersting et al. (2019) or Schober et al. (2018). An extension to more Bayesian filters—such as particle filters—is provided by Tronarp et al. (2019).
- The model follows from a GP prior on x (eq. (38)) and a linear Gaussian measurement model (eq. (40)) with derivative data (eq. (45)). Hence, the classical framework for GP regression with derivative observations, as introduced in Solak et al. (2003), is applicable: it a priori models the state x and its derivative ẋ as a multi-task GP.
- A detailed derivation of eqs. (49) to (51) can be found in Schober et al. (2014, Supplement B).
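The predict/update recursion behind Gaussian ODE filtering can be sketched for a scalar ODE with a once-integrated Wiener process prior on (x, ẋ): predict with the prior dynamics, then treat the vector field evaluated at the predicted mean as a noise-free 'observation' of ẋ. This is a simplified illustration of the scheme in the references above, not the authors' code; the function name and the diffusion intensity `q` are illustrative.

```python
import numpy as np

def ode_filter_step(m, P, f, h, q=1.0):
    """One Gaussian ODE filter step for a scalar ODE ẋ = f(x).
    State is (x, ẋ) under a once-integrated Wiener process prior
    with diffusion intensity q; the 'data' is z = f(m⁻_x), observed
    through H = [0, 1] with zero measurement noise."""
    A = np.array([[1.0, h], [0.0, 1.0]])                  # transition matrix
    Q = q * np.array([[h**3 / 3, h**2 / 2],
                      [h**2 / 2, h]])                     # process noise
    m_pred = A @ m                                        # predict
    P_pred = A @ P @ A.T + Q
    H = np.array([[0.0, 1.0]])
    z = np.array([f(m_pred[0])])                          # evaluate f at predicted mean
    S = H @ P_pred @ H.T                                  # innovation covariance (R = 0)
    K = P_pred @ H.T / S                                  # Kalman gain
    m_new = m_pred + (K * (z - H @ m_pred)).ravel()       # update
    P_new = P_pred - K @ H @ P_pred
    return m_new, P_new
```

On a linear test problem ẋ = λx this recursion tracks the true solution closely while P quantifies the discretization uncertainty.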
- In eq. (77), we exploited the fact that Q ⪯ Q + Q′ (i.e. that (Q + Q′) − Q = Q′ is positive semi-definite) and therefore λᵢ(Q) ≤ λᵢ(Q + Q′) for ordered eigenvalues λ₁(Q) ≤ · · · ≤ λ_{m∧n}(Q) counted by algebraic multiplicity. This fact is an immediate consequence of Theorem 8.1.5 in Golub & Van Loan (1996).
- Proof. First, recall that the convergence rates of O(h) provided by Theorem 6.7 in Kersting et al. (2019) only depend on f through the dependence of the constant K(T ) > 0 on the Lipschitz constant L of f. But this L is independent of θ by Assumption 1.
- Hence, Theorem 6.7 from Kersting et al. (2019) yields, under Assumption 2, that sup_{θ∈Θᵒ} ‖m⁻_θ(T) − x_θ(T)‖ = O(h)
- Moreover, Theorem 8.49 in Kelley & Peterson (2010) is applicable under Assumption 1 and implies that x_θ(t) is continuous and has continuous first-order partial derivatives with respect to θ_k. By construction (recall eq. (10)), the filtering mean m_θ(t) has the same regularity. Hence, Lemma 4 can be applied with x = h, Λ = Θ, λ = θ, and g(x, λ) = m⁻_θ(T) − x_θ(T) from eq. (91), which yields eq. (90)
- Glucose uptake in yeast (GUiY) is described by mass-action kinetics. In the notation of Schillings et al. (2015), the underlying ODE is given by:

  ẋ_eGlc = −k₁ x_eE x_eGlc + k₋₁ x_eE–Glc
  ẋ_iGlc = −k₂ x_iE x_iGlc + k₋₂ x_iE–Glc
  ẋ_iE–G6P = k₄ x_iE x_iG6P + k₋₄ x_iE–G6P
  ẋ_iE–Glc–G6P = k₃ x_iE–Glc x_iG6P − k₋₃ x_iE–Glc–G6P
  ẋ_iG6P = −k₃ x_iE–Glc x_iG6P + k₋₃ x_iE–Glc–G6P − k₂ x_iE x_iGlc + k₋₂ x_iE–Glc

  where k₁, k₋₁, k₂, k₋₂, k₃, k₋₃, k₄, k₋₄, α, and β are the 10 parameters. Note that this system satisfies Assumption 1. Following Schillings et al. (2015) and Gorbach et al. (2017), we used this ODE with initial value x₀ = 1_M, time interval [0, 100], and true parameter θ* = [0.1, 0.0, 0.4, 0.0, 0.3, 0.0, 0.7, 0.0, 0.1, 0.2]. To generate data by eq. (3), we added Gaussian noise with variance σ² = 10⁻⁵ to the corresponding solution at time points [1, 2, 4, 5, 7, 10, 15, 20, 30, 40, 50, 60, 80, 100]. The optimizers and samplers were initialized at θ = 1.2 · θ* = [0.12, 0, 0.48, 0, 0.36, 0, 0.84, 0, 0.12, 0.24], and the forward solutions for all likelihood evaluations were computed with step size h = 1.0.
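The data-generation step described above (Gaussian noise with variance σ² = 10⁻⁵ added to the solution at the listed time points, as in eq. (3)) can be sketched generically. `solution` is a placeholder for the true ODE solution x_θ*(t); the function name is hypothetical.

```python
import numpy as np

# Observation grid and noise level from the GUiY experiment described above.
T_OBS = np.array([1., 2., 4., 5., 7., 10., 15., 20., 30., 40., 50., 60., 80., 100.])
SIGMA2 = 1e-5

def generate_data(solution, t_obs=T_OBS, sigma2=SIGMA2, rng=None):
    """Generate observations z_i = x(t_i) + ε_i with ε_i ~ N(0, σ² I),
    mirroring the eq. (3)-style data generation in the setup above."""
    if rng is None:
        rng = np.random.default_rng()
    xs = np.array([np.atleast_1d(solution(t)) for t in t_obs])
    return xs + np.sqrt(sigma2) * rng.standard_normal(xs.shape)
```

With σ² = 10⁻⁵ the noise standard deviation is about 3·10⁻³, so the observations stay very close to the underlying trajectory.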
