Towards Safe Policy Improvement for Non-Stationary MDPs

Yash Chandak
Scott Jordan
Martha White

NeurIPS 2020.


Abstract:

Many real-world sequential decision-making problems involve critical systems with financial risks and human-life risks. While several works in the past have proposed methods that are safe for deployment, they assume that the underlying problem is stationary. However, many real-world problems of interest exhibit non-stationarity, and when the stationarity assumption is violated, these methods may no longer ensure safety. …

Introduction
  • Reinforcement learning (RL) methods have been applied to real-world sequential decision-making problems such as diabetes management [5], sepsis treatment [50], and budget constrained bidding [66].
  • For such real-world applications, safety guarantees are critical to mitigate serious risks to both human life and monetary assets.
  • Prior methods ensure safety only when the stationarity assumption holds, which is rarely the case in real-world problems
Highlights
  • Reinforcement learning (RL) methods have been applied to real-world sequential decision-making problems such as diabetes management [5], sepsis treatment [50], and budget constrained bidding [66]
  • RL algorithms designed to ensure safety [47, 26, 58, 69, 38, 17] model the environment as a Markov decision process (MDP), and rely upon the stationarity assumption made by MDPs [53]
  • In Appendix B, we provide an example of a non-stationary MDP (NS-MDP) for which Theorem 1 holds with exact equality, illustrating that the bound is tight
  • Perhaps counter-intuitively, the failure rate for Baseline is much higher than 5% for slower speeds. This can be attributed to the fact that at higher speeds, greater reward fluctuations result in more variance in the performance estimates, causing the confidence interval (CI) used by Baseline to be looser and leaving Baseline with insufficient confidence of policy improvement to make a policy update (see the sketch after this list)
  • Our experimental results call into question the popular misconception that violating the stationarity assumption is not a severe problem when changes are slow
  • It can be quite the opposite: Slow changes can be more deceptive and can make existing algorithms, which do not account for non-stationarity, more susceptible to deploying unsafe policies
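The bullets above describe a policy update being made only when a confidence interval over forecasted future performance gives sufficient confidence of improvement. As a rough Python sketch of that kind of high-confidence test (illustrative only, not the authors' implementation; all names and defaults are assumptions), a candidate policy is approved only if a one-sided lower confidence bound on its forecasted future performance is at least the forecast for πsafe:

```python
import numpy as np

def approve_candidate(candidate_forecasts, safe_forecast, alpha=0.05):
    """Hypothetical high-confidence safety test (not the paper's code).

    candidate_forecasts : bootstrap forecasts of the candidate policy's
                          performance in the future episode k + delta.
    safe_forecast       : forecasted future performance of pi_safe.
    alpha               : permissible failure rate (5% in the experiments).
    """
    # One-sided lower confidence bound: the alpha-quantile of the forecasts.
    lower_bound = np.quantile(candidate_forecasts, alpha)
    # Approve the update only with (1 - alpha)-confidence of no degradation;
    # otherwise the agent keeps deploying pi_safe.
    return lower_bound >= safe_forecast

# A looser CI (more variance in the forecasts) withholds the update, which is
# why Baseline reverts to pi_safe more often at higher speeds.
rng = np.random.default_rng(0)
print(approve_candidate(rng.normal(1.2, 0.05, 1000), safe_forecast=1.0))  # True
print(approve_candidate(rng.normal(1.2, 2.00, 1000), safe_forecast=1.0))  # False
```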
Results
  • An ideal algorithm should adhere to the safety constraint in (1), maximize future performance, and be robust to hyper-parameters even in the presence of non-stationarity.
  • Perhaps counter-intuitively, the failure rate for Baseline is much higher than 5% for slower speeds.
  • At higher speeds Baseline becomes safer as it reverts to πsafe more often.
  • This calls into question the popular misconception that violating the stationarity assumption is not a severe problem when changes are slow, as in practice slower changes might be harder for an algorithm to identify and might jeopardize safety.
  • Even though bootstrap CIs do not have guaranteed coverage with a finite number of samples [24], they still allow SPIN to maintain a failure rate near the 5% target (a sketch of one such bootstrap CI follows this list)
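The heavy presence of wild-bootstrap references in the bibliography below suggests CIs of that family. The following is a minimal, generic wild-bootstrap percentile CI for a forecast from a linear-in-features trend model, offered as an illustration rather than the paper's exact procedure (function and variable names are assumptions):

```python
import numpy as np

def wild_bootstrap_forecast_ci(X, y, x_future, n_boot=2000, alpha=0.05, seed=0):
    """Generic wild-bootstrap percentile CI for a trend forecast (illustrative).

    X        : (n, d) features of past episodes (e.g., a basis over the episode index).
    y        : (n,) noisy performance estimates for those episodes.
    x_future : (d,) features of the future episode k + delta.
    """
    rng = np.random.default_rng(seed)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)        # fit the performance trend
    residuals = y - X @ beta
    forecasts = np.empty(n_boot)
    for b in range(n_boot):
        # Rademacher multipliers flip residual signs but keep their magnitudes,
        # preserving episode-specific (heteroskedastic) noise levels.
        signs = rng.choice([-1.0, 1.0], size=len(y))
        y_star = X @ beta + signs * residuals
        beta_star, *_ = np.linalg.lstsq(X, y_star, rcond=None)
        forecasts[b] = x_future @ beta_star
    lo, hi = np.quantile(forecasts, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```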
Conclusion
  • The authors took several first steps towards ensuring safe policy improvement for NS-MDPs.
  • The authors' experimental results call into question the popular misconception that violating the stationarity assumption is not a severe problem when changes are slow.
  • It can be quite the opposite: Slow changes can be more deceptive and can make existing algorithms, which do not account for non-stationarity, more susceptible to deploying unsafe policies
Summary
  • Introduction:

    Reinforcement learning (RL) methods have been applied to real-world sequential decision-making problems such as diabetes management [5], sepsis treatment [50], and budget constrained bidding [66].
  • For such real-world applications, safety guarantees are critical to mitigate serious risks to both human life and monetary assets.
  • Prior methods ensure safety only when the stationarity assumption holds, which is rarely the case in real-world problems
  • Objectives:

    This raises the main question the authors aim to address: How can the authors build sequential decision-making systems that provide safety guarantees for problems with non-stationarities?
  • The authors aim to create an algorithm alg that ensures, with high probability, that alg(D), the policy proposed by alg, does not perform worse than the existing safe policy πsafe during the future episode k + δ.
  • The authors aim to ensure the safety guarantee in (1), sketched after this list.
  • To analyse an algorithm's behavior, the authors investigate three questions: whether it adheres to the safety constraint in (1), whether it maximizes future performance, and whether it is robust to hyper-parameters.
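Constraint (1) itself is not reproduced on this page. Under the definitions above, a plausible reconstruction (the paper's exact notation may differ) is the following, where ρ(π, k) denotes the performance of policy π in episode k and α is the permissible failure rate:

```latex
% Plausible reconstruction of safety constraint (1); notation may differ from the paper.
% With probability at least 1 - \alpha, the policy alg(D) proposed for the future
% episode k + \delta performs no worse than the existing safe policy \pi_{safe}.
\[
  \Pr\!\Big( \rho\big(\mathrm{alg}(D),\, k + \delta\big) \;\ge\; \rho\big(\pi_{\mathrm{safe}},\, k + \delta\big) \Big) \;\ge\; 1 - \alpha .
\]
```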
  • Results:

    An ideal algorithm should adhere to the safety constraint in (1), maximize future performance, and be robust to hyper-parameters even in the presence of non-stationarity.
  • Perhaps counter-intuitively, the failure rate for Baseline is much higher than 5% for slower speeds.
  • At higher speeds Baseline becomes safer as it reverts to πsafe more often.
  • This calls into question the popular misconception that violating the stationarity assumption is not a severe problem when changes are slow, as in practice slower changes might be harder for an algorithm to identify and might jeopardize safety.
  • Even though bootstrap CIs do not have guaranteed coverage with a finite number of samples [24], they still allow SPIN to maintain a failure rate near the 5% target
  • Conclusion:

    The authors took several first steps towards ensuring safe policy improvement for NS-MDPs.
  • The authors' experimental results call into question the popular misconception that violating the stationarity assumption is not a severe problem when changes are slow.
  • It can be quite the opposite: Slow changes can be more deceptive and can make existing algorithms, which do not account for non-stationarity, more susceptible to deploying unsafe policies
Tables
  • Table 1: Ablation study on the RecoSys domain. (Left) Algorithm. (Middle) Improvement over πsafe
  • Table 2: List of symbols used in the main paper and their associated meanings
  • Table 3: Here, N and η represent the number of gradient steps and the learning rate used when performing Line 14 of Algorithm 3, and d is the dimension of the Fourier basis. Notice that d is set to different values to provide results for settings where SPIN cannot model the performance trend of policies exactly, so that Assumption 1 is violated. This resembles practical settings, where the true underlying trend cannot be known exactly; it can only be coarsely approximated (a minimal forecasting sketch using N, η, and d follows this list)
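Table 3 refers to fitting a d-dimensional Fourier basis to a policy's performance trend using N gradient steps at learning rate η. The sketch below is a minimal, hypothetical version of such a fit-and-forecast step (plain mean-squared-error gradient descent); it is not the paper's Algorithm 3, and all names and defaults are assumptions:

```python
import numpy as np

def fourier_features(k, d, horizon):
    """d-dimensional cosine Fourier basis over the episode index k, scaled to [0, 1]."""
    return np.cos(np.pi * np.arange(d) * (k / horizon))

def forecast_performance(perf_history, k_future, d=5, n_steps=500, lr=1e-2):
    """Fit a Fourier-basis trend to past performance estimates and forecast
    performance at the future episode k_future (illustrative only)."""
    K = len(perf_history)
    X = np.stack([fourier_features(k, d, k_future) for k in range(K)])
    y = np.asarray(perf_history, dtype=float)
    w = np.zeros(d)
    for _ in range(n_steps):                    # N gradient steps (Table 3)
        grad = 2.0 * X.T @ (X @ w - y) / K      # gradient of the mean squared error
        w -= lr * grad                          # learning rate eta (Table 3)
    return float(fourier_features(k_future, d, k_future) @ w)

# Example: forecast performance one episode beyond a short, drifting history.
history = [0.50, 0.52, 0.55, 0.60, 0.63, 0.68, 0.71, 0.75]
print(forecast_performance(history, k_future=len(history) + 1))
```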
Funding
  • We are also thankful to Shiv Shankar and the anonymous reviewers for providing feedback that helped improve the paper. This work was supported in part by NSF Award #2018372 and gifts from Adobe Research
  • Further, this work was also supported in part by NSERC and CIFAR, particularly through funding the Alberta Machine Intelligence Institute (Amii) and the CCAI Chair program. Research reported in this paper was also sponsored in part by the CCDC Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA)
References
  • D. Abel, Y. Jinnai, S. Y. Guo, G. Konidaris, and M. Littman. Policy and value transfer in lifelong reinforcement learning. In International Conference on Machine Learning, pages 20–29, 2018.
  • J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 22–31. JMLR. org, 2017.
  • M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.
  • H. B. Ammar, R. Tutunov, and E. Eaton. Safe policy search for lifelong reinforcement learning with sublinear regret. In International Conference on Machine Learning, pages 2361–2369, 2015.
  • M. Bastani. Model-free intelligent diabetes management using machine learning. M.S. Thesis, University of Alberta, 2014.
  • Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: series B (Methodological), 57(1):289–300, 1995.
  • D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1st edition, 1996. ISBN 1886529108.
  • M. Blondel, O. Teboul, Q. Berthet, and J. Djolonga. Fast differentiable sorting and ranking. arXiv preprint arXiv:2002.08871, 2020.
  • P. Bloomfield. Fourier analysis of time series: An introduction. John Wiley & Sons, 2004.
  • E. Brunskill and L. Li. PAC-inspired option discovery in lifelong reinforcement learning. In International Conference on Machine Learning, pages 316–324, 2014.
  • J. Carpenter and J. Bithell. Bootstrap confidence intervals: When, which, what? A practical guide for medical statisticians. Statistics in Medicine, 19(9):1141–1164, 2000.
  • Y. Chandak, G. Theocharous, C. Nota, and P. S. Thomas. Lifelong learning with a changing action set. In AAAI, pages 3373–3380, 2020.
  • Y. Chandak, G. Theocharous, S. Shankar, M. White, S. Mahadevan, and P. S. Thomas. Optimizing for the future in non-stationary MDPs. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • P. Chen, T. Pedersen, B. Bak-Jensen, and Z. Chen. ARIMA-based time series model of stochastic wind power generation. IEEE transactions on power systems, 25(2):667–676, 2009.
  • S. X. Chen, W. Härdle, and M. Li. An empirical likelihood goodness-of-fit test for time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(3):663–678, 2003.
  • W. C. Cheung, D. Simchi-Levi, and R. Zhu. Drifting reinforcement learning: The blessing of (more) optimism in face of endogenous & exogenous dynamics. arXiv preprint arXiv:1906.02922v3, 2020.
  • Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 8092–8101, 2018.
  • M. Cuturi, O. Teboul, and J.-P. Vert. Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems, pages 6858–6868, 2019.
  • R. Davidson and E. Flachaire. The wild bootstrap, tamed at last. Citeseer, 1999.
  • R. Davidson and E. Flachaire. The wild bootstrap, tamed at last. Journal of Econometrics, 146 (1):162–169, 2008.
  • T. J. DiCiccio and B. Efron. Bootstrap confidence intervals. Statistical Science, pages 189–212, 1996.
  • A. Djogbenou, S. Gonçalves, and B. Perron. Bootstrap inference in regressions with estimated factors and serial correlation. Journal of Time Series Analysis, 36(3):481–502, 2015.
  • A. A. Djogbenou, J. G. MacKinnon, and M. Ø. Nielsen. Asymptotic theory and wild bootstrap inference with clustered errors. Journal of Econometrics, 212(2):393–412, 2019.
  • B. Efron and R. J. Tibshirani. An introduction to the bootstrap. CRC press, 1994.
  • M. Friedrich, S. Smeekes, and J.-P. Urbain. Autoregressive wild bootstrap inference for nonparametric trends. Journal of Econometrics, 214(1):81–109, 2020.
  • J. Garcıa and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
  • L. Godfrey and A. Tremayne. The wild bootstrap and heteroskedasticity-robust tests for serial correlation in dynamic regression models. Computational Statistics & Data Analysis, 49(2): 377–395, 2005.
  • P. Hall. Unusual properties of bootstrap confidence intervals in regression problems. Probability Theory and Related Fields, 81(2):247–273, 1989.
  • P. Hall. The bootstrap and Edgeworth expansion. Springer Science & Business Media, 2013.
  • R. Jagerman, I. Markov, and M. de Rijke. When people change their mind: Off-policy evaluation in non-stationary recommendation environments. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 447–455, 2019.
  • N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.
  • S. M. Jordan, D. Cohen, and P. S. Thomas. Using cumulative distribution based performance analysis to benchmark models. In NeurIPS 2018 Workshop on Critiquing and Correcting Trends in Machine Learning, 2018.
  • S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
  • A. Kazerouni, M. Ghavamzadeh, Y. A. Yadkori, and B. Van Roy. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 3910–3919, 2017.
  • M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2-3):209–232, 2002.
  • P. Kline and A. Santos. Higher order properties of the wild bootstrap under misspecification. Journal of Econometrics, 171(1):54–70, 2012.
  • B. P. Kovatchev, M. Breton, C. Dalla Man, and C. Cobelli. In-silico preclinical trials: A proof of concept in closed-loop control of type 1 diabetes, 2009.
  • R. Laroche, P. Trichelair, and R. T. d. Combes. Safe policy improvement with baseline bootstrapping. arXiv preprint arXiv:1712.06924, 2017.
  • E. Lecarpentier and E. Rachelson. Non-stationary Markov decision processes, a worst-case approach using model-based reinforcement learning. In Advances in Neural Information Processing Systems, pages 7214–7223, 2019.
  • E. Lecarpentier, D. Abel, K. Asadi, Y. Jinnai, E. Rachelson, and M. L. Littman. Lipschitz lifelong reinforcement learning. arXiv preprint arXiv:2001.05411, 2020.
  • R. Y. Liu et al. Bootstrap procedures under some non-iid models. The Annals of Statistics, 16 (4):1696–1708, 1988.
  • J. G. MacKinnon. Inference based on the wild bootstrap. In Seminar presentation given to Carleton University in September, 2012.
  • E. Mammen. Bootstrap and wild bootstrap for high dimensional linear models. The Annals of Statistics, pages 255–285, 1993.
  • C. D. Man, F. Micheletto, D. Lv, M. Breton, B. Kovatchev, and C. Cobelli. The UVA/PADOVA type 1 diabetes simulator: New features. Journal of Diabetes Science and Technology, 8(1): 26–34, 2014.
  • B. Metevier, S. Giguere, S. Brockman, A. Kobren, Y. Brun, E. Brunskill, and P. S. Thomas. Offline contextual bandits with high probability fairness guarantees. In Advances in Neural Information Processing Systems, pages 14893–14904, 2019.
  • J. Pineau, A. Guez, R. Vincent, G. Panuccio, and M. Avoli. Treating epilepsy via adaptive neurostimulation: A reinforcement learning approach. International journal of neural systems, 19(04):227–240, 2009.
  • M. Pirotta, M. Restelli, A. Pecorino, and D. Calandriello. Safe policy iteration. In International Conference on Machine Learning, pages 307–315, 2013.
  • D. Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
  • B. Ravindran and A. G. Barto. Approximate homomorphisms: A framework for non-exact minimization in Markov decision processes. 2004.
  • S. Saria. Individualized sepsis treatment using reinforcement learning. Nature Medicine, 24 (11):1641–1642, 2018.
  • S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
  • Student. The probable error of a mean. Biometrika, pages 1–25, 1908.
  • R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Personalized ad recommendation systems for life-time value optimization with guarantees. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  • G. Theocharous, Y. Chandak, P. S. Thomas, and F. de Nijs. Reinforcement learning for strategic recommendations. arXiv preprint arXiv:2009.07346, 2020.
  • P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
  • P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, pages 2380–2388, 2015.
  • P. S. Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.
  • P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • P. S. Thomas, G. Theocharous, M. Ghavamzadeh, I. Durugkar, and E. Brunskill. Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In Twenty-Ninth IAAI Conference, 2017.
  • P. S. Thomas, B. Castro da Silva, A. G. Barto, S. Giguere, Y. Brun, and E. Brunskill. Preventing undesirable behavior of intelligent machines. Science, 366(6468):999–1004, 2019.
  • L. Wasserman. All of statistics: A concise course in statistical inference. Springer Science & Business Media, 2013.
  • W. Whitt. Approximations of dynamic programs, i. Mathematics of Operations Research, 3(3): 231–243, 1978.
  • V. Wieland and M. Wolters. Forecasting and policy making. In Handbook of economic forecasting, volume 2, pages 239–325.
  • C.-F. J. Wu et al. Jackknife, bootstrap and other resampling methods in regression analysis. the Annals of Statistics, 14(4):1261–1295, 1986.
  • D. Wu, X. Chen, X. Yang, H. Wang, Q. Tan, X. Zhang, J. Xu, and K. Gai. Budget constrained bidding by model-free reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1443–1451, 2018.
  • A. Xie, J. Harrison, and C. Finn. Deep reinforcement learning amidst lifelong non-stationarity. arXiv preprint arXiv:2006.10701, 2020.
  • J. Xie. Simglucose v0.2.1 (2018), 2019. URL https://github.com/jxx123/simglucose.
  • J. Zhang and K. Cho. Query-efficient imitation learning for end-to-end autonomous driving. arXiv preprint arXiv:1605.06450, 2016.