Optimizing for the Future in Non-Stationary MDPs

Yash Chandak
Georgios Theocharous
Shiv Shankar

ICML, pp. 1414-1425, 2020.

TL;DR: We present a policy gradient-based algorithm that combines counter-factual reasoning with curve-fitting to proactively search for a good policy for future Markov decision processes.

Abstract:

Most reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process is stationary. However, in many real-world applications, this assumption is violated, and using existing algorithms may result in a performance lag. To proactively …

Introduction
  • Policy optimization algorithms in RL are promising for obtaining general purpose control algorithms.
  • They are designed for Markov decision processes (MDPs), which model a large class of problems (Sutton & Barto, 2018).
  • Most existing algorithms assume that the environment remains stationary over time.
  • This assumption is often violated in practical problems of interest.
  • Tires suffer from wear and tear, leading to …
Highlights
  • Policy optimization algorithms in RL are promising for obtaining general purpose control algorithms
  • They are designed for Markov decision processes (MDPs), which model a large class of problems (Sutton & Barto, 2018)
  • In this paper we present a policy gradient-based approach to search for a policy that maximizes the forecasted future performance when the environment is non-stationary
  • FTRL-PG has a slight edge over ONPG when the environment is stationary, as all the past data is directly indicative of future Markov decision processes
  • We present a policy gradient-based algorithm that combines counter-factual reasoning with curve-fitting to proactively search for a good policy for future Markov decision processes (a minimal sketch of this idea follows this list)
  • Our method provides a single solution for mitigating performance lag and being data-efficient
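The code below is a minimal, illustrative sketch of one natural reading of the highlights above, not the authors' implementation: use importance sampling to estimate, counter-factually, how the current policy would have performed in each past episode, fit a least-squares curve to those estimates over the episode index, and treat the extrapolated value as a forecast of future performance. Names such as policy_logprob, behavior_logprobs, and poly_degree are illustrative assumptions.

```python
import numpy as np

def counterfactual_estimates(episodes, policy_logprob):
    """Ordinary importance-sampling estimates of the current policy's return
    on each past episode (counter-factual evaluation of old data).

    episodes: list of dicts with keys 'states', 'actions', 'rewards', and
              'behavior_logprobs' (log-probabilities under the policy that
              actually collected the data).
    policy_logprob: function (state, action) -> log pi_theta(action | state)
                    for the policy being evaluated.
    """
    estimates = []
    for ep in episodes:
        # Product of per-step importance ratios, computed in log space.
        log_w = sum(policy_logprob(s, a) - lb
                    for s, a, lb in zip(ep['states'], ep['actions'],
                                        ep['behavior_logprobs']))
        estimates.append(np.exp(log_w) * sum(ep['rewards']))
    return np.array(estimates)

def forecast_next_performance(estimates, poly_degree=2):
    """Fit a least-squares polynomial over episode indices 1..k and
    extrapolate one step ahead as the forecast for episode k + 1."""
    k = len(estimates)
    coeffs = np.polyfit(np.arange(1, k + 1), estimates, deg=poly_degree)
    return np.polyval(coeffs, k + 1)
```

In the paper, the forecasted performance is then maximized with policy-gradient updates, so the gradient of the forecast flows back through the counter-factual estimates; in practice that step is most conveniently done with automatic differentiation rather than the plain NumPy code above.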
Results
  • In the non-stationary recommender system, the exact value of Jk∗ is available from the simulator, so the authors can compute the true value of regret (the regret definition used in this case is written out after this list).
  • For the non-stationary goal reacher and diabetes treatment environments, Jk∗ is not known for any k, so the authors use a surrogate measure of regret.
  • It is interesting to note that while FTRL-PG works best in the stationary setting for the recommender system and the goal reacher task, it is not the best in the diabetes treatment task, as it can suffer from high variance.
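For reference, a standard form of the regret discussed above, for the case where Jk∗ is known, is written out below; here J_k^* is the best achievable performance in episode k's MDP and π_k is the policy deployed in that episode. The normalization by the number of episodes K is an assumption about presentation, not a claim about the paper's exact definition.

```latex
\text{Regret} \;=\; \frac{1}{K}\sum_{k=1}^{K}\left(J_k^{*} - J_k(\pi_k)\right)
```

In the goal reacher and diabetes treatment domains, where J_k^* is unavailable, the same quantity is reported with a computable surrogate in place of J_k^*.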
Conclusion
  • The authors presented a policy gradient-based algorithm that combines counter-factual reasoning with curve-fitting to proactively search for a good policy for future MDPs. Irrespective of whether the environment is stationary or non-stationary, the proposed method can leverage all the past data, and in non-stationary settings it can proactively optimize for future performance as well.
  • Keeping λ too high prevents the policy from adapting quickly.
  • While the authors resorted to hyper-parameter search, leveraging methods that adapt λ automatically might be fruitful (Haarnoja et al., 2018); one such adjustment scheme is sketched below.
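The reference to Haarnoja et al. (2018) suggests that λ here weights an entropy regularizer in the objective. Below is a minimal sketch of one way such a coefficient can be adapted automatically, following the temperature-adjustment rule from that work; it is an illustrative adaptation, not part of the paper's algorithm, and target_entropy and lr are assumed hyper-parameters.

```python
import numpy as np

def update_entropy_coef(log_lam, action_logprobs, target_entropy, lr=1e-3):
    """One gradient step on log(lambda), in the style of Haarnoja et al. (2018),
    pushing the policy's entropy toward a target value.

    log_lam:         current log(lambda); optimizing in log space keeps lambda > 0.
    action_logprobs: array of log pi_theta(a | s) for actions sampled from the
                     current policy.
    target_entropy:  desired policy entropy.
    """
    lam = np.exp(log_lam)
    # Loss: E[-lambda * (log pi(a|s) + target_entropy)]; its gradient with
    # respect to log(lambda) is -lambda * mean(log pi + target_entropy).
    grad = -lam * np.mean(np.asarray(action_logprobs) + target_entropy)
    return log_lam - lr * grad  # gradient-descent step on the loss
```

When the policy's entropy falls below the target, this update increases λ (strengthening the entropy bonus) and decreases it otherwise, which is the trade-off the conclusion alludes to.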
Related work
  • The problem of non-stationarity has a long history that cannot be reviewed thoroughly here. We briefly touch upon the most relevant work and defer a more detailed literature review to the appendix. A more exhaustive survey can be found in the work by Padakandla (2020).

    Perhaps the work most closely related to ours is that of Al-Shedivat et al. (2017). They consider a setting where an agent is required to solve test tasks that have different transition dynamics than the training tasks. Using meta-learning, they aim to use the training tasks to find an initialization vector for the policy parameters that can be quickly fine-tuned when facing tasks in the test set. In many real-world problems, however, access to such independent training tasks may not be available a priori. In this work, we are interested in the continually changing setting where there is no boundary between training and testing tasks. As such, we show how their proposed online adaptation technique, which fine-tunes parameters by discarding past data and only using samples observed online, can create a performance lag and can therefore be data-inefficient. In settings where training and testing tasks do exist, our method can be leveraged to better adapt during test time, starting from any desired parameter vector.
Funding
  • The research was later supported by generous gifts from Adobe Research
  • The authors also thank Jordan and Chris Nota for insightful discussions and for providing valuable feedback. Research reported in this paper was also sponsored in part by the CCDC Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196 (ARL IoBT CRA)
Reference
  • Abbasi, Y., Bartlett, P. L., Kanade, V., Seldin, Y., and Szepesvári, C. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems, pp. 2508–2516, 2013.
  • Abdallah, S. and Kaisers, M. Addressing environment nonstationarity by repeating q-learning updates. The Journal of Machine Learning Research, 2016.
  • Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuous adaptation via metalearning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.
  • Foster, D. J., Li, Z., Lykouris, T., Sridharan, K., and Tardos, E. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, pp. 4734–4742, 2016.
  • Bastani, M. Model-free intelligent diabetes management using machine learning. M.S. Thesis, University of Alberta, 2014.
  • Besbes, O., Gur, Y., and Zeevi, A. Stochastic multi-armedbandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pp. 199–207, 2014.
  • Bishop, C. M. Pattern recognition and machine learning. springer, 2006.
  • Bowling, M. Convergence and no-regret in multiagent learning. In Advances in Neural Information Processing Systems, pp. 209–216, 2005.
  • Cheevaprawatdomrong, T., Schochetman, I. E., Smith, R. L., and Garcia, A. Solution and forecast horizons for infinitehorizon nonhomogeneous Markov decision processes. Mathematics of Operations Research, 32(1):51–72, 2007.
  • Cheung, W. C., Simchi-Levi, D., and Zhu, R. Reinforcement learning under drift. arXiv preprint arXiv:1906.02922, 2019.
  • Choi, S. P., Yeung, D.-Y., and Zhang, N. L. An environment model for nonstationary reinforcement learning. In Advances in Neural Information Processing Systems, pp. 987–993, 2000.
  • Conitzer, V. and Sandholm, T. Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1-2):23–43, 2007.
  • Cuzick, J. A strong law for weighted sums of i.i.d. random variables. Journal of Theoretical Probability, 8(3):625– 641, 1995.
  • Even-Dar, E., Kakade, S. M., and Mansour, Y. Experts in a Markov decision process. In Advances in Neural Information Processing Systems, pp. 401–408, 2005.
  • Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. arXiv preprint arXiv:1902.08438, 2019.
  • Gajane, P., Ortner, R., and Auer, P. A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. arXiv preprint arXiv:1805.10066, 2018.
  • Garcia, A. and Smith, R. L. Solving nonstationary infinite horizon dynamic optimization problems. Journal of Mathematical Analysis and Applications, 244(2):304– 317, 2000.
  • Ghate, A. and Smith, R. L. A linear programming approach to nonstationary infinite-horizon Markov decision processes. Operations Research, 61(2):413–425, 2013.
  • Greene, W. H. Econometric analysis. Pearson Education India, 2003.
  • Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov): 1471–1530, 2004.
  • Guo, Z., Thomas, P. S., and Brunskill, E. Using options and covariance testing for long horizon off-policy policy evaluation. In Advances in Neural Information Processing Systems, pp. 2492–2501, 2017.
  • Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
  • Hachiya, H., Sugiyama, M., and Ueda, N. Importanceweighted least-squares probabilistic classifier for covariate shift adaptation with application to human activity recognition. Neurocomputing, 80:93–101, 2012.
  • Hopp, W. J., Bean, J. C., and Smith, R. L. A new optimality criterion for nonhomogeneous Markov decision processes. Operations Research, 35(6):875–883, 1987.
  • Jacobsen, A., Schlegel, M., Linke, C., Degris, T., White, A., and White, M. Meta-descent for online, continual prediction. In AAAI Conference on Artificial Intelligence, 2019.
  • Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. Learning with opponentlearning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 122–130. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • Jagerman, R., Markov, I., and de Rijke, M. When people change their mind: Off-policy evaluation in nonstationary recommendation environments. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, Melbourne, VIC, Australia, February 11-15, 2019, 2019.
  • Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.
  • Jong, N. K. and Stone, P. Bayesian models of nonstationary Markov decision processes. Planning and Learning in A Priori Unknown or Dynamic Domains, pp. 132, 2005.
  • Kearney, A., Veeriah, V., Travnik, J. B., Sutton, R. S., and Pilarski, P. M. TIDBD: Adapting temporal-difference step-sizes through stochastic meta-descent. arXiv preprint arXiv:1804.03334, 2018.
  • Lecarpentier, E. and Rachelson, E. Non-stationary Markov decision processes a worst-case approach using model-based reinforcement learning. arXiv preprint arXiv:1904.10090, 2019.
  • Levine, N., Crammer, K., and Mannor, S. Rotting bandits. In Advances in Neural Information Processing Systems, pp. 3074–3083, 2017.
  • Li, C. and de Rijke, M. Cascading non-stationary bandits: Online learning to rank in the non-stationary cascade model. arXiv preprint arXiv:1905.12370, 2019.
  • Li, Y., Zhong, A., Qu, G., and Li, N. Online Markov decision processes with time-varying transition probabilities and rewards. In Real-world Sequential Decision Making workshop at ICML 2019, 2019.
  • Lu, K., Mordatch, I., and Abbeel, P. Adaptive online planning for continual lifelong learning. arXiv preprint arXiv:1912.01188, 2019.
  • Mahmood, A. R., van Hasselt, H. P., and Sutton, R. S. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pp. 3014–3022, 2014.
  • Mahmud, M. and Ramamoorthy, S. Learning in nonstationary mdps as transfer learning. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pp. 1259–1260. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
  • Man, C. D., Micheletto, F., Lv, D., Breton, M., Kovatchev, B., and Cobelli, C. The UVA/PADOVA type 1 diabetes simulator: New features. Journal of Diabetes Science and Technology, 8(1):26–34, 2014.
  • Mohri, M. and Yang, S. Accelerating online convex optimization via adaptive prediction. In Artificial Intelligence and Statistics, pp. 848–856, 2016.
  • Moore, B. L., Pyeatt, L. D., Kulkarni, V., Panousis, P., Padrez, K., and Doufas, A. G. Reinforcement learning for closed-loop propofol anesthesia: A study in human volunteers. The Journal of Machine Learning Research, 15(1):655–696, 2014.
  • Garivier, A. and Moulines, E. On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415, 2008.
  • Nagabandi, A., Clavera, I., Liu, S., Fearing, R. S., Abbeel, P., Levine, S., and Finn, C. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347, 2018a.
  • Nagabandi, A., Finn, C., and Levine, S. Deep online learning via meta-learning: Continual adaptation for modelbased rl. arXiv preprint arXiv:1812.07671, 2018b.
  • Ornik, M. and Topcu, U. Learning and planning for time-varying mdps using maximum likelihood estimation. arXiv preprint arXiv:1911.12976, 2019.
  • Padakandla, S. A survey of reinforcement learning algorithms for dynamically varying environments. arXiv preprint arXiv:2005.10619, 2020.
  • Padakandla, S., Prabuchandran, K. J., and Bhatnagar, S. Reinforcement learning in non-stationary environments. CoRR, abs/1905.03970, 2019.
  • Precup, D. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, 2000.
  • Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. arXiv preprint arXiv:1208.3728, 2013.
  • Ring, M. B. Continual learning in reinforcement environments. PhD thesis, University of Texas at Austin, Texas 78712, 1994.
  • Rosenbaum, P. R. and Rubin, D. B. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
  • Schmidhuber, J. A general method for incremental selfimprovement and multi-agent learning. In Evolutionary Computation: Theory and Applications, pp. 81–123. World Scientific, 1999.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Seznec, J., Locatelli, A., Carpentier, A., Lazaric, A., and Valko, M. Rotting bandits are no harder than stochastic ones. arXiv preprint arXiv:1811.11043, 2018.
  • Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
  • Singh, S., Kearns, M., and Mansour, Y. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence, pp. 541–548. Morgan Kaufmann Publishers Inc., 2000.
  • Yang, S. and Mohri, M. Optimistic bandit convex optimization. In Advances in Neural Information Processing Systems, pp. 2297–2305, 2016.
  • Yu, J. Y. and Mannor, S. Online learning in Markov decision processes with arbitrarily changing rewards and transitions. In 2009 International Conference on Game Theory for Networks, pp. 314–322. IEEE, 2009.
  • Zhang, C. and Lesser, V. Multi-agent learning with policy prediction. In Twenty-fourth AAAI conference on artificial intelligence, 2010.
  • Sinha, S. and Ghate, A. Policy iteration for robust nonstationary Markov decision processes. Optimization Letters, 10(8):1613–1628, 2016.
  • Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
  • Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.
  • Thomas, P. S. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.
  • Thomas, P. S., Theocharous, G., Ghavamzadeh, M., Durugkar, I., and Brunskill, E. Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In Twenty-Ninth Innovative Applications of Artificial Intelligence Conference, 2017.
  • Wagener, N., Cheng, C.-A., Sacks, J., and Boots, B. An online learning approach to model predictive control. arXiv preprint arXiv:1902.08967, 2019.
  • Wang, J.-K., Li, X., and Li, P. Optimistic adaptive acceleration for optimization. arXiv preprint arXiv:1903.01435, 2019a.
  • Wang, L., Zhou, H., Li, B., Varshney, L. R., and Zhao, Z. Be aware of non-stationarity: Nearly optimal algorithms for piecewise-stationary cascading bandits. arXiv preprint arXiv:1909.05886, 2019b.
  • Xie, A., Harrison, J., and Finn, C. Deep reinforcement learning amidst lifelong non-stationarity. arXiv preprint arXiv:2006.10701, 2020.
  • Xie, J. Simglucose v0.2.1 (2018), 2019. URL https://github.com/jxx123/simglucose.