# Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

ICML, pp. 4860-4869, 2020.

Abstract:

We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves O(L|X|√(|A|T)) regret with high probability, where L is the horizon, |X| the number of states, |A| the number of actions, and T the number of episodes.


Introduction

- Reinforcement learning studies the problem where a learner interacts with the environment sequentially and aims to improve her strategy over time.
- The majority of the literature in learning MDPs assumes stationary losses, that is, the losses observed for a specific state-action pair follow a fixed and unknown distribution.
- To better capture applications with non-stationary or even adversarial losses, the works of Even-Dar et al. (2009) and Yu et al. (2009) are among the first to study the problem of learning adversarial MDPs, where the losses can change arbitrarily between episodes.

Highlights

- Reinforcement learning studies the problem where a learner interacts with the environment sequentially and aims to improve her strategy over time
- The environment dynamics are usually modeled as a Markov Decision Process (MDP) with a fixed and unknown transition function
- Within each episode the learner sequentially observes her current state, selects an action, suffers and observes the loss corresponding to the chosen state-action pair, and transitions to the next state according to the underlying transition function
- The majority of the literature in learning Markov Decision Process assumes stationary losses, that is, the losses observed for a specific state-action pair follow a fixed and unknown distribution
- In this work, we propose the first efficient algorithm with O(√T) regret for learning Markov Decision Processes with an unknown transition function, adversarial losses, and bandit feedback
- Our main algorithmic contribution is to propose a tighter confidence bound together with a novel optimistic loss estimator based on upper occupancy bounds
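The episodic interaction protocol described above can be sketched as a simulation loop. All sizes and names here are illustrative (not from the paper); the point is that the learner only observes losses for the state-action pairs it actually visits, and never sees the transition function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy episodic MDP: horizon L, small state and action spaces.
L, n_states, n_actions = 3, 4, 2
# P[l, x, a] is a distribution over next states (unknown to the learner).
P = rng.dirichlet(np.ones(n_states), size=(L, n_states, n_actions))

def run_episode(policy, loss):
    """One episode: observe state, act, suffer a bandit-feedback loss, transit.

    policy: (layer, state) -> action
    loss:   adversarially chosen loss table of shape (L, n_states, n_actions)
    Returns the total loss suffered. Under bandit feedback the learner only
    ever observes loss[l, x, a] for the pairs (x, a) it actually visits.
    """
    x, total = 0, 0.0
    for l in range(L):
        a = policy(l, x)
        total += loss[l, x, a]                  # suffered and observed
        x = rng.choice(n_states, p=P[l, x, a])  # unknown transition dynamics
    return total

uniform = lambda l, x: int(rng.integers(n_actions))
losses = rng.uniform(size=(L, n_states, n_actions))  # one episode's losses
ep_loss = run_episode(uniform, losses)
```

Since each per-step loss lies in [0, 1], the total loss of any episode lies in [0, L], which is why L appears as a scaling factor in the regret bounds.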

Results

- The authors' main contribution significantly improves on (Rosenberg & Mansour, 2019b).

Conclusion

- In this work, the authors propose the first efficient algorithm with O(√T) regret for learning MDPs with an unknown transition function, adversarial losses, and bandit feedback.
- The authors' main algorithmic contribution is to propose a tighter confidence bound together with a novel optimistic loss estimator based on upper occupancy bounds.
- One natural open problem in this direction is to close the gap between the regret upper bound O(L|X|√(|A|T)) and the lower bound of Ω(L√(|X||A|T)) (Jin et al., 2018), which exists even for the full-information setting.
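The "optimistic loss estimator based on upper occupancy bounds" can be illustrated with a minimal sketch. The function name, argument names, and the constant values below are hypothetical; the idea is that standard importance weighting would divide the observed loss by the true occupancy q(x, a), which is unknown when the transition function is unknown, so one divides instead by an upper bound u(x, a) ≥ q(x, a) plus an implicit-exploration term γ:

```python
def optimistic_loss_estimate(observed_loss, visited, upper_occupancy, gamma):
    """Sketch of an upper-occupancy-bound loss estimator (names illustrative).

    Dividing by an upper bound on the visitation probability, rather than
    the unknown true occupancy, biases the estimate downward (optimistic),
    while the gamma term keeps its variance under control.
    """
    return observed_loss * visited / (upper_occupancy + gamma)

# Suppose the pair was visited, with observed loss 0.8, true occupancy 0.25,
# upper occupancy bound 0.4, and gamma = 0.1:
est = optimistic_loss_estimate(0.8, 1, 0.4, 0.1)    # 0.8 / 0.5 = 1.6
unvisited = optimistic_loss_estimate(0.8, 0, 0.4, 0.1)  # 0.0
```

In expectation over the visit indicator, the estimate here is 0.25 × 1.6 = 0.4 ≤ 0.8, i.e. the estimator underestimates the true loss, which is the optimism that drives exploration.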


Related Work

- Stochastic losses. Learning MDPs with stochastic losses and bandit feedback is relatively well-studied for the tabular case (that is, a finite number of states and actions). For example, in the episodic setting, using our notation, the UCRL2 algorithm of Jaksch et al. (2010) achieves O(√(L³|X|²|A|T)) regret, and the UCBVI algorithm of Azar et al. (2017) achieves the optimal bound O(L√(|X||A|T)), both of which are model-based algorithms and construct confidence sets for both the transition function and the loss function. The recent work of Jin et al. (2018) achieves a suboptimal bound O(√(L³|X||A|T)) via an optimistic Q-learning algorithm that is model-free. Besides the episodic setting, other setups such as discounted losses or the infinite-horizon average-loss setting have also been heavily studied; see for example (Ouyang et al., 2017; Fruit et al., 2018; Zhang & Ji, 2019; Wei et al., 2019; Wang et al., 2019) for some recent works.
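The confidence sets that these model-based algorithms build around the empirical transition estimates are, in the simplest form, L1 balls whose radius shrinks with the visit count. A minimal sketch of a Hoeffding-style radius follows; the constants are illustrative, not the exact ones used by UCRL2 or this paper:

```python
import math

def transition_confidence_radius(visit_count, n_states, delta):
    """Illustrative L1 confidence radius for an empirical transition
    distribution at one state-action pair, shrinking as O(1/sqrt(n)).
    Any distribution within this radius of the empirical estimate is
    considered a plausible transition model."""
    n = max(1, visit_count)
    return math.sqrt(2 * n_states * math.log(1 / delta) / n)

radius_100 = transition_confidence_radius(visit_count=100, n_states=4, delta=0.01)
radius_400 = transition_confidence_radius(visit_count=400, n_states=4, delta=0.01)
```

Quadrupling the visit count halves the radius, which is the mechanism by which the plausible-model set, and hence the optimism, contracts over episodes.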

Adversarial losses. Based on whether the transition function is known and whether the feedback is full-information or bandit, we discuss four categories separately.

Funding

- HL is supported by NSF Awards IIS-1755781 and IIS-1943607
- SS is partially supported by NSF-BIGDATA Award IIS-1741341 and an NSF-CAREER grant Award IIS-1846088
- TY is partially supported by NSF BIGDATA grant IIS-1741341

References

- Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Improved algorithms for linear stochastic bandits. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pp. 2312–2320, 2011.
- Abernethy, J. D., Hazan, E., and Rakhlin, A. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory, pp. 263–274, 2008.
- Allenberg, C., Auer, P., Gyorfi, L., and Ottucsak, G. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In Proceedings of the 17th international conference on Algorithmic Learning Theory, pp. 229–243, 2006.
- Altman, E. Constrained Markov decision processes, volume 7. CRC Press, 1999.
- Arora, R., Dekel, O., and Tewari, A. Deterministic MDPs with adversarial rewards and bandit feedback. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pp. 93–101, 2012.
- Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002a.
- Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 2002b.
- Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 263–272, 2017.
- Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 19–26, 2011.
- Burnetas, A. N. and Katehakis, M. N. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.
- Cheung, W. C., Simchi-Levi, D., and Zhu, R. Reinforcement learning under drift. arXiv preprint arXiv:1906.02922, 2019.
- Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011.
- Dekel, O. and Hazan, E. Better rates for any adversarial deterministic MDP. In Proceedings of the 30th International Conference on Machine Learning, pp. 675–683, 2013.
- Dekel, O., Ding, J., Koren, T., and Peres, Y. Bandits with switching costs: T^{2/3} regret. In Proceedings of the 46th annual ACM Symposium on Theory of Computing, pp. 459–467, 2014.
- Even-Dar, E., Kakade, S. M., and Mansour, Y. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
- Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 1578– 1586, 2018.
- Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
- Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 4868–4878, 2018.
- Lykouris, T., Simchowitz, M., Slivkins, A., and Sun, W. Corruption robust exploration in episodic reinforcement learning. arXiv preprint arXiv:1911.08689, 2019.
- Maurer, A. and Pontil, M. Empirical bernstein bounds and sample variance penalization. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
- Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pp. 3168–3176, 2015.
- Neu, G., Gyorgy, A., and Szepesvari, C. The online loopfree stochastic shortest-path problem. In Proceedings of the 23rd Annual Conference on Learning Theory, pp. 231–243, 2010.
- Neu, G., Gyorgy, A., and Szepesvari, C. The adversarial stochastic shortest path problem with unknown transition probabilities. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pp. 805–813, 2012.
- Neu, G., Antos, A., Gyorgy, A., and Szepesvari, C. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, pp. 676–691, 2014.
- Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. Learning unknown Markov decision processes: a Thompson sampling approach. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1333–1342, 2017.
- Rosenberg, A. and Mansour, Y. Online convex optimization in adversarial Markov decision processes. In Proceedings of the 36th International Conference on Machine Learning, pp. 5478–5486, 2019a.
- Rosenberg, A. and Mansour, Y. Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems, 2019b.
- Wang, Y., Dong, K., Chen, X., and Wang, L. Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. In International Conference on Learning Representations, 2019.
- Wei, C.-Y., Jafarnia-Jahromi, M., Luo, H., Sharma, H., and Jain, R. Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. arXiv preprint arXiv:1910.07072, 2019.
- Yu, J. Y. and Mannor, S. Arbitrarily modulated Markov decision processes. In Proceedings of the 48th IEEE Conference on Decision and Control, pp. 2946–2953, 2009.
- Yu, J. Y., Mannor, S., and Shimkin, N. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.
- Zhang, Z. and Ji, X. Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems, 2019.
- Zimin, A. and Neu, G. Online learning in episodic Markovian decision processes by relative entropy policy search. In Proceedings of the 26th International Conference on Neural Information Processing Systems, pp. 1583–1591, 2013.
