Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

ICML, pp. 4860–4869, 2020


Abstract

We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves O(L|X|√(|A|T)) regret with high probability, where L is the horizon, |X| the number of states, |A| the number of actions, and T the number of episodes ...
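
For context, the regret in the abstract compares the learner's total expected loss to that of the best fixed policy in hindsight. Below is a schematic LaTeX statement of that quantity in standard notation; it is a sketch based on the abstract only, and the paper's exact definition (the precise expectation and policy class) should be taken from the paper itself.

    % Schematic regret definition for episodic adversarial MDPs (standard form,
    % not copied verbatim from the paper): pi_1, ..., pi_T are the learner's
    % policies and ell_t is the loss function chosen by the adversary in episode t.
    \documentclass{article}
    \usepackage{amsmath}
    \usepackage{amssymb}
    \begin{document}
    \[
      R_T \;=\; \sum_{t=1}^{T} \ell_t(\pi_t) \;-\; \min_{\pi} \sum_{t=1}^{T} \ell_t(\pi),
      \qquad
      \ell_t(\pi) \;=\; \mathbb{E}\!\left[\sum_{k=0}^{L-1} \ell_t(x_k, a_k) \,\middle|\, \pi, P\right].
    \]
    % The algorithm's guarantee, in the notation of the abstract:
    \[
      R_T \;=\; O\!\left(L\,|X|\sqrt{|A|\,T}\right) \quad \text{with high probability.}
    \]
    \end{document}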

Code

Data

Introduction
  • Reinforcement learning studies the problem where a learner interacts with the environment sequentially and aims to improve her strategy over time.
  • The majority of the literature in learning MDPs assumes stationary losses, that is, the losses observed for a specific state-action pair follow a fixed and unknown distribution.
  • To better capture applications with non-stationary or even adversarial losses, the works of Even-Dar et al. (2009) and Yu et al. (2009) are among the first to study the problem of learning adversarial MDPs, where the losses can change arbitrarily between episodes.
Highlights
  • Reinforcement learning studies the problem where a learner interacts with the environment sequentially and aims to improve her strategy over time
  • The environment dynamics are usually modeled as a Markov Decision Process (MDP) with a fixed and unknown transition function
  • Within each episode the learner sequentially observes her current state, selects an action, suffers and observes the loss corresponding to the chosen state-action pair, and transitions to the next state according to the underlying transition function
  • The majority of the literature in learning Markov decision processes assumes stationary losses, that is, the losses observed for a specific state-action pair follow a fixed and unknown distribution
  • In this work, we propose the first efficient algorithm with O(√T) regret for learning Markov decision processes with unknown transition function, adversarial losses, and bandit feedback
  • Our main algorithmic contribution is to propose a tighter confidence bound together with a novel optimistic loss estimator based on upper occupancy bounds
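
To make the last two highlights more concrete, here is a minimal Python sketch of one episode under bandit feedback and of an importance-weighted loss estimator that divides by an upper occupancy bound. It is an illustration only, not the authors' algorithm: the environment interface (reset/step), the function names, and the exploration constant gamma are assumptions made for this example, and the computation of the upper occupancy bounds u from the transition confidence set, which is the technical core of the paper, is omitted.

    import numpy as np

    def run_episode(policy, env, num_states, num_actions):
        """Play one episode with bandit feedback.

        `env` is a hypothetical episodic environment with reset() -> state and
        step(action) -> (next_state, loss, done); only the loss of the visited
        state-action pair is revealed to the learner.
        """
        visited = np.zeros((num_states, num_actions))
        observed_loss = np.zeros((num_states, num_actions))
        x = env.reset()
        done = False
        while not done:
            a = np.random.choice(num_actions, p=policy[x])   # sample action from the current policy
            x_next, loss, done = env.step(a)                 # bandit feedback: one loss value revealed
            visited[x, a] = 1.0
            observed_loss[x, a] = loss
            x = x_next
        return visited, observed_loss

    def optimistic_loss_estimate(visited, observed_loss, u, gamma=0.01):
        """Importance-weighted loss estimate using an upper occupancy bound.

        `u[x, a]` is an upper bound on the probability that the current policy
        visits (x, a) under any transition function in the confidence set.
        Dividing by this upper bound (plus a small gamma) under-estimates the
        loss in expectation, i.e. the estimate is optimistic, which is what
        compensates for the unknown transition function.
        """
        return (visited * observed_loss) / (u + gamma)

In the paper, estimates of this form feed an online mirror descent update over occupancy measures; the sketch above isolates only the estimator itself.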
Results
Conclusion
  • In this work, the authors propose the first efficient algorithm with O(√T) regret for learning MDPs with unknown transition function, adversarial losses, and bandit feedback.
  • The authors' main algorithmic contribution is to propose a tighter confidence bound together with a novel optimistic loss estimator based on upper occupancy bounds.
  • One natural open problem in this direction is to close the gap between the regret upper bound O(L|X|√(|A|T)) and the lower bound Ω(L√(|X||A|T)) (Jin et al., 2018), which exists even for the full-information setting.
Summary
  • Introduction:

    Reinforcement learning studies the problem where a learner interacts with the environment sequentially and aims to improve her strategy over time.
  • The majority of the literature in learning MDPs assumes stationary losses, that is, the losses observed for a specific state-action pair follow a fixed and unknown distribution.
  • To better capture applications with non-stationary or even adversarial losses, the works of Even-Dar et al. (2009) and Yu et al. (2009) are among the first to study the problem of learning adversarial MDPs, where the losses can change arbitrarily between episodes.
  • Results:

    The authors' main contribution significantly improves on the result of Rosenberg & Mansour (2019b).
  • Conclusion:

    In this work, the authors propose the first efficient algorithm with O(√T) regret for learning MDPs with unknown transition function, adversarial losses, and bandit feedback.
  • The authors' main algorithmic contribution is to propose a tighter confidence bound together with a novel optimistic loss estimator based on upper occupancy bounds.
  • One natural open problem in this direction is to close the gap between the regret upper bound O(L|X|√(|A|T)) and the lower bound Ω(L√(|X||A|T)) (Jin et al., 2018), which exists even for the full-information setting.
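
For a sense of how large the remaining gap is, dividing the upper bound by the lower bound leaves a √|X| factor; the short calculation below is a back-of-the-envelope check, not a statement taken from the paper.

    % Ratio of the O(L|X|sqrt(|A|T)) upper bound to the Omega(L sqrt(|X||A|T))
    % lower bound: the remaining gap is a factor of sqrt(|X|).
    \documentclass{article}
    \usepackage{amsmath}
    \begin{document}
    \[
      \frac{L\,|X|\sqrt{|A|\,T}}{L\sqrt{|X|\,|A|\,T}}
      \;=\; \frac{|X|}{\sqrt{|X|}}
      \;=\; \sqrt{|X|}.
    \]
    \end{document}
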
Related Work
  • Stochastic losses. Learning MDPs with stochastic losses and bandit feedback is relatively well-studied for the tabular case (that is, a finite number of states and actions). For example, in the episodic setting, using our notation, the UCRL2 algorithm of Jaksch et al. (2010) achieves O(√(L^3|X|^2|A|T)) regret, and the UCBVI algorithm of Azar et al. (2017) achieves the optimal bound O(L√(|X||A|T)), both of which are model-based algorithms and construct confidence sets for both the transition function and the loss function. The recent work of Jin et al. (2018) achieves a suboptimal bound O(√(L^3|X||A|T)) via an optimistic Q-learning algorithm that is model-free. Besides the episodic setting, other setups such as discounted losses or the infinite-horizon average-loss setting have also been heavily studied; see for example (Ouyang et al., 2017; Fruit et al., 2018; Zhang & Ji, 2019; Wei et al., 2019; Wang et al., 2019) for some recent works.

    Adversarial losses. Based on whether the transition function is known and whether the feedback is full-information or bandit, we discuss four categories separately.
Funding
  • HL is supported by NSF Awards IIS-1755781 and IIS-1943607
  • SS is partially supported by NSF BIGDATA Award IIS-1741341 and NSF CAREER Award IIS-1846088
  • TY is partially supported by NSF BIGDATA grant IIS-1741341
References
  • Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Improved algorithms for linear stochastic bandits. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pp. 2312–2320, 2011.
  • Abernethy, J. D., Hazan, E., and Rakhlin, A. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory, pp. 263–274, 2008.
  • Allenberg, C., Auer, P., Gyorfi, L., and Ottucsak, G. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In Proceedings of the 17th international conference on Algorithmic Learning Theory, pp. 229–243, 2006.
  • Altman, E. Constrained Markov decision processes, volume 7. CRC Press, 1999.
  • Arora, R., Dekel, O., and Tewari, A. Deterministic MDPs with adversarial rewards and bandit feedback. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pp. 93–101, 2012.
  • Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002a.
  • Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 2002b.
  • Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 263–272, 2017.
  • Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 19–26, 2011.
  • Burnetas, A. N. and Katehakis, M. N. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.
  • Cheung, W. C., Simchi-Levi, D., and Zhu, R. Reinforcement learning under drift. arXiv preprint arXiv:1906.02922, 2019.
  • Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011.
  • Dekel, O. and Hazan, E. Better rates for any adversarial deterministic MDP. In Proceedings of the 30th International Conference on Machine Learning, pp. 675–683, 2013.
  • Dekel, O., Ding, J., Koren, T., and Peres, Y. Bandits with switching costs: T^{2/3} regret. In Proceedings of the 46th annual ACM symposium on Theory of computing, pp. 459–467, 2014.
  • Even-Dar, E., Kakade, S. M., and Mansour, Y. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
  • Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 1578–1586, 2018.
  • Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
  • Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 4868–4878, 2018.
  • Lykouris, T., Simchowitz, M., Slivkins, A., and Sun, W. Corruption robust exploration in episodic reinforcement learning. arXiv preprint arXiv:1911.08689, 2019.
  • Maurer, A. and Pontil, M. Empirical Bernstein bounds and sample variance penalization. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
  • Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pp. 3168–3176, 2015.
  • Neu, G., Gyorgy, A., and Szepesvari, C. The online loop-free stochastic shortest-path problem. In Proceedings of the 23rd Annual Conference on Learning Theory, pp. 231–243, 2010.
  • Neu, G., Gyorgy, A., and Szepesvari, C. The adversarial stochastic shortest path problem with unknown transition probabilities. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pp. 805–813, 2012.
  • Neu, G., Antos, A., Gyorgy, A., and Szepesvari, C. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, pp. 676–691, 2014.
  • Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. Learning unknown Markov decision processes: a Thompson sampling approach. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1333–1342, 2017.
  • Rosenberg, A. and Mansour, Y. Online convex optimization in adversarial Markov decision processes. In Proceedings of the 36th International Conference on Machine Learning, pp. 5478–5486, 2019a.
  • Rosenberg, A. and Mansour, Y. Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems, 2019b.
  • Wang, Y., Dong, K., Chen, X., and Wang, L. Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. In International Conference on Learning Representations, 2019.
  • Wei, C.-Y., Jafarnia-Jahromi, M., Luo, H., Sharma, H., and Jain, R. Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. arXiv preprint arXiv:1910.07072, 2019.
  • Yu, J. Y. and Mannor, S. Arbitrarily modulated Markov decision processes. In Proceedings of the 48th IEEE Conference on Decision and Control, pp. 2946–2953, 2009.
  • Yu, J. Y., Mannor, S., and Shimkin, N. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.
  • Zhang, Z. and Ji, X. Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems, 2019.
  • Zimin, A. and Neu, G. Online learning in episodic Markovian decision processes by relative entropy policy search. In Proceedings of the 26th International Conference on Neural Information Processing Systems, pp. 1583–1591, 2013.