Is Plug-in Solver Sample-Efficient for Feature-based Reinforcement Learning?

NeurIPS 2020 (2020)


It is believed that a model-based approach for reinforcement learning (RL) is the key to reducing sample complexity. However, the understanding of the sample optimality of model-based RL is still largely missing, even in the linear case. This work considers the sample complexity of finding an $\epsilon$-optimal policy in a Markov decision pr...

• Reinforcement learning (RL) [Sutton and Barto, 2018] is about learning to make optimal decisions in an unknown environment.
• In the tabular setting, where the state and action spaces S and A are finite, [Azar et al., 2013] shows that the value estimation of a plug-in approach is minimax sample-optimal.
• [Azar et al., 2013] proves a minimax sample complexity of $O(|S||A|/(\epsilon^2(1-\gamma)^3))$ for model-based value estimation via the total-variance technique.

• We show that the plug-in solver approach does achieve near-optimal sample complexity even in the feature-based setting, provided that the features are well conditioned.
• We show that under an anchor-state condition, where all features can be represented as convex combinations of some anchor-state features, an $\epsilon$-optimal policy can be obtained from an approximate model with only $O(K/(\epsilon^2(1-\gamma)^3))$ samples from the generative model, where K is the feature dimension, independent of the size of the state and action spaces.
• We extend our techniques to other settings, e.g., the finite-horizon Markov decision process (FHMDP) and two-player turn-based stochastic games (2-TBSG).
• This paper studies the sample complexity of the plug-in solver approach in feature-based MDPs, including discounted MDPs, finite-horizon MDPs, and stochastic games.
• The sample complexity we give for the finite-horizon MDP is $O(KH^4\epsilon^{-2})$, which has an extra H factor compared with the discounted case.
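The plug-in pipeline the bullets above describe (draw samples from a generative model, fit an empirical transition model, then plan in the empirical MDP with any solver) can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1; `sample_next_state`, `rewards`, and all other names are assumptions made for the sketch.

```python
import numpy as np

def plug_in_solver(sample_next_state, rewards, n_states, n_actions,
                   n_samples, gamma, n_iters=500):
    """Plug-in solver sketch: estimate the model, then run value iteration.

    sample_next_state(s, a) -> s' simulates one draw from the generative
    model; rewards has shape (n_states, n_actions).
    """
    # Empirical transition model: P_hat[s, a, s'] is the fraction of the
    # n_samples draws from (s, a) that landed in s'.
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                P_hat[s, a, sample_next_state(s, a)] += 1.0 / n_samples

    # Plan in the empirical MDP with standard value iteration.
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = rewards + gamma * (P_hat @ V)   # shape (n_states, n_actions)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V              # greedy policy and its value
```

The paper's point is about how large `n_samples` must be for the policy returned by such a plug-in procedure to be $\epsilon$-optimal in the true MDP.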

• The anchor-state assumption is the key to achieving minimax sample complexity; it is used in analyzing both the linear transition model and the linear Q model [Yang and Wang, Zanette et al., 2019].
• If the authors have N i.i.d. samples from each state-action pair in K, an unbiased estimate $\hat{P}$ of the transition model is obtained from Algorithm 1: $\hat{P}(s'|s,a) = \sum_{k \in K} \lambda_k^{s,a} \hat{P}_K(s'|s_k, a_k)$.
• The authors give a minimax sample complexity bound for Algorithm 1 for feature-based MDPs under the anchor-state assumption.
• (Sample complexity for DMDP) Suppose Assumption 1 is satisfied and the empirical model M is constructed as in Algorithm 1.
• Suppose the samples are from the state-action pairs in K and that $\hat{P}$ is the unbiased estimate of the transition probability matrix given by Algorithm 1, with $\{\lambda_k^{s,a}\}$ defined as in Algorithm 1.
• This assumption means that the selected state-action pairs in K can represent all features by linear combinations with bounded coefficients, which avoids error explosion in the iterative algorithm.
• (Value iteration solver for general linear MDP) Suppose Assumption 2 is satisfied and the empirical model M is constructed as in Algorithm 1.
• (Sample complexity for FHMDP) Suppose Assumption 1 is satisfied and the empirical model M is constructed as in Algorithm 1.
• (Sample complexity for 2-TBSG) Suppose Assumption 1 is satisfied and the empirical model M is constructed as in Algorithm 1.
• This paper studies the sample complexity of plug-in solver approach in feature-based MDPs, including discounted MDP, finite horizon MDP and stochastic games.
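As a minimal illustration of the anchor-state construction above, transitions are estimated only at the K anchor pairs, and every other pair's distribution is recovered as the convex combination $\hat{P}(\cdot|s,a) = \sum_k \lambda_k^{s,a}\, \hat{P}_K(\cdot|s_k,a_k)$. The array shapes and the inputs `P_hat_K` and `lambdas` below are assumptions for the sketch, not the paper's interface.

```python
import numpy as np

def extend_anchor_model(P_hat_K, lambdas):
    """Extend anchor-pair transition estimates to all state-action pairs.

    P_hat_K: (K, S) array; row k is the empirical transition distribution
             sampled at anchor pair (s_k, a_k).
    lambdas: (S, A, K) array; lambdas[s, a] are the convex-combination
             coefficients expressing feature phi(s, a) via anchor features.
    Returns the full (S, A, S) empirical transition model.
    """
    # P_hat[s, a, s'] = sum_k lambdas[s, a, k] * P_hat_K[k, s']
    return np.einsum('sak,kt->sat', lambdas, P_hat_K)
```

If each `lambdas[s, a]` is a probability vector (the anchor-state condition) and each row of `P_hat_K` sums to one, every `P_hat[s, a]` is again a valid distribution, so only K, rather than $|S||A|$, pairs ever need to be sampled.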

• This is the first result proving minimax sample complexity for the plug-in solver approach in feature-based MDPs. The authors hope that the new techniques in this work can be reused in more general settings and motivate breakthroughs in other domains.
• The authors conjecture that the plug-in solver approach should enjoy the optimal $O(KH^3\epsilon^{-2})$ complexity, matching model-free algorithms [Yang and Wang].
• One interesting problem is whether provably efficient model-based algorithms can be developed under general function approximation, as the construction of the empirical model seems difficult even under a linear Q-function assumption.

• Table 1: Sample complexity to compute an $\epsilon$-optimal policy with a generative model

• Generative Model There is a line of research focusing on improving the sample complexity with a generative model, e.g., [Kearns and Singh, 1999, Kakade et al., 2003, Azar et al., 2012, 2013, Sidford et al., 2018a,b, Yang and Wang, Sidford et al., Zanette et al., 2019, Li et al., 2020]. A classic algorithm in the generative-model setting is phased Q-learning [Kearns and Singh, 1999]. It uses $O(|S||A|/(\epsilon^2\,\mathrm{poly}(1-\gamma)))$ samples to find an $\epsilon$-optimal policy, which is sublinear in the model size $|S|^2|A|$. Sample complexity lower bounds for the generative-model setting have been established in [Azar et al., 2013, Yang and Wang, Sidford et al.]. In particular, Azar et al. [2013] gives the first tight lower bound for unstructured discounted MDPs. Later, this lower bound was generalized to feature-based MDPs and two-player turn-based stochastic games in [Yang and Wang, Sidford et al.]. [Azar et al., 2013] also proves a minimax sample complexity of $O(|S||A|/(\epsilon^2(1-\gamma)^3))$ for model-based value estimation via the total-variance technique. However, the sample complexities of value estimation and policy estimation differ by a factor of $O(1/(1-\gamma)^2)$ [Singh and Yee, 1994]. The first minimax policy estimation result is given in [Sidford et al., 2018a], which proposes a model-free algorithm known as Variance-Reduced Q-Value Iteration. This work has been extended to two-player turn-based stochastic games in [Sidford et al., Jia et al., 2019]. Recently, [Yang and Wang] developed a sample-optimal algorithm called Optimal Phased Parametric Q-Learning for feature-based RL. Their result requires $\epsilon \in (0, 1)$, while our result holds for $\epsilon \in (0, 1/\sqrt{1-\gamma})$. The plug-in solver approach is proved to be sample-optimal in the tabular case in [Agarwal et al., 2019], which develops the absorbing-MDP technique. However, their approach cannot be generalized to the linear transition model. A very recent paper [Li et al., 2020] develops a novel reward-perturbation technique to remove the constraint on $\epsilon$ in the tabular case.

• Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
• Alekh Agarwal, Sham Kakade, and Lin F Yang. On the optimality of sparse model-based planning for markov decision processes. arXiv preprint arXiv:1906.03804, 2019.
• Mohammad Gheshlaghi Azar, Rémi Munos, and Bert Kappen. On the sample complexity of reinforcement learning with a generative model. arXiv preprint arXiv:1206.6461, 2012.
• Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349, 2013.
• Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR.org, 2017.
• Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine learning, 22(1-3):33–57, 1996.
• Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
• Simon S Du, Yuping Luo, Ruosong Wang, and Hanrui Zhang. Provably efficient q-learning with function approximation via distribution shift error checking oracle. In Advances in Neural Information Processing Systems, pages 8058–8068, 2019.
• Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
• Yaqi Duan, Tracy Ke, and Mengdi Wang. State aggregation learning from markov transition data. In Advances in Neural Information Processing Systems, pages 4488–4497, 2019.
• Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1): 1–16, 2013.
• Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
• Zeyu Jia, Lin F Yang, and Mengdi Wang. Feature-based q-learning for two-player stochastic games. arXiv preprint arXiv:1906.00423, 2019.
• Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181–1189, 2015.
• Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low bellman rank are pac-learnable. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1704–1713. JMLR.org, 2017.
• Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
• Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
• Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.
• Sham Machandranath Kakade et al. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.
• Michael J Kearns and Satinder P Singh. Finite-sample convergence rates for q-learning and indirect algorithms. In Advances in neural information processing systems, pages 996–1002, 1999.
• Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. arXiv preprint arXiv:2005.12900, 2020.
• Yuan Li, Xiaodan Liang, Zhiting Hu, and Eric P Xing. Hybrid retrieval-generation reinforced agent for medical image report generation. In Advances in neural information processing systems, pages 1530–1540, 2018.
• Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
• Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
• Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pages 1466–1474, 2014.
• Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5186–5196, 2018a.
• Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770–787. SIAM, 2018b.
• Satinder P Singh and Richard C Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.
• Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Reinforcement learning with soft state aggregation. In Advances in neural information processing systems, pages 361–368, 1995.
• Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based rl in contextual decision processes: Pac bounds and exponential improvements over model-free approaches. arXiv preprint arXiv:1811.08540, 2018.
• Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
• Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al. Starcraft ii: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.
• Ruosong Wang, Ruslan Salakhutdinov, and Lin F Yang. Provably efficient reinforcement learning with general value function approximation. arXiv preprint arXiv:2005.10804, 2020.
• Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.
• Lin F Yang and Mengdi Wang. Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.
• Hengshuai Yao, Csaba Szepesvári, Bernardo Avila Pires, and Xinhua Zhang. Pseudo-mdps and factored linear action models. In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 1–9. IEEE, 2014.
• Andrea Zanette, Alessandro Lazaric, Mykel J Kochenderfer, and Emma Brunskill. Limiting extrapolation in linear approximate value iteration. In Advances in Neural Information Processing Systems, pages 5616–5625, 2019.

Qiwen Cui
