Naive Exploration is Optimal for Online LQR

ICML, pp. 8937-8948, 2020.

Abstract:

We consider the problem of online adaptive control of the linear quadratic regulator, where the true system parameters are unknown. We prove new upper and lower bounds demonstrating that the optimal regret scales as $\widetilde{\Theta}({\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T}})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state. […]

Introduction
  • Reinforcement learning has recently achieved great success in application domains including Atari [Mnih et al., 2015], Go [Silver et al., 2017], and robotics [Lillicrap et al., 2015].
  • All of these breakthroughs leverage data-driven methods for continuous control in large state spaces.
  • Their success, along with challenges in deploying RL in the real world, has led to renewed interest in developing continuous control algorithms with improved reliability and sample efficiency.
Highlights
  • Reinforcement learning has recently achieved great success in application domains including Atari [Mnih et al., 2015], Go [Silver et al., 2017], and robotics [Lillicrap et al., 2015].
  • Theoretical results for the continuous control setting have been more elusive, with progress spread across various models [Kakade et al., 2003, Munos and Szepesvari, 2008, Jiang et al., 2017, Jin et al., 2019], but the linear-quadratic regulator (LQR) problem has recently emerged as a candidate for a standard benchmark for continuous control and RL.
  • We address a curious question raised by this work: Is sophisticated exploration helpful for the linear-quadratic regulator, or is linear control truly substantially easier than the general reinforcement learning setting? More broadly, we aim to shed light on the question: To what extent do sophisticated exploration strategies improve learning in online linear-quadratic control?
  • The online linear-quadratic regulator setting we study was introduced by Abbasi-Yadkori and Szepesvari [2011], which considers the problem of controlling an unknown linear system under stationary stochastic noise. They showed that an algorithm based on the optimism in the face of uncertainty (OFU) principle enjoys $\sqrt{T}$ regret, but their algorithm is computationally inefficient and its regret bound depends exponentially on dimension.
  • We have established that the asymptotically optimal regret for the online linear-quadratic regulator problem is $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, and that this rate is attained by ε-greedy exploration; a minimal illustrative sketch of this strategy follows this list.
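The "naive exploration" referred to above is ε-greedy certainty equivalence: estimate (A, B) by least squares from observed transitions, play the LQR controller computed for the estimated system, and add decaying Gaussian exploration noise to the inputs. The Python sketch below is illustrative only; the toy system, the refit schedule, and the noise-decay rate are placeholder choices, not the algorithm or constants analyzed in the paper.

    # Illustrative epsilon-greedy / certainty-equivalent control for online LQR.
    # Not the paper's exact algorithm: warm-up, epoching, and noise schedule are simplified.
    import numpy as np
    from scipy.linalg import solve_discrete_are

    rng = np.random.default_rng(0)

    def lqr_gain(A, B, Q, R):
        """Infinite-horizon LQR gain K for x_{t+1} = A x_t + B u_t with u_t = K x_t."""
        P = solve_discrete_are(A, B, Q, R)
        return -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

    # Toy "true" system, unknown to the learner.
    d_x, d_u = 3, 2
    A_star = 0.9 * np.eye(d_x)
    B_star = 0.3 * rng.standard_normal((d_x, d_u))
    Q, R = np.eye(d_x), np.eye(d_u)

    T = 5000
    x = np.zeros(d_x)
    K_hat = np.zeros((d_u, d_x))            # trivial gain before the first estimate
    X, U, Xnext = [], [], []                # transition data for least squares

    for t in range(1, T + 1):
        sigma_t = min(1.0, t ** -0.25)      # exploration std; variance decays like 1/sqrt(t)
        u = K_hat @ x + sigma_t * rng.standard_normal(d_u)    # epsilon-greedy input
        x_next = A_star @ x + B_star @ u + 0.1 * rng.standard_normal(d_x)
        X.append(x); U.append(u); Xnext.append(x_next)
        x = x_next

        if t % 500 == 0:                    # periodically refit and recompute the controller
            Z = np.hstack([np.array(X), np.array(U)])         # regressors [x_t, u_t]
            Theta, *_ = np.linalg.lstsq(Z, np.array(Xnext), rcond=None)
            A_hat, B_hat = Theta[:d_x].T, Theta[d_x:].T       # x_{t+1} ~ A x_t + B u_t
            K_hat = lqr_gain(A_hat, B_hat, Q, R)

    print("final certainty-equivalent gain:\n", K_hat)

The point of the paper's analysis is that this kind of undirected noise injection, combined with certainty equivalence, already matches the $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$ lower bound up to logarithmic and problem-dependent factors.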
Results
  • The authors state the main upper and lower bounds for online LQR and give a high-level overview of the proof techniques behind both results.
  • At the end of the section, the authors instantiate and compare the two bounds for the simple special case of strongly stable systems.
  • Both the upper and lower bounds are motivated by the following question: Suppose that the learner is selecting near-optimal control inputs $u_t \approx K_\star x_t$, where $K_\star = K_\infty(A_\star, B_\star)$ is the optimal controller for the system $(A_\star, B_\star)$; how well can the learner then hope to estimate the system from the resulting data? (A short worked note on this point follows the list.)
  • The local minimax lower bound immediately implies a lower bound on the global minimax complexity as well.
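To make the informational obstruction behind this question concrete, the following LaTeX note spells out the standard identifiability argument in the notation above (our paraphrase, not text from the paper):

    % If the learner plays exactly greedy inputs u_t = K_* x_t, the observed
    % dynamics depend on (A_*, B_*) only through the closed-loop matrix:
    \[
      x_{t+1} = A_\star x_t + B_\star u_t + w_t
              = (A_\star + B_\star K_\star)\, x_t + w_t .
    \]
    % Consequently, any alternative pair (A, B) = (A_* - \Delta K_*, B_* + \Delta)
    % induces the same closed-loop matrix, since
    \[
      (A_\star - \Delta K_\star) + (B_\star + \Delta) K_\star
        = A_\star + B_\star K_\star ,
    \]
    % and is therefore indistinguishable from (A_*, B_*) without injecting
    % exploratory inputs. Balancing the cost of such exploration against the
    % cost of misestimation is what drives the sqrt(T)-type lower bound.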
Conclusion
  • The authors have established that the asymptotically optimal regret for the online LQR problem is $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, and that this rate is attained by ε-greedy exploration.
  • On the purely technical side, recall that while the upper and lower bounds match in terms of $d_{\mathbf{u}}$, $d_{\mathbf{x}}$, and $T$, they differ in their polynomial dependence on $\|P_\star\|_{\mathrm{op}}$.
  • Does closing this gap require new algorithmic techniques, or will a better analysis suffice?
Related work
  • Non-asymptotic guarantees for learning linear dynamical systems have been the subject of intense recent interest [Dean et al., Hazan et al., 2017, Tu and Recht, 2018, Hazan et al., 2018, Simchowitz et al., 2018, Sarkar and Rakhlin, 2019, Simchowitz et al., 2019, Mania et al., 2019, Sarkar et al., 2019]. The online LQR setting we study was introduced by Abbasi-Yadkori and Szepesvari [2011], which considers the problem of controlling an unknown linear system under stationary stochastic noise. They showed that an algorithm based on the optimism in the face of uncertainty (OFU) principle enjoys $\sqrt{T}$ regret, but their algorithm is computationally inefficient and its regret bound depends exponentially on dimension. The problem was revisited by Dean et al. [2018], who showed that an explicit explore-exploit scheme based on ε-greedy exploration and certainty equivalence achieves $T^{2/3}$ regret efficiently, and left the question of obtaining $\sqrt{T}$ regret efficiently as an open problem. This issue was subsequently addressed by Faradonbeh et al. [2018a] and Mania et al. [2019], who showed that certainty equivalence obtains $\sqrt{T}$ regret, and Cohen et al. [2019], who achieve $\sqrt{T}$ regret using a semidefinite programming relaxation for the OFU scheme. The regret bounds in Faradonbeh et al. [2018a] do not specify dimension dependence, and (for $d_{\mathbf{x}} \geq d_{\mathbf{u}}$) the dimension scaling of Cohen et al. [2019] can be as large as $d_{\mathbf{x}}^{16}\sqrt{T}$; Mania et al. [2019] incur a dimension dependence of $d_{\mathbf{x}}^{3}\sqrt{T}$ (suboptimal when $d_{\mathbf{u}} \ll d_{\mathbf{x}}$), but at the expense of imposing a strong controllability assumption.

    The question of whether the regret for online LQR could be improved further (for example, to $\log T$) remained open, and was left as a conjecture by Faradonbeh et al. [2018b]. Our lower bounds resolve this conjecture by showing that $\sqrt{T}$ regret is optimal. Moreover, by refining the upper bounds of Mania et al. [2019], our results show that the asymptotically optimal regret is $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, and that this rate is achieved by certainty equivalence. Beyond attaining the optimal dimension dependence, our upper bounds also enjoy refined dependence on problem parameters, and do not require a priori knowledge of these parameters.
References
  • Yasin Abbasi-Yadkori and Csaba Szepesvari. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
  • Marc Abeille and Alessandro Lazaric. Thompson sampling for linear-quadratic control problems. In Artificial Intelligence and Statistics, pages 1246–1254, 2017.
  • Marc Abeille and Alessandro Lazaric. Improved regret bounds for Thompson sampling in linear quadratic control problems. In International Conference on Machine Learning, pages 1–9, 2018.
  • Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning, pages 111–119, 2019a.
  • Naman Agarwal, Elad Hazan, and Karan Singh. Logarithmic regret for online control. In Advances in Neural Information Processing Systems 32, pages 10175–10184, 2019b.
  • Ery Arias-Castro, Emmanuel J. Candes, and Mark A. Davenport. On the fundamental limits of adaptive sensing. IEEE Transactions on Information Theory, 59(1):472–481, 2012.
  • Patrice Assouad. Deux remarques sur l'estimation. Comptes rendus des seances de l'Academie des sciences. Serie 1, Mathematique, 296(23):1021–1024, 1983.
  • Mohammad Gheshlaghi Azar, Ian Osband, and Remi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272, 2017.
  • Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. I. Athena Scientific, 2005.
  • Nicoletta Bof, Ruggero Carli, and Luca Schenato. Lyapunov theory for discrete time systems. arXiv preprint arXiv:1809.05289, 2018.
  • Stephen Boyd. Lecture 13: Linear quadratic Lyapunov theory. EE363 Course Notes, Stanford University, 2008.
  • Alon Cohen, Avinatan Hasidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1028–1037, 2018.
  • Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only √T regret. In International Conference on Machine Learning, pages 1300–1309, 2019.
  • Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
  • Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
  • Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Input perturbations for adaptive regulation and learning. arXiv preprint arXiv:1811.04258, 2018a.
  • Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. On optimality of adaptive linear-quadratic regulators. arXiv preprint arXiv:1806.10749, 2018b.
  • Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1466–1475, 2018.
  • Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169–192, 2007.
  • Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. In Advances in Neural Information Processing Systems, pages 6702–6712, 2017.
  • Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear dynamical systems. In Advances in Neural Information Processing Systems, pages 4634–4643, 2018.
  • Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, pages 1704–1713, 2017.
  • Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
  • Sham Kakade, Michael J. Kearns, and John Langford. Exploration in metric state spaces. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 306–312, 2003.
  • Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In Advances in Neural Information Processing Systems, pages 1001–1007, 2000.
  • John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 817–824.
  • Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Bo Lincoln and Anders Rantzer. Relaxing dynamic programming. IEEE Transactions on Automatic Control, 51(8):1249–1260, 2006.
  • Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, pages 10154–10164, 2019.
  • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Remi Munos and Csaba Szepesvari. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
  • Yi Ouyang, Mukul Gagrani, and Rahul Jain. Control of unknown linear systems with Thompson sampling. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1198–1205. IEEE, 2017.
  • Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression. In Conference on Learning Theory, 2014.
  • Tuhin Sarkar and Alexander Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In International Conference on Machine Learning, pages 5610–5618, 2019.
  • Tuhin Sarkar, Alexander Rakhlin, and Munther A. Dahleh. Finite-time system identification for partially observed LTI systems of unknown order. arXiv preprint arXiv:1902.01848, 2019.
  • Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, pages 3–24, 2013.
  • David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
  • Max Simchowitz, Horia Mania, Stephen Tu, Michael I. Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Conference on Learning Theory, pages 439–473, 2018.
  • Max Simchowitz, Ross Boczar, and Benjamin Recht. Learning linear dynamical systems with semi-parametric least squares. In Conference on Learning Theory, pages 2714–2802, 2019.
  • Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. 2018.
  • Paolo Tilli. Singular values and eigenvalues of non-Hermitian block Toeplitz matrices. Linear Algebra and its Applications, 272(1-3):59–89, 1998.
  • Stephen Tu and Benjamin Recht. Least-squares temporal difference learning for the linear quadratic regulator. In International Conference on Machine Learning, pages 5005–5014, 2018.
  • Volodya Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.
  • Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435.