# Naive Exploration is Optimal for Online LQR

ICML, pp. 8937-8948, 2020.


Abstract:

We consider the problem of online adaptive control of the linear quadratic regulator, where the true system parameters are unknown. We prove new upper and lower bounds demonstrating that the optimal regret scales as $\widetilde{\Theta}({\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T}})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ …

Introduction
• Reinforcement learning has recently achieved great success in application domains including Atari [Mnih et al, 2015], Go [Silver et al, 2017], and robotics [Lillicrap et al, 2015]
• All of these breakthroughs leverage data-driven methods for continuous control in large state spaces.
• Their success, along with challenges in deploying RL in the real world, has led to renewed interest in developing continuous control algorithms with improved reliability and sample efficiency.
Highlights
• Reinforcement learning has recently achieved great success in application domains including Atari [Mnih et al, 2015], Go [Silver et al, 2017], and robotics [Lillicrap et al, 2015]
• Theoretical results for the continuous control setting have been more elusive, with progress spread across various models [Kakade et al, 2003, Munos and Szepesvari, 2008, Jiang et al, 2017, Jin et al, 2019], but the linear-quadratic regulator (LQR) problem has recently emerged as a candidate for a standard benchmark for continuous control and RL
• We address a curious question raised by this work: Is sophisticated exploration helpful for the linear-quadratic regulator, or is linear control truly substantially easier than the general reinforcement learning setting? More broadly, we aim to shed light on the question: To what extent do sophisticated exploration strategies improve learning in online linear-quadratic control?
• The online linear-quadratic regulator setting we study was introduced by Abbasi-Yadkori and Szepesvari [2011], which considers the problem of controlling an unknown linear system under stationary stochastic noise. They showed that an algorithm based on the optimism in the face of uncertainty (OFU) principle enjoys $\sqrt{T}$ regret, but their algorithm is computationally inefficient and their regret bound depends exponentially on dimension
• We have established that the asymptotically optimal regret for the online linear-quadratic regulator problem is $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, and that this rate is attained by ε-greedy exploration
Results
• The authors state the main upper and lower bounds for online LQR and give a high-level overview of the proof techniques behind both results.
• At the end of the section, the authors instantiate and compare the two bounds for the simple special case of strongly stable systems
• Both the upper and lower bounds are motivated by the following question: Suppose that the learner is selecting near-optimal control inputs $u_t \approx K_\star x_t$, where $K_\star = K_\infty(A_\star, B_\star)$ is the optimal controller for the true system $(A_\star, B_\star)$.
• The local minimax lower bound immediately implies a lower bound on the global minimax complexity as well.
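For reference, the notation $K_\infty(A, B)$ used above is the standard infinite-horizon LQR controller. Under the usual sign convention $u_t = K x_t$, it is obtained from the discrete algebraic Riccati equation (this is standard LQR background, not a result specific to the paper):

```latex
P_\infty = A^\top P_\infty A
  - A^\top P_\infty B \left(R_u + B^\top P_\infty B\right)^{-1} B^\top P_\infty A
  + R_x,
\qquad
K_\infty(A, B) = -\left(R_u + B^\top P_\infty B\right)^{-1} B^\top P_\infty A,
```

where $R_x, R_u \succ 0$ are the state and input cost matrices.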
Conclusion
• The authors have established that the asymptotically optimal regret for the online LQR problem is $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, and that this rate is attained by ε-greedy exploration.
• On the purely technical side, recall that while the upper and lower bounds match in terms of $d_{\mathbf{u}}$, $d_{\mathbf{x}}$, and $T$, they differ in their polynomial dependence on $\|P_\star\|_{\mathrm{op}}$.
• Does closing this gap require new algorithmic techniques, or will a better analysis suffice?
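The ε-greedy certainty-equivalence scheme discussed here can be illustrated with a minimal one-dimensional simulation: play the optimal controller for the current least-squares estimate of the system, inject decaying Gaussian exploration noise, and continually refit the estimate. This is a hedged sketch only; the constants, the refit schedule, and the $t^{-1/4}$ noise scale are illustrative choices for the example, not the paper's tuned parameters.

```python
import random

def dare_scalar(a, b, q=1.0, r=1.0, iters=200):
    """Fixed-point iteration for the scalar discrete algebraic Riccati equation."""
    p = q
    for _ in range(iters):
        p_next = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
        if abs(p_next - p) < 1e-12:
            break
        p = p_next
    return p

def optimal_gain(a, b, q=1.0, r=1.0):
    """Gain K for the control law u = -K x under state cost q and input cost r."""
    p = dare_scalar(a, b, q, r)
    return a * b * p / (r + b * b * p)

def least_squares(data):
    """Estimate (a, b) from transitions (x, u, x_next) via 2x2 normal equations."""
    sxx = sxu = suu = sxy = suy = 0.0
    for x, u, y in data:
        sxx += x * x; sxu += x * u; suu += u * u
        sxy += x * y; suy += u * y
    det = sxx * suu - sxu * sxu
    return (suu * sxy - sxu * suy) / det, (sxx * suy - sxu * sxy) / det

def egreedy_lqr(a_true=0.9, b_true=1.0, T=5000, seed=0):
    """Certainty-equivalence control with decaying Gaussian exploration noise."""
    rng = random.Random(seed)
    a_hat, b_hat = 0.5, 0.5          # crude initial guesses for the unknown system
    x, data = 0.0, []
    for t in range(1, T + 1):
        k = optimal_gain(a_hat, b_hat)
        sigma = t ** -0.25           # exploration scale decaying like t^{-1/4}
        u = -k * x + sigma * rng.gauss(0.0, 1.0)
        x_next = a_true * x + b_true * u + 0.1 * rng.gauss(0.0, 1.0)
        data.append((x, u, x_next))
        x = x_next
        if t % 100 == 0:             # continually refit the model by least squares
            a_hat, b_hat = least_squares(data)
    return a_hat, b_hat
```

On a scalar system with $(a, b) = (0.9, 1.0)$, the estimates returned by `egreedy_lqr` approach the true parameters as exploration decays, while the controller stabilizes the state throughout.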
Related work
• Non-asymptotic guarantees for learning linear dynamical systems have been the subject of intense recent interest [Dean et al, Hazan et al, 2017, Tu and Recht, 2018, Hazan et al, 2018, Simchowitz et al, 2018, Sarkar and Rakhlin, 2019, Simchowitz et al, 2019, Mania et al, 2019, Sarkar et al, 2019]. The online LQR setting we study was introduced by Abbasi-Yadkori and Szepesvari [2011], which considers the problem of controlling an unknown linear system under stationary stochastic noise. They showed that an algorithm based on the optimism in the face of uncertainty (OFU) principle enjoys $\sqrt{T}$ regret, but their algorithm is computationally inefficient and their regret bound depends exponentially on dimension. The problem was revisited by Dean et al [2018], who showed that an explicit explore-exploit scheme based on ε-greedy exploration and certainty equivalence achieves $T^{2/3}$ regret efficiently, and left the question of obtaining $\sqrt{T}$ regret efficiently as an open problem. This issue was subsequently addressed by Faradonbeh et al [2018a] and Mania et al [2019], who showed that certainty equivalence obtains $\sqrt{T}$ regret, and Cohen et al [2019], who achieve $\sqrt{T}$ regret using a semidefinite programming relaxation for the OFU scheme. The regret bounds in Faradonbeh et al [2018a] do not specify dimension dependence, and (for $d_{\mathbf{x}} \geq d_{\mathbf{u}}$) the dimension scaling of Cohen et al [2019] can be as large as $d_{\mathbf{x}}^{16}\sqrt{T}$; Mania et al [2019] incur a dimension dependence of $\sqrt{d_{\mathbf{x}}^3 T}$ (suboptimal when $d_{\mathbf{u}} \ll d_{\mathbf{x}}$), but at the expense of imposing a strong controllability assumption.

The question of whether regret for online LQR could be improved further (for example, to $\log T$) remained open, and was left as a conjecture by Faradonbeh et al [2018b]. Our lower bounds resolve this conjecture by showing that $\sqrt{T}$ regret is optimal. Moreover, by refining the upper bounds of Mania et al [2019], our results show that the asymptotically optimal regret is $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, and that this is achieved by certainty equivalence. Beyond attaining the optimal dimension dependence, our upper bounds also enjoy refined dependence on problem parameters, and do not require a priori knowledge of these parameters.
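For quick reference, the regret rates discussed above can be collected in one display (log factors suppressed; dimension dependences as reported in the paragraphs above):

```latex
\begin{align*}
\text{Abbasi-Yadkori \& Szepesvari [2011]:}\quad
  & \widetilde{O}(\sqrt{T}) \ \text{(inefficient; exponential dimension dependence)}\\
\text{Dean et al [2018]:}\quad
  & \widetilde{O}(T^{2/3}) \ \text{(efficient)}\\
\text{Mania et al [2019]:}\quad
  & \widetilde{O}\bigl(\sqrt{d_{\mathbf{x}}^3 T}\bigr)
    \ \text{(strong controllability assumption)}\\
\text{Cohen et al [2019]:}\quad
  & \text{up to } d_{\mathbf{x}}^{16}\sqrt{T}\\
\text{This paper:}\quad
  & \widetilde{\Theta}\bigl(\sqrt{d_{\mathbf{u}}^2\, d_{\mathbf{x}}\, T}\bigr)
    \ \text{(matching upper and lower bounds)}
\end{align*}
```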
Reference
• Yasin Abbasi-Yadkori and Csaba Szepesvari. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
• Marc Abeille and Alessandro Lazaric. Thompson sampling for linear-quadratic control problems. In Artificial Intelligence and Statistics, pages 1246–1254, 2017.
• Marc Abeille and Alessandro Lazaric. Improved regret bounds for thompson sampling in linear quadratic control problems. In International Conference on Machine Learning, pages 1–9, 2018.
• Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning, pages 111–119, 2019a.
• Naman Agarwal, Elad Hazan, and Karan Singh. Logarithmic regret for online control. In Advances in Neural Information Processing Systems 32, pages 10175–10184. 2019b.
• Ery Arias-Castro, Emmanuel J Candes, and Mark A Davenport. On the fundamental limits of adaptive sensing. IEEE Transactions on Information Theory, 59(1):472–481, 2012.
• Patrice Assouad. Deux remarques sur l’estimation. Comptes rendus des seances de l’Academie des sciences. Serie 1, Mathematique, 296(23):1021–1024, 1983.
• Mohammad Gheshlaghi Azar, Ian Osband, and Remi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
• Dimitri P Bertsekas. Dynamic Programming and Optimal Control, Vol. I. Athena Scientific, 2005.
• Nicoletta Bof, Ruggero Carli, and Luca Schenato. Lyapunov theory for discrete time systems. arXiv preprint arXiv:1809.05289, 2018.
• Stephen Boyd. Lecture 13: Linear quadratic lyapunov theory. EE363 Course Notes, Stanford University, 2008.
• Alon Cohen, Avinatan Hasidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1028–1037, 2018.
• Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret. In International Conference on Machine Learning, pages 1300–1309, 2019.
• Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
• Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
• Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Input perturbations for adaptive regulation and learning. arXiv preprint arXiv:1811.04258, 2018a.
• Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1466–1475, 2018.
• Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169–192, 2007.
• Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. In Advances in Neural Information Processing Systems, pages 6702–6712, 2017.
• Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear dynamical systems. In Advances in Neural Information Processing Systems, pages 4634–4643, 2018.
• Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
• Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low bellman rank are pac-learnable. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1704–1713. JMLR. org, 2017.
• Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
• Sham Kakade, Michael J Kearns, and John Langford. Exploration in metric state spaces. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 306–312, 2003.
• Michael J Kearns, Yishay Mansour, and Andrew Y Ng. Approximate planning in large pomdps via reusable trajectories. In Advances in Neural Information Processing Systems, pages 1001–1007, 2000.
• John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 817–824.
• Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
• Bo Lincoln and Anders Rantzer. Relaxing dynamic programming. IEEE Transactions on Automatic Control, 51(8):1249–1260, 2006.
• Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, pages 10154–10164, 2019.
• Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
• Remi Munos and Csaba Szepesvari. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
• Yi Ouyang, Mukul Gagrani, and Rahul Jain. Control of unknown linear systems with thompson sampling. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1198–1205. IEEE, 2017.
• Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression. In Conference on Learning Theory, 2014.
• Tuhin Sarkar and Alexander Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In International Conference on Machine Learning, pages 5610–5618, 2019.
• Tuhin Sarkar, Alexander Rakhlin, and Munther A. Dahleh. Finite-Time System Identification for Partially Observed LTI Systems of Unknown Order. arXiv preprint arXiv:1902.01848, 2019.
• Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, pages 3–24, 2013.
• David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
• Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Conference On Learning Theory, pages 439–473, 2018.
• Max Simchowitz, Ross Boczar, and Benjamin Recht. Learning linear dynamical systems with semi-parametric least squares. In Conference on Learning Theory, pages 2714–2802, 2019.
• Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. 2018.
• Paolo Tilli. Singular values and eigenvalues of non-hermitian block toeplitz matrices. Linear Algebra and its Applications, 272(1-3):59–89, 1998.
• Stephen Tu and Benjamin Recht. Least-squares temporal difference learning for the linear quadratic regulator. In International Conference on Machine Learning, pages 5005–5014, 2018.
• Volodya Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.
• Bin Yu. Assouad, fano, and le cam. In Festschrift for Lucien Le Cam, pages 423–435.