Making Non-Stochastic Control (Almost) as Easy as Stochastic

NeurIPS 2020.

Abstract:

Recent literature has made much progress in understanding online LQR: a modern learning-theoretic take on the classical control problem in which a learner attempts to optimally control an unknown linear dynamical system with fully observed state, perturbed by i.i.d. Gaussian noise. It is now understood that the optimal regret against the optimal control law over a time horizon T scales as Θ(√T).

Introduction
  • A learning agent seeks to minimize cumulative loss in a dynamic environment which responds to its actions.
  • This paper focuses on the widely-studied setting of linear control, where the learner’s environment is described by a continuous state, evolves according to a linear system of equations, is perturbed by process noise, and is guided by inputs chosen by the learner (the standard form of these dynamics is written out after this list).
  • Performance is measured by regret against the optimal LQR control law on a time horizon T, for which the optimal regret rate is Θ(√T) [Cohen et al, 2019, Mania et al, 2019, Simchowitz and Foster, 2020, Cassel et al, 2020]
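
For concreteness, the fully-observed linear dynamics and quadratic costs underlying this setting take the standard form below (generic notation, assumed here for illustration; the paper's exact cost structure may differ), where the process noise w_t is i.i.d. Gaussian in the stochastic online-LQR setting and an arbitrary bounded sequence in the non-stochastic setting studied in this paper:

    x_{t+1} = A x_t + B u_t + w_t, \qquad \ell_t(x_t, u_t) = x_t^\top Q x_t + u_t^\top R u_t.
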
Highlights
  • In control tasks, a learning agent seeks to minimize cumulative loss in a dynamic environment which responds to its actions
  • This paper focuses on the widely-studied setting of linear control, where the learner’s environment is described by a continuous state, evolves according to a linear system of equations, is perturbed by process noise, and is guided by inputs chosen by the learner
  • We propose Disturbance Response Control via Online Newton Step, or Drc-Ons, an adaptive control policy which attains fast rates previously only known for settings with stochastic or semi-stochastic noise [Mania et al, 2019, Simchowitz et al, 2020, Cohen et al, 2019, Agarwal et al, 2019b] (an illustrative sketch of the scheme appears after this list)
  • In this work, we demonstrate that fast rates for online control, and in particular the optimal √T regret rate of Simchowitz and Foster [2020] for the online Linear Quadratic Regulator (LQR) setting, are achievable with non-stochastic noise
  • Future Work It is an interesting direction for future research to determine if non-degenerate observation noise can be used to attain polylogarithmic regret for unknown systems in the semi-stochastic regime considered by Simchowitz et al [2020]
  • Our work assumes only that our system can be stabilized by a static feedback controller, which holds without loss of generality for fully observed systems
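
To make these two ingredients concrete, the sketch below (Python/NumPy) combines a disturbance-response control law u_t = K x_t + Σ_i M^[i] w_{t−i} with an Online Newton Step update of the matrices M^[i]. This is a minimal illustration, not the paper's Drc-Ons: the surrogate per-round loss, the lack of a projection step, the exactly-known disturbances, and the step sizes are all simplifying assumptions made so the example stays short and runnable.

```python
import numpy as np

def drc_ons_sketch(A, B, K, Q, R, T=500, m=5, eta=0.5, eps=1.0, seed=0):
    """Toy disturbance-response controller with an Online Newton Step update.

    Plays u_t = K x_t + sum_i M^[i] w_{t-i} and updates M by an ONS step on a
    simplified surrogate loss. Illustrative only; not the paper's algorithm.
    """
    rng = np.random.default_rng(seed)
    dx, du = B.shape
    M = np.zeros((m, du, dx))          # disturbance-feedback matrices M^[1..m]
    theta = M.reshape(-1)              # flattened parameters for the ONS update
    A_t = eps * np.eye(theta.size)     # ONS matrix: eps*I plus a running sum of g g^T
    x = np.zeros(dx)
    w_hist = np.zeros((m, dx))         # most recent disturbances w_{t-1}, ..., w_{t-m}
    avg_cost = 0.0

    for _ in range(T):
        u = K @ x + np.einsum('iuj,ij->u', M, w_hist)   # disturbance-response control law
        w = 0.1 * rng.standard_normal(dx)               # stand-in disturbance (could be adversarial)
        x_next = A @ x + B @ u + w                      # in practice w is recovered from x_next - A x - B u
        avg_cost += (x @ Q @ x + u @ R @ u) / T

        # Gradient of the control cost u^T R u with respect to M (state-dependent
        # cost terms are ignored in this toy surrogate): dL/dM[i,a,b] = 2 (R u)_a w_{t-i,b}.
        g = 2.0 * np.einsum('u,ij->iuj', R @ u, w_hist).reshape(-1)
        A_t += np.outer(g, g)
        theta = theta - eta * np.linalg.solve(A_t, g)   # Newton-style step; no projection here
        M = theta.reshape(m, du, dx)

        w_hist = np.roll(w_hist, 1, axis=0)             # shift the disturbance window
        w_hist[0] = w
        x = x_next

    return avg_cost

# Example on a scalar system (all matrices hypothetical):
# drc_ons_sketch(A=np.array([[0.9]]), B=np.array([[1.0]]), K=np.array([[-0.5]]), Q=np.eye(1), R=np.eye(1))
```
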
Results
  • Works on online LQR typically measure performance with respect to the regret benchmark R_T^{LQR}, defined as
  • R_T^{LQR} = J_T − T · lim_{n→∞} (1/n)·E_w[J_n(π_K)],
  • where the right-hand term is the infinite-horizon average cost induced by playing the optimal control law K.
  • One can show (e.g. Simchowitz and Foster [2020]) that the average cost (1/n)·E_w[J_n(π_K)] is non-decreasing in n.
  • By Jensen’s inequality, it then holds that for any Π ⊂ Π_ldc containing π_K,
  • E_w[R_T^{LQR}] ≤ E_w[J_T] − E_w[J_T(π_K)] ≤ E_w[J_T] − inf_{π∈Π} E_w[J_T(π)] (this last expression is bounded by the policy regret, restated after this list)
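
For reference, the policy-regret benchmark that bounds the last display can be written as follows; the inequality holds because an expectation of an infimum is at most the infimum of expectations, so a bound on the non-stochastic policy regret immediately bounds the expected LQR regret above. (The comparator class Π is a set of stabilizing linear dynamic controllers; its exact boundedness constraints are omitted here.)

    \mathrm{Regret}_T = J_T - \inf_{\pi \in \Pi} J_T(\pi), \qquad \mathbb{E}_w[\mathrm{Regret}_T] \ \ge\ \mathbb{E}_w[J_T] - \inf_{\pi \in \Pi} \mathbb{E}_w[J_T(\pi)] \ \ge\ \mathbb{E}_w[R_T^{LQR}].
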
Conclusion
  • The Markov operators G[1:h] = (G[i])_{i∈[h]} are first estimated by solving an arg-min (regression) problem over G[1:h] (a generic least-squares form is sketched after this list).
  • Future Work: It is an interesting direction for future research to determine if non-degenerate observation noise can be used to attain polylogarithmic regret for unknown systems in the semi-stochastic regime considered by Simchowitz et al [2020].
  • This regime interpolates between purely stochastic non-degenerate noise and the arbitrary adversarial noise considered in this setting.
  • The authors' work assumes only that the system can be stabilized by a static feedback controller, which holds without loss of generality for fully observed systems
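
A generic instantiation of the arg-min estimation step above is an ordinary least-squares regression of observed outputs y_t onto a length-h window of past inputs; the exact regression used in the paper (for instance, how exploratory inputs and the nominal controller enter) may differ:

    \widehat{G}^{[1:h]} \in \arg\min_{G^{[1:h]}} \ \sum_{t=h+1}^{N} \Big\| y_t - \sum_{i=1}^{h} G^{[i]} u_{t-i} \Big\|_2^2.
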
Summary
  • Introduction:

    A learning agent seeks to minimize cumulative loss in a dynamic environment which responds to its actions.
  • This paper focuses on the widely-studied setting of linear control, where the learner’s environment is described by a continuous state, evolves according to a linear system of equations, is perturbed by process noise, and is guided by inputs chosen by the learner.
  • Performance is measured by regret against the optimal LQR control law on a time horizon T, for which the optimal regret rate is Θ(√T) [Cohen et al, 2019, Mania et al, 2019, Simchowitz and Foster, 2020, Cassel et al, 2020]
  • Objectives:

    The authors' goal is to attain logarithmic memory-regret, and quadratic sensitivity to structured errors.
  • The authors will remove this assumption at the end of the proof.
  • For any comparator in the class C, the goal is to bound the regret.
  • Results:

  • Works on online LQR typically measure performance with respect to the regret benchmark R_T^{LQR}, defined as
  • R_T^{LQR} = J_T − T · lim_{n→∞} (1/n)·E_w[J_n(π_K)],
  • where the right-hand term is the infinite-horizon average cost induced by playing the optimal control law K.
  • One can show (e.g. Simchowitz and Foster [2020]) that the average cost (1/n)·E_w[J_n(π_K)] is non-decreasing in n.
  • By Jensen’s inequality, it then holds that for any Π ⊂ Π_ldc containing π_K,
  • E_w[R_T^{LQR}] ≤ E_w[J_T] − E_w[J_T(π_K)] ≤ E_w[J_T] − inf_{π∈Π} E_w[J_T(π)]
  • Conclusion:

  • The Markov operators G[1:h] = (G[i])_{i∈[h]} are first estimated by solving an arg-min (regression) problem over G[1:h].
  • Future Work: It is an interesting direction for future research to determine if non-degenerate observation noise can be used to attain polylogarithmic regret for unknown systems in the semi-stochastic regime considered by Simchowitz et al [2020].
  • This regime interpolates between purely stochastic non-degenerate noise and the arbitrary adversarial noise considered in this setting.
  • The authors' work assumes only that the system can be stabilized by a static feedback controller, which holds without loss of generality for fully observed systems
Funding
  • MS is generously supported by an Open Philanthropy AI Fellowship
Reference
  • Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
  • Yasin Abbasi-Yadkori, Peter Bartlett, and Varun Kanade. Tracking adversarial targets. In International Conference on Machine Learning, pages 369–377, 2014.
  • Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning, pages 111–119, 2019a.
  • Naman Agarwal, Elad Hazan, and Karan Singh. Logarithmic regret for online control. In Advances in Neural Information Processing Systems, pages 10175–10184, 2019b.
  • Jason Altschuler and Kunal Talwar. Online learning over a finite action set with limited switching. arXiv preprint arXiv:1803.01548, 2018.
  • Oren Anava, Elad Hazan, and Shie Mannor. Online learning for adversaries with memory: price of past mistakes. In Advances in Neural Information Processing Systems, pages 784–792, 2015.
  • Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In International Conference on Machine Learning (ICML), pages 1747–1754, 2012.
  • Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
  • Asaf Cassel, Alon Cohen, and Tomer Koren. Logarithmic regret for learning linear quadratic regulators efficiently. arXiv preprint arXiv:2002.08095, 2020.
  • Lin Chen, Qian Yu, Hannah Lawrence, and Amin Karbasi. Minimax regret of switching-constrained online convex optimization: No phase transition. arXiv preprint arXiv:1910.10873, 2019.
  • Alon Cohen, Avinatan Hassidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1028–1037, 2018.
  • Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only √T regret. In International Conference on Machine Learning, pages 1300–1309, 2019.
  • Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
  • Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. Bandits with switching costs: T^{2/3} regret. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 459–467, 2014.
  • Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.
  • Dylan J Foster and Max Simchowitz. Logarithmic regret for adversarial online control. arXiv preprint arXiv:2003.00189, 2020.
  • Yoram Halevi. Stable LQG controllers. IEEE Transactions on Automatic Control, 39(10):2104–2106, 1994.
  • Elad Hazan. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.
  • Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
  • Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, pages 408–421. PMLR, 2020.
  • Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
  • Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Logarithmic regret bound in partially observable linear dynamical systems. arXiv preprint arXiv:2003.11227, 2020a.
  • Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Regret minimization in partially observable linear quadratic control. arXiv preprint arXiv:2002.00082, 2020b.
  • Yingying Li, Xin Chen, and Na Li. Online optimal control with linear dynamics and predictions: Algorithms and regret analysis. In Advances in Neural Information Processing Systems, pages 14887–14899, 2019.
  • Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, pages 10154–10164, 2019.
  • Max Simchowitz and Dylan J Foster. Naive exploration is optimal for online LQR. arXiv preprint arXiv:2001.09576, 2020.
  • Max Simchowitz, Ross Boczar, and Benjamin Recht. Learning linear dynamical systems with semi-parametric least squares. arXiv preprint arXiv:1902.00768, 2019.
  • Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. arXiv preprint arXiv:2001.09254, 2020.
  • Robert F Stengel. Optimal control and estimation. Courier Corporation, 1994.
  • Dante Youla, Hamid Jabr, and Joseph Bongiorno Jr. Modern Wiener–Hopf design of optimal controllers–Part II: The multivariable case. IEEE Transactions on Automatic Control, 21(3):319–338, 1976.
  • Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.