Online learning with dynamics: A minimax perspective

NeurIPS 2020


Abstract

We study the problem of online learning with dynamics, where a learner interacts with a stateful environment over multiple rounds. In each round of the interaction, the learner selects a policy to deploy and incurs a cost that depends on both the chosen policy and the current state of the world. The state-evolution dynamics and the costs are …


Introduction
  • Machine learning systems deployed in the real world interact with people through their decision-making.
  • Given the setup of the previous section, the authors study the online learning with dynamics game between the learner and the adversary from a minimax perspective.
Highlights
  • Machine learning systems deployed in the real world interact with people through their decision-making.
  • We study a counterfactual notion of regret, called policy regret, in which the comparator term is the performance of a policy on the states one would have observed had that policy been deployed from the beginning of time (a schematic definition appears after this list).
  • We study, in full generality, the problem of learnability for a class of online learning problems whose underlying states evolve as a dynamical system.
  • By studying the problem in full generality, we show how several well-studied problems from the literature, including online Markov decision processes [11], online adversarial tracking [1], the online linear quadratic regulator [10], online control with adversarial noise [2], and online learning with memory [5, 4], arise as specific instances of our general framework.
  • While in most cases Empirical Risk Minimization (ERM) does not even achieve low classical regret, let alone low policy regret, we show that an ERM-like strategy in the dual game leads to the two-term decomposition of minimax policy regret mentioned above.
  • We defer the proof of the proposition to Appendix C. The proposition can be seen as a strengthening of the lower bounds (7a) and (7c), showing that for a very large class of problems, the upper bound given by the mini-batching algorithm and the sequential complexity terms are necessary.
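
To make the notion concrete, here is a minimal schematic contrasting policy regret with classical regret. The notation (ℓ_t, π_t, z_{1:t}) is illustrative and need not match the paper's exact definitions:

\[
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell_t(\pi_t) \;-\; \inf_{\pi \in \Pi} \sum_{t=1}^{T} \ell_t(\pi),
\qquad
\mathrm{PolicyRegret}_T \;=\; \sum_{t=1}^{T} \ell_t(\pi_t;\, z_{1:t}) \;-\; \inf_{\pi \in \Pi} \sum_{t=1}^{T} \ell_t(\pi;\, z_{1:t}).
\]

In the policy regret, the comparator's loss ℓ_t(π; z_{1:t}) is evaluated on the state trajectory that π itself would have induced under the dynamics, rather than on the states actually visited by the learner.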
Results
  • Given a sequence of adversarial actions ζ_{1:t−1}, z_t, a dynamics function Φ, and a noise distribution D_w, the authors define the counterfactual loss of a policy π at time t (a schematic version appears after this list).
  • The following theorem provides an upper bound on the value V_T in terms of the dynamic stability parameters of the regularized ERMs above, as well as a sequential Rademacher complexity of the effective loss class Φ ∘ Π := {Φ(π, ⋅) : π ∈ Π}.
  • The theorem says that any non-trivial upper bounds on the stability and sequential complexity terms would guarantee the existence of an online learning algorithm with the corresponding policy regret.
  • The authors' minimax perspective allows them to study the problem in full generality, without assumptions on the policy class Π, the adversarial actions Z, or the underlying dynamics Φ, and to provide sufficient conditions for learnability.
  • Equation (7a) shows that the sequential Rademacher term is necessary, (7b) establishes necessity for the dynamic stability of the regularized ERM, while (7c) shows that the mini-batching upper bound is tight.
  • With the lower bounds given in Theorem 2, it is natural to ask whether the sufficient conditions in Theorem 1 and Proposition 2 are necessary for every instance of the online learning with dynamics problem.
  • Here β^τ_{ERM,t} denotes the dynamic mixability parameters of the mini-batching ERM. (b) Given a policy class Π and a dynamics function Φ, there exists an online learning with dynamics problem over (Π, Z, Φ) and a universal constant c > 0 for which the stated lower bound holds.
  • This proposition can be seen as a strengthening of the lower bounds (7a) and (7c), showing that for a very large class of problems, the upper bound given by the mini-batching algorithm and the sequential complexity terms are necessary.
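
As a reading aid, here is a rough sketch of the counterfactual loss and of the shape of the two-term bound on V_T. The symbols x_t^π and β_t are assumed for illustration, and the display omits constants and the exact conditions of Theorem 1:

\[
x^{\pi}_{t} \;=\; \text{state reached at round } t \text{ when } \pi \text{ is played from round } 1 \text{ onward, given } \zeta_{1:t-1} \text{ and noise } w \sim D_w,
\qquad
\ell_t(\pi;\, \zeta_{1:t-1}, z_t) \;=\; \text{cost of } \pi \text{ in state } x^{\pi}_{t} \text{ against } z_t,
\]
\[
V_T \;\lesssim\; \underbrace{\sum_{t=1}^{T} \beta_t}_{\text{dynamic stability of the regularized ERMs}} \;+\; \underbrace{\mathfrak{R}^{\mathrm{seq}}_{T}\!\big(\Phi \circ \Pi\big)}_{\text{sequential Rademacher complexity of } \Phi \circ \Pi}.
\]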
Conclusion
  • The authors look at specific examples of the online learning with dynamics problem and obtain learnability guarantees for these instances using the upper bounds from Theorem 1.
  • The authors consider a simplified version of the setup from Agarwal et al. [2], where the adversary is allowed to perturb the dynamics at each time step along with the loss functions (see the schematic below).
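
For context, the online control setting of Agarwal et al. [2] that this example simplifies is, schematically, a linear system with adversarially chosen disturbances and costs; the exact assumptions of the simplified variant studied here may differ:

\[
x_{t+1} \;=\; A\,x_t + B\,u_t + w_t, \qquad u_t = \pi_t(x_t), \qquad \text{cost at round } t:\ \ell_t(x_t, u_t),
\]

with the disturbance w_t (and, in the authors' variant, the dynamics perturbation) and the loss ℓ_t chosen adversarially at each round.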
Related work
  • The classical online learning setup [8] considers a repeated interactive game between a learner and an environment without any notion of underlying dynamics. Sequential complexity measures were introduced in [22] to obtain a tight characterization of minimax regret rates for the classical online learning setting: for the class of online supervised learning problems, the minimax rate can be upper and lower bounded in terms of a sequential Rademacher complexity of the predictor class. The works [18, 7] provided an analog of VC theory for online classification, and the sequential complexity measures of [22] provided such a theory for general supervised online learning. This paper can be seen as deriving such a characterization of learnability and tight rates for the problem of online learning with dynamics. In the more general setting we consider, while the main mathematical tools introduced in [22] are useful, they are not by themselves sufficient because of the complexities of policy regret and the state dynamics. This is evident from our upper bound, which consists of two terms (both of which we show are necessary), only one of which is a sequential Rademacher complexity type term.
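
For reference, the sequential Rademacher complexity of [22] is defined over trees rather than fixed sequences; up to normalization conventions it takes the form

\[
\mathfrak{R}^{\mathrm{seq}}_{T}(\mathcal{F}) \;=\; \sup_{\mathbf{z}} \; \mathbb{E}_{\epsilon}\Big[\, \sup_{f \in \mathcal{F}} \; \sum_{t=1}^{T} \epsilon_t \, f\big(\mathbf{z}_t(\epsilon_{1:t-1})\big) \Big],
\]

where the supremum ranges over Z-valued trees z of depth T, the ε_t are i.i.d. Rademacher signs, and z_t(ε_{1:t−1}) is the node reached by following the first t−1 signs.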
Funding
  • KB is supported by a JP Morgan AI Fellowship
References
  • Y. Abbasi-Yadkori, P. Bartlett, and V. Kanade. Tracking adversarial targets. In International Conference on Machine Learning, pages 369–377, 2014.
  • N. Agarwal, B. Bullins, E. Hazan, S. M. Kakade, and K. Singh. Online control with adversarial disturbances. arXiv preprint arXiv:1902.08721, 2019.
  • N. Agarwal, E. Hazan, and K. Singh. Logarithmic regret for online control. In Advances in Neural Information Processing Systems, pages 10175–10184, 2019.
  • O. Anava, E. Hazan, and S. Mannor. Online learning for adversaries with memory: Price of past mistakes. In Advances in Neural Information Processing Systems, pages 784–792, 2015.
  • R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: From regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning, 2012.
  • P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 2002.
  • S. Ben-David, D. Pál, and S. Shalev-Shwartz. Agnostic online learning. In Conference on Learning Theory, 2009.
  • N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • L. Chen, Q. Yu, H. Lawrence, and A. Karbasi. Minimax regret of switching-constrained online convex optimization: No phase transition. arXiv preprint arXiv:1910.10873, 2019.
  • A. Cohen, A. Hassidim, T. Koren, N. Lazic, Y. Mansour, and K. Talwar. Online linear quadratic control. arXiv preprint arXiv:1806.07104, 2018.
  • E. Even-Dar, S. M. Kakade, and Y. Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
  • D. J. Foster and M. Simchowitz. Logarithmic regret for adversarial online control. arXiv preprint arXiv:2003.00189, 2020.
  • W. Han, A. Rakhlin, and K. Sridharan. Competing with strategies. In Conference on Learning Theory, pages 966–992, 2013.
  • M. Hardt, T. Ma, and B. Recht. Gradient descent learns linear dynamical systems. Journal of Machine Learning Research, 19(1), 2018.
  • E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4):157–325, 2016.
  • A. T. Kalai and R. Sastry. The Isotron algorithm: High-dimensional isotonic regression. In Conference on Learning Theory, 2009.
  • D. E. Kirk. Optimal Control Theory: An Introduction. Courier Corporation, 2004.
  • N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4), 1988.
  • L. Ljung. System identification. Wiley Encyclopedia of Electrical and Electronics Engineering, 1999.
  • N. Merhav, E. Ordentlich, G. Seroussi, and M. J. Weinberger. On sequential strategies for loss functions with memory. IEEE Transactions on Information Theory, 48(7):1947–1958, 2002.
  • A. Rakhlin and K. Sridharan. Statistical Learning and Sequential Prediction. Lecture notes, 2014.
  • A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In Advances in Neural Information Processing Systems, 2010.
  • A. Rakhlin, K. Sridharan, and A. Tewari. Online learning via sequential complexities. Journal of Machine Learning Research, 16(2):155–186, 2015.
  • S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
  • M. Simchowitz and D. J. Foster. Naive exploration is optimal for online LQR. arXiv preprint arXiv:2001.09576, 2020.
  • R. F. Stengel. Optimal Control and Estimation. Courier Corporation, 1994.
  • A. S. Suggala and P. Netrapalli. Online non-convex learning: Following the perturbed leader is optimal. arXiv preprint arXiv:1903.08110, 2019.
  • V. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16, 1971.
Authors
Kush Bhatia
Karthik Sridharan