Dynamic Regret of Convex and Smooth Functions

NeurIPS 2020.

Abstract:

We investigate online convex optimization in non-stationary environments and choose the dynamic regret as the performance measure, defined as the difference between the cumulative loss incurred by the online algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path-length that essentially reflects the non-stationarity of environments; the state-of-the-art dynamic regret is $\mathcal{O}(\sqrt{T(1+P_T)})$. This work exploits smoothness to replace the dependence on $T$ by problem-dependent quantities that can be much smaller in easy problems.

Introduction
  • In many real-world applications, data are inherently accumulated over time, and it is of great importance to develop a learning system that updates in an online fashion.
  • The classic performance measure is the static regret, which is the difference between the cumulative loss incurred by the online algorithm and that of the best fixed decision in hindsight (formal definitions follow this list).
  • The rationale behind such a metric is that the best fixed decision in hindsight is reasonably good over all the iterations.
  • Such an assumption is too optimistic and may not hold in changing environments, where data evolve and the optimal decision drifts over time.
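  • To make the two measures concrete, the standard definitions (notation assumed here: $x_t$ is the learner's decision, $f_t$ the online function over a domain $\mathcal{X}$, and $u_1, \ldots, u_T$ an arbitrary comparator sequence) are:

    $$\mathrm{Reg}_T^{\mathrm{static}} = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x), \qquad \mathrm{Reg}_T^{\mathrm{dynamic}} = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t),$$

    where the path-length $P_T = \sum_{t=2}^{T} \|u_t - u_{t-1}\|_2$ measures how much the comparator sequence drifts.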
Highlights
  • In many real-world applications, data are inherently accumulated over time, and it is of great importance to develop a learning system that updates in an online fashion
  • We exploit smoothness to enhance the dynamic regret, with the aim to replace the time horizon $T$ in the state-of-the-art $\mathcal{O}(\sqrt{T(1+P_T)})$ bound by problem-dependent quantities that are at most $\mathcal{O}(T)$ but can be much smaller in easy problems. We achieve this goal by proposing two meta-expert algorithms: Sword_var, which attains a gradient-variation bound of order $\mathcal{O}(\sqrt{(1+P_T+V_T)(1+P_T)})$, and Sword_small, which enjoys a small-loss bound of order $\mathcal{O}(\sqrt{(1+P_T+F_T)(1+P_T)})$
  • $V_T$ measures the variation in gradients, and $F_T$ is the cumulative loss of the comparator sequence (definitions follow this list)
  • Our dynamic regret bounds are universal in the sense that they hold against any feasible comparator sequence, and thus the algorithms are more adaptive to non-stationary environments
  • We present the lower bound for dynamic regret of convex and smooth functions, showing the tightness of our obtained upper bounds
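  • For reference, the two problem-dependent quantities are the gradient variation and the cumulative comparator loss; their standard forms (consistent with the description above, with the exact norms an assumption here) are:

    $$V_T = \sum_{t=2}^{T} \sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2, \qquad F_T = \sum_{t=1}^{T} f_t(u_t).$$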
Methods
  • Expert-algorithms: OEGD (for Sword_var), OGD (for Sword_small), and OEGD & OGD together (for Sword_best), each combined with a meta-algorithm.
  • Meta-algorithm: the authors adopt the OptimisticHedge algorithm along with the linearized surrogate loss, where the weight vector $p_{t+1} \in \Delta_{N_1+N_2}$ is updated according to

    $$p_{t+1,i} \propto \exp\Big(-\varepsilon\Big(\sum_{s=1}^{t} \ell_{s,i} + m_{t+1,i}\Big)\Big), \quad i \in [N_1+N_2],$$

    where the optimism $m_{t+1} \in \mathbb{R}^{N_1+N_2}$ is a prediction of the next-round loss vector (a sketch follows this list).
  • In order to equip the meta-algorithm with both kinds of adaptivity ($V_T$ and $F_T$), it is crucial to design a best-of-both-worlds optimism.
  • The authors set the optimism $m_{t+1}$ in the following way: for each $i \in [N_1 + N_2]$, ...
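  • The following is a minimal sketch of the OptimisticHedge weight update described above (illustrative only: the cumulative surrogate losses and the optimism vector are hypothetical inputs, and a fixed learning rate eps is an assumption):

    import numpy as np

    def optimistic_hedge_weights(cum_loss, m_next, eps):
        """One OptimisticHedge step:
        p_{t+1,i} proportional to exp(-eps * (sum_{s<=t} ell_{s,i} + m_{t+1,i}))."""
        logits = -eps * (cum_loss + m_next)
        logits -= logits.max()          # shift for numerical stability
        p = np.exp(logits)
        return p / p.sum()              # normalize onto the simplex

    # Toy usage with N1 + N2 = 3 experts.
    cum_loss = np.array([1.2, 0.7, 1.5])   # sum of past surrogate losses
    m_next = np.array([0.1, 0.3, 0.2])     # hypothetical optimism vector
    print(optimistic_hedge_weights(cum_loss, m_next, eps=0.5))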
Conclusion
  • The authors exploit smoothness to enhance the dynamic regret, with the aim to replace the time horizon $T$ in the state-of-the-art $\mathcal{O}(\sqrt{T(1+P_T)})$ bound by problem-dependent quantities that are at most $\mathcal{O}(T)$ but can be much smaller in easy problems
  • The authors achieve this goal by proposing two meta-expert algorithms: Sword_var, which attains a gradient-variation bound of order $\mathcal{O}(\sqrt{(1+P_T+V_T)(1+P_T)})$, and Sword_small, which enjoys a small-loss bound of order $\mathcal{O}(\sqrt{(1+P_T+F_T)(1+P_T)})$.
  • The authors will investigate the possibility of incorporating other function curvatures, such as strong convexity or exp-concavity, into the analysis of the universal dynamic regret.
Tables
  • Table 1: Summary of the expert-algorithms and meta-algorithms as well as the different optimism used in the proposed algorithms (including the three variants of Sword)
Related work
  • We present a brief review of static and dynamic regret minimization for online convex optimization.

    2.1 Static Regret

    Static regret has been extensively studied in online convex optimization. Let $T$ be the time horizon and $d$ be the dimension; there exist online algorithms with static regret bounded by $\mathcal{O}(\sqrt{T})$, $\mathcal{O}(d \log T)$, and $\mathcal{O}(\log T)$ for convex, exponentially concave, and strongly convex functions, respectively (Zinkevich, 2003; Hazan et al., 2007); see the OGD sketch at the end of this section for the convex case. These results are proved to be minimax optimal (Abernethy et al., 2008). More results can be found in the seminal books (Shalev-Shwartz, 2012; Hazan, 2016) and references therein.

    In addition to exploiting the convexity of functions, there are studies improving the static regret by incorporating smoothness, whose main proposal is to replace the dependence on $T$ by problem-dependent quantities. Such problem-dependent bounds enjoy more benign properties; in particular, they safeguard the worst-case minimax rate yet can be much tighter in easy problem instances. In the literature, there are two kinds of such bounds: small-loss bounds (Srebro et al., 2010) and gradient-variation bounds (Chiang et al., 2012).
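    As a concrete instance of the $\mathcal{O}(\sqrt{T})$ convex case above, here is a minimal sketch of online gradient descent (Zinkevich, 2003) with step size $\eta_t \propto 1/\sqrt{t}$; the ball-shaped domain and the toy quadratic losses are illustrative assumptions:

    import numpy as np

    def ogd(grad_fn, T, dim, radius=1.0):
        """Online gradient descent with eta_t = radius / sqrt(t),
        projecting onto the Euclidean ball of the given radius."""
        x = np.zeros(dim)
        for t in range(1, T + 1):
            g = grad_fn(t, x)                    # gradient of f_t at x_t
            x = x - (radius / np.sqrt(t)) * g    # descent step
            norm = np.linalg.norm(x)
            if norm > radius:                    # project back onto the domain
                x *= radius / norm
            yield x

    # Toy usage: drifting quadratic losses f_t(x) = 0.5 * ||x - z_t||^2.
    rng = np.random.default_rng(0)
    targets = rng.normal(size=(100, 2))
    decisions = list(ogd(lambda t, x: x - targets[t - 1], T=100, dim=2))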
Reference
  • Jacob D. Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 415–424, 2008.
  • Dmitry Adamskiy, Wouter M. Koolen, Alexey V. Chernov, and Vladimir Vovk. A closer look at adaptive regret. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory (ALT), pages 290–304, 2012.
  • Peter Auer, Nicolo Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002.
  • Peter Auer, Yifang Chen, Pratik Gajane, Chung-Wei Lee, Haipeng Luo, Ronald Ortner, and Chen-Yu Wei. Achieving optimal dynamic regret for non-stationary bandits without prior information. In Proceedings of the 32nd Conference on Learning Theory (COLT), pages 159–163, 2019.
  • Dheeraj Baby and Yu-Xiang Wang. Online forecasting of total-variation-bounded sequences. In Advances in Neural Information Processing Systems 32 (NeurIPS), pages 11071–11081, 2019.
  • Omar Besbes, Yonatan Gur, and Assaf J. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.
  • Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • Nicolo Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
  • Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. In Proceedings of the 18th Annual Conference on Learning Theory (COLT), pages 217–232, 2005.
  • Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Proceedings of the 25th Conference On Learning Theory (COLT), pages 6.1–6.20, 2012.
  • Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
  • Pierre Gaillard, Gilles Stoltz, and Tim van Erven. A second-order bound with excess losses. In Proceedings of The 27th Conference on Learning Theory (COLT), pages 176–196, 2014.
  • Elad Hazan. Introduction to Online Convex Optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
  • Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 57–68, 2008.
  • Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
  • Mark Herbster and Manfred K. Warmuth. Tracking the best expert. Machine Learning, 32 (2):151–178, 1998.
  • Mark Herbster and Manfred K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, 2001.
  • Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
  • Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
  • Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Proceedings of the 28th Annual Conference on Computational Learning Theory (COLT), pages 1286–1304, 2015.
  • Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In Proceedings of the 55th IEEE Conference on Decision and Control (CDC), pages 7195–7201, 2016.
  • Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Proceedings of the 26th Conference On Learning Theory (COLT), pages 993–1019, 2013.
  • Shai Shalev-Shwartz. Online Learning: Theory, Algorithms and Applications. PhD Thesis, 2007.
  • Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
  • Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 23 (NIPS), pages 2199–2207, 2010.
  • Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems 28 (NIPS), pages 2989–2997, 2015.
  • Tim van Erven and Wouter M. Koolen. MetaGrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29 (NIPS), pages 3666–3674, 2016.
  • Tianbao Yang, Lijun Zhang, Rong Jin, and Jinfeng Yi. Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 449–457, 2016.
  • Lijun Zhang, Tianbao Yang, Jinfeng Yi, Rong Jin, and Zhi-Hua Zhou. Improved dynamic regret for non-degeneracy functions. In Advances in Neural Information Processing Systems 30 (NIPS), 2017.
  • Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31 (NeurIPS), pages 1330–1340, 2018.
  • Peng Zhao and Lijun Zhang. Improved analysis for dynamic regret of strongly convex and smooth functions. ArXiv preprint, arXiv:2006.05876, 2020.
  • Peng Zhao, Guanghui Wang, Lijun Zhang, and Zhi-Hua Zhou. Bandit convex optimization in non-stationary environments. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1508–1518, 2020.
  • Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 928–936, 2003.
  • We first restate the gradient-variation static regret bound proved by Chiang et al. (2012) as follows.
  • Therefore, by choosing $\eta = \min\{1/(4L), 1/\sqrt{V_T}\}$, OEGD achieves an $\mathcal{O}(\sqrt{V_T})$ static regret (see the sketch below). Note that the unpleasant dependence on $V_T$ in the step-size tuning can be eliminated by the doubling trick (Cesa-Bianchi et al., 1997), because the gradient variation $V_T$ can be evaluated empirically at each iteration.
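  • Below is a minimal sketch of optimistic online gradient descent with last-gradient optimism, the general template behind OEGD's gradient-variation bound (Chiang et al., 2012; Rakhlin and Sridharan, 2013); the fixed step size eta and the unit-ball domain are assumptions for illustration:

    import numpy as np

    def project_ball(x, radius=1.0):
        """Euclidean projection onto a centered ball."""
        norm = np.linalg.norm(x)
        return x if norm <= radius else x * (radius / norm)

    def oegd(grad_fn, T, dim, eta):
        """Play x_t from an auxiliary point using the previous gradient
        as optimism, then update the auxiliary point with the new gradient."""
        x_hat = np.zeros(dim)    # auxiliary sequence
        g_prev = np.zeros(dim)   # optimism: last observed gradient
        for t in range(1, T + 1):
            x = project_ball(x_hat - eta * g_prev)   # optimistic step (played)
            g = grad_fn(t, x)                        # gradient of f_t at x_t
            x_hat = project_ball(x_hat - eta * g)    # auxiliary update
            g_prev = g
            yield x

    # Toy usage: drifting quadratic losses f_t(x) = 0.5 * ||x - z_t||^2.
    rng = np.random.default_rng(1)
    zs = rng.normal(size=(50, 2))
    xs = list(oegd(lambda t, x: x - zs[t - 1], T=50, dim=2, eta=0.1))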
  • Therefore, the meta-regret of VariationHedge is bounded by ...
  • Consequently, the minimal and maximal possible values of the optimal step size are $\eta_{\min} = \ldots$ and $\eta_{\max} = \ldots$; meanwhile, we treat the double logarithmic factor in $T$ as a constant, following previous studies (Adamskiy et al., 2012; Luo and Schapire, 2015). We remark that the bound is a universal dynamic regret bound in that it holds for any sequence of comparators.
  • In this part, we analyze the expert-algorithm of the Sword_small algorithm, namely, online gradient descent (OGD). We will present the proof of the small-loss dynamic regret bound (Theorem 4). Before that, we first restate the small-loss static regret bound (Srebro et al., 2010, Theorem 2) as well as its proof.
  • Theorem 11 (Theorem 2 of Srebro et al. (2010)). Under Assumptions 2, 3, and 4, by choosing any step size $\eta$, ...
  • First, notice that Assumptions 4 and 3 imply that $f_t(\cdot)$ is nonnegative and $L$-smooth. From the self-bounding property of smooth functions (Srebro et al., 2010), as shown in Lemma 4, we have $\|\nabla f_t(x)\|_2 \leq \sqrt{4L f_t(x)}$ for all $x \in \mathcal{X}$.
  • This issue can be easily addressed by the doubling trick (Cesa-Bianchi et al., 1997) or the self-confident tuning (Auer et al., 2002).
  • As a result, the minimal and maximal possible values of the optimal step size are $\eta_{\min} = \ldots$ and $\eta_{\max} = \ldots$; meanwhile, double logarithmic factors in $T$ are treated as constants, following previous studies (Adamskiy et al., 2012; Luo and Schapire, 2015). This completes the proof.
  • On the other hand, by noticing that the online function $d_t$ is strongly convex and exploiting the regret guarantee of Hedge (Cesa-Bianchi and Lugosi, 2006, Proposition 3.1), we have ...
  • Notice that the above terms are essentially the meta-regret of the gradient-variation and small-loss bounds, up to constant factors. Therefore, we can make use of their meta-regret analyses to bound the meta-regret of Sword_best. Specifically, by applying the analysis of Theorem 10, we know that Lemma 1 guarantees the regret bound of OptimisticHedge, which was originally proved by Syrgkanis et al. (2015, Theorem 19). For self-containedness, we present its proof adapted to our notation. Before showing the proof, we need to introduce two related lemmas.
  • The first one is on the property of strongly convex functions (Nesterov, 2018).
  • Besides, by the first-order optimality condition of convex functions, we have $\langle \nabla F(x^*), x - x^* \rangle \geq 0$ for all $x \in \mathcal{X}$. We complete the proof by combining these two inequalities. The second lemma is due to Syrgkanis et al. (2015), and it exploits the stability of the Follow the Regularized Leader (FTRL) algorithm, which updates the decision $x_t$ in the form of $$x_t = \operatorname*{arg\,min}_{x \in \mathcal{X}} \ \varepsilon \langle L_t, x \rangle + R(x),$$ where $L_t$ is the cumulative loss vector and $R(\cdot)$ the regularizer (see the sketch below).
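  • To illustrate this FTRL form, a minimal sketch with the negative-entropy regularizer over the simplex, whose arg min has the closed-form softmax solution and thus recovers Hedge-style weights (the cumulative loss vector is a hypothetical input):

    import numpy as np

    def ftrl_entropy(cum_loss, eps):
        """FTRL step x_t = argmin_{x in simplex} eps * <L_t, x> + sum_i x_i log x_i;
        the minimizer is the softmax of -eps * L_t."""
        logits = -eps * cum_loss
        logits -= logits.max()    # numerical stability
        w = np.exp(logits)
        return w / w.sum()

    print(ftrl_entropy(np.array([2.0, 1.0, 3.0]), eps=0.7))  # toy usage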
  • In this part, we present several technical lemmas used in the proofs. First, we introduce the self-bounding property of smooth functions (Srebro et al., 2010, Lemma 3.1), which is crucial and frequently used in proving problem-dependent bounds for convex and smooth functions.
  • Lemma (Shalev-Shwartz, 2007). For any $x, y, a \geq 0$ satisfying $x - y \leq \sqrt{ax}$, it holds that $x - y \leq a + \sqrt{ay}$ (a short derivation follows).
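  • A short derivation of this lemma (reconstructed here as a sketch, since the extraction dropped the hypothesis):

    $$x - y \le \sqrt{ax} \;\Longrightarrow\; x - \sqrt{a}\sqrt{x} - y \le 0 \;\Longrightarrow\; \sqrt{x} \le \tfrac{1}{2}\big(\sqrt{a} + \sqrt{a + 4y}\big),$$

    hence $x - y \le \sqrt{ax} \le \tfrac{1}{2}\big(a + \sqrt{a^2 + 4ay}\big) \le \tfrac{1}{2}\big(a + a + 2\sqrt{ay}\big) = a + \sqrt{ay}$, where the last step uses $\sqrt{a^2 + 4ay} \le a + 2\sqrt{ay}$.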
  • We consider two cases by noting that $x^* = \Pi_{\mathcal{X}}[c - \nabla]$: (1) if $c - \nabla \in \mathcal{X}$, then $\langle u - x^*, (c - \nabla) - x^* \rangle = 0$ clearly satisfies (67); (2) if $c - \nabla \notin \mathcal{X}$, the Pythagorean theorem (Hazan, 2016, Theorem 2.1) implies (67).