Reparameterizing Mirror Descent as Gradient Descent

NeurIPS 2020.

Abstract:

Most of the recent successful applications of neural networks have been based on training with gradient descent updates. However, for some small networks, other mirror descent updates learn provably more efficiently when the target is sparse. We present a general framework for casting a mirror descent update as a gradient descent update on a different set of parameters.

Introduction
  • Mirror descent (MD) [Nemirovsky and Yudin, 1983, Kivinen and Warmuth, 1997] refers to a family of updates that transform the parameters w ∈ C from a convex domain C ⊆ Rd via a link function (a.k.a. mirror map) f : C → Rd before applying the descent step.
  • Here ∂f(w(t))/∂t is the time derivative of the link function along the trajectory, and the vanilla discretized MD update is obtained by setting the step size h equal to 1.
  • The CMD update on the parameters w for the convex function F (with link f(w) = ∇F(w)) and loss L(w), d/dt f(w(t)) = −η ∇L(w(t)), coincides with the CMD update on the parameters u for the convex function G (with link g(u) := ∇G(u)) and the composite loss L◦q, d/dt g(u(t)) = −η ∇u (L◦q)(u(t)), where w = q(u); a chain-rule sketch of when the two coincide is given after this list.
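The following is a brief sketch, reconstructed here from the chain rule alone, of when the two continuous-time updates above coincide; the Hessian/Jacobian notation H_F, H_G, J_q is ours and may differ from the paper's.

    \begin{align*}
      &\text{CMD on } u \text{ with link } g = \nabla G \text{ and } w = q(u):\quad
        H_G(u)\,\dot u = -\eta\, J_q(u)^{\top}\nabla L(q(u)) \\
      &\qquad\Longrightarrow\quad \dot w = J_q(u)\,\dot u
        = -\eta\, J_q(u)\,H_G(u)^{-1}J_q(u)^{\top}\,\nabla L(w), \\
      &\text{CMD on } w \text{ with link } f = \nabla F:\quad
        H_F(w)\,\dot w = -\eta\,\nabla L(w)
        \quad\Longrightarrow\quad \dot w = -\eta\,H_F(w)^{-1}\nabla L(w).
    \end{align*}

The two trajectories therefore agree whenever J_q(u) H_G(u)⁻¹ J_q(u)ᵀ = H_F(q(u))⁻¹, and choosing G(u) = ½‖u‖² (so that H_G is the identity) turns the reparameterized update into plain GD on u.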
Highlights
  • We provide a general framework for reparameterizing one continuous-time mirror descent (CMD) update by another
  • We show that the CMD update (1) can be motivated by replacing the Bregman divergence in the minimization problem (3) with a “momentum” version which quantifies the rate of change in the value of the Bregman divergence as w(t) varies over time
  • We show that reparameterizing the tempered updates as gradient descent (GD) updates on the composite loss L◦q changes the implicit bias of GD, making the updates converge to the solution with the smallest L2−τ-norm for arbitrary τ ∈ [0, 1]; the tempered link and reparameterization function are sketched after this list
  • For the underdetermined linear regression problem we showed that under certain conditions, the tempered EGU± updates converge to the minimum L2−τ -norm solution
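For context, here is a short sketch of the tempered machinery in our own notation, derived from the matching condition above and to be checked against the paper's exact statements: tempered EGU is the CMD update for the tempered logarithm link log_τ (as in Naudts [2002]), and the reparameterization function qτ used in the Results below follows from matching GD dynamics on u to the tempered EGU dynamics on w.

    \begin{align*}
      \log_\tau(x) &= \frac{x^{1-\tau}-1}{1-\tau}
        \qquad (\tau = 1:\ \log x,\ \text{i.e. EGU};\quad \tau = 0:\ x-1,\ \text{a shifted identity}),\\
      \text{tempered EGU:}\quad & w^{-\tau}\,\dot w = -\eta\,\nabla L(w)
        \;\Longleftrightarrow\; \dot w = -\eta\, w^{\tau}\,\nabla L(w) \quad\text{(element-wise)},\\
      \text{GD on } u \text{ with } w = q_\tau(u) \text{ matches iff}\quad
        & q_\tau'(u)^2 = q_\tau(u)^{\tau}
        \;\Longrightarrow\; q_\tau(u) = \Big(\tfrac{2-\tau}{2}\,u\Big)^{\frac{2}{2-\tau}},\\
      & \tau = 0:\ q_0(u) = u \ \text{(plain GD, minimum } L_2\text{)},\qquad
        \tau = 1:\ q_1(u) = u^2/4 \ \text{(EGU, minimum } L_1\text{)}.
    \end{align*}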
Results
  • The CMD update on u with the link function g(u) can be written in the NGD (natural gradient descent) form as du/dt = −η HG(u)⁻¹ ∇u (L◦q)(u), where HG denotes the Hessian of G.
  • The authors mainly consider reparameterizing a CMD update with the link function f(w) as a GD update on u, for which HG = Ik; Example 2 (EGU as GD) casts the unnormalized exponentiated gradient update as GD via the element-wise reparameterization w = q(u) = u⊙u/4.
  • The normalized reduced EG update [Warmuth and Jagota, 1998] is motivated by the link function f(w) = log(w/(1−w)).
  • The authors can first apply the inverse reparameterization of the Burg update as GD from Example 4, i.e. u = q⁻¹(w) = log w.
  • The authors extend the reparameterization of the EGU update as GD in Example 2 to the normalized case in terms of a projected GD update.
  • The tempered continuous EGU update can be reparameterized as continuous-time GD with the reparameterization function w = qτ(u) = [(2−τ)/2 · u]^(2/(2−τ)), applied element-wise.
  • The reparameterization of the tempered EGU± updates as GD can be written by applying Proposition 2 (equation (22)): u̇+(t) = −η ∇u+ L(qτ(u+(t)) − qτ(u−(t))) and u̇−(t) = −η ∇u− L(qτ(u+(t)) − qτ(u−(t))).
  • The strong convexity of the Fτ function w.r.t. the L2−τ-norm suggests that the updates motivated by the tempered Bregman divergence (17) yield the minimum L2−τ-norm solution in certain settings.
  • The authors show that the solution of the tempered EGU± satisfies the dual feasibility and complementary slackness KKT conditions of a constrained minimization over (w+, w−) whose objective corresponds to the L2−τ-norm of w = w+ − w−, subject to the linear constraints of the regression problem.
  • Under the assumptions of Theorem 4, the reparameterized tempered EGU± updates (22) recover the minimum L2−τ-norm solution, where w(t) = qτ(u+(t)) − qτ(u−(t)); a toy numerical sketch of these updates follows this list.
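Below is a small, self-contained numerical sketch (ours, not the paper's code) of the reparameterized tempered EGU± updates (22) on a toy underdetermined linear regression, using the qτ written out above and plain gradient steps on (u+, u−). Under the stated assumptions one would expect the recovered solution to have a smaller L1-norm as τ approaches 1, illustrating the claimed implicit bias; the step size, initialization, and clipping below are arbitrary choices for the demo.

    import numpy as np

    def q_tau(u, tau):
        """Tempered reparameterization w = q_tau(u) = ((2 - tau) / 2 * u) ** (2 / (2 - tau))."""
        return ((2.0 - tau) / 2.0 * u) ** (2.0 / (2.0 - tau))

    def dq_tau(u, tau):
        """Element-wise derivative q_tau'(u) = q_tau(u) ** (tau / 2)."""
        return q_tau(u, tau) ** (tau / 2.0)

    def tempered_egu_pm_as_gd(X, y, tau, eta=0.02, steps=20000, u0=0.05):
        """Plain GD on (u_plus, u_minus) for the composite loss L(q_tau(u+) - q_tau(u-))."""
        d = X.shape[1]
        u_plus, u_minus = np.full(d, u0), np.full(d, u0)
        for _ in range(steps):
            w = q_tau(u_plus, tau) - q_tau(u_minus, tau)
            grad_w = X.T @ (X @ w - y) / len(y)                # gradient of the squared loss w.r.t. w
            # chain rule through w = q_tau(u+) - q_tau(u-), clipped to the domain u >= 0 for safety
            u_plus = np.maximum(u_plus - eta * grad_w * dq_tau(u_plus, tau), 0.0)
            u_minus = np.maximum(u_minus + eta * grad_w * dq_tau(u_minus, tau), 0.0)
        return q_tau(u_plus, tau) - q_tau(u_minus, tau)

    rng = np.random.default_rng(0)
    n, d = 10, 50                                              # underdetermined: more unknowns than examples
    X = rng.standard_normal((n, d))
    w_star = np.zeros(d); w_star[:3] = [2.0, -1.5, 1.0]        # sparse target
    y = X @ w_star

    for tau in (0.0, 0.5, 1.0):
        w_hat = tempered_egu_pm_as_gd(X, y, tau)
        print(f"tau={tau:.1f}  residual={np.linalg.norm(X @ w_hat - y):.2e}  "
              f"L1={np.linalg.norm(w_hat, 1):.3f}  L2={np.linalg.norm(w_hat):.3f}")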
Conclusion
  • The authors discussed the continuous-time mirror descent updates and provided a general framework for reparameterizing these updates.
  • For the underdetermined linear regression problem the authors showed that under certain conditions, the tempered EGU± updates converge to the minimum L2−τ -norm solution.
  • The results of the paper suggest that mirror descent updates can be used effectively in neural networks by running backpropagation on the reparameterized form of the neurons; a minimal sketch of such a reparameterized layer follows this list.
  • A key research direction is to find general conditions under which this holds.
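As an illustration of that last point, here is a minimal sketch (ours, not the paper's implementation) of a linear layer whose weights are stored as w = u⊙u/4, so that plain backpropagation and GD on u implement discretized unnormalized exponentiated-gradient dynamics on w. The layer, loss, and hyperparameters are arbitrary choices for the demo, and the chain rule is written out by hand in NumPy to keep the reparameterization explicit.

    import numpy as np

    class EGUReparamLinear:
        """Linear layer with non-negative weights w = u * u / 4, trained by plain GD on u."""

        def __init__(self, d_in, d_out, init=0.5, seed=0):
            rng = np.random.default_rng(seed)
            self.u = init * np.abs(rng.standard_normal((d_in, d_out)))

        @property
        def w(self):
            return self.u * self.u / 4.0            # reparameterized (non-negative) weights

        def forward(self, x):
            self.x = x                               # cache the input for the backward pass
            return x @ self.w

        def backward(self, grad_out, eta=0.1):
            grad_w = self.x.T @ grad_out             # dL/dw from standard backprop
            grad_u = grad_w * (self.u / 2.0)         # chain rule through w = u^2 / 4
            self.u -= eta * grad_u                   # plain GD on u ~ EGU dynamics on w

    # Tiny usage example: fit y = x @ w_target with a squared loss.
    rng = np.random.default_rng(1)
    x = rng.standard_normal((64, 5))
    w_target = np.array([[0.0], [2.0], [0.0], [1.0], [0.0]])
    y = x @ w_target

    layer = EGUReparamLinear(5, 1)
    for _ in range(2000):
        pred = layer.forward(x)
        layer.backward((pred - y) / len(x))          # gradient of 0.5 * mean squared error
    print(np.round(layer.w, 3).ravel())              # should approach [0, 2, 0, 1, 0]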
References
  • Ethan Akin. The Geometry of Population Genetics, volume 31 of Lecture Notes in Biomathematics. Springer-Verlag, Berlin-New York, 1979.
  • Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
  • E. Amid, M. K. Warmuth, R. Anil, and T. Koren. Robust bi-tempered logistic loss based on Bregman divergences. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • Ehsan Amid and Manfred K. Warmuth. Winnowing with gradient descent. In Conference on Learning Theory (COLT), 2020.
  • William L. Burke. Applied Differential Geometry. Cambridge University Press, 1985.
  • Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321–352, 2007.
  • Andrzej Cichocki and Shun-ichi Amari. Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.
  • U. Ghai, E. Hazan, and Y. Singer. Exponentiated gradient meets gradient descent. arXiv preprint arXiv:1902.01903, 2019.
  • Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems (NeurIPS), pages 6151–6159, 2017.
  • Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 9461–9471, 2018.
  • J. Kivinen, M. K. Warmuth, and P. Auer. The Perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97:325–343, December 1997.
  • Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
  • Jyrki Kivinen, Manfred K. Warmuth, and Babak Hassibi. The p-norm generalization of the LMS algorithm for adaptive filtering. IEEE Transactions on Signal Processing, 54(5):1782–1793, 2006.
  • N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
  • Jan Naudts. Deformed exponentials and logarithms in generalized thermostatistics. Physica A, 316:323–334, 2002. URL http://arxiv.org/pdf/cond-mat/0203489.
  • A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, New York, 1983.
  • Jiazhong Nie, Wojciech Kotłowski, and Manfred K. Warmuth. Online PCA with optimal regret. The Journal of Machine Learning Research, 17(1):6022–6070, 2016.
  • Maxim Raginsky and Jake Bouvrie. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In 2012 IEEE 51st Conference on Decision and Control (CDC), pages 6793–6800. IEEE, 2012.
  • Garvesh Raskutti and Sayan Mukherjee. The information geometry of mirror descent. IEEE Transactions on Information Theory, 61(3):1451–1457, 2015.
  • R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
  • William H. Sandholm. Population Games and Evolutionary Dynamics. MIT Press, 2010.
  • Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
  • Tomas Vaskevicius, Varun Kanade, and Patrick Rebeschini. Implicit regularization for optimal sparse recovery. In Advances in Neural Information Processing Systems (NeurIPS), pages 2968–2979, 2019.
  • S. V. N. Vishwanathan and M. K. Warmuth. Leaving the span. In Proceedings of the 18th Annual Conference on Learning Theory (COLT), 2005.
  • M. K. Warmuth and A. Jagota. Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence. In Electronic Proceedings of the Fifth International Symposium on Artificial Intelligence and Mathematics, 1998.
  • M. K. Warmuth, W. Kotłowski, and S. Zhou. Kernelization of matrix updates. Theoretical Computer Science, 558:159–178, 2014. Special issue for the 23rd International Conference on Algorithmic Learning Theory (ALT'12).
  • Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
Footnotes
  • As a result, a Bregman projection [Shalev-Shwartz, 2012] onto the constraint set Cψ = {w ∈ C | ψ(w) = 0} may need to be applied after the update, that is, ws+1 = argmin_{w ∈ Cψ} DF(w, w̃s+1), where w̃s+1 denotes the unprojected iterate.
  • Normalizing as ws+1 = w̃s+1 / ‖w̃s+1‖1 corresponds to the Bregman projection onto the unit simplex using the relative entropy divergence [Kivinen and Warmuth, 1997].
  • Note that in this case the update satisfies the constraint ψ(ws+1) = 0 because the Lagrange multiplier is used directly. For the normalized EG update, this corresponds to the original normalized EG update in [Littlestone and Warmuth, 1994], ws+1 = ws ⊙ exp(−η ∇L(ws)) / Σi ws,i exp(−η [∇L(ws)]i); a quick numerical check of the simplex projection follows this list.
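A quick numerical check of the projection claim above (ours, for illustration): for a positive vector, the relative-entropy Bregman projection onto the probability simplex is just L1 normalization, which is why the normalized EG update can be written as an unnormalized EGU step followed by this projection. The sampling-based comparison below only probes random simplex points rather than solving the projection exactly.

    import numpy as np

    def kl(p, q):
        """Generalized relative entropy sum_i p_i*log(p_i/q_i) - p_i + q_i."""
        return float(np.sum(p * np.log(p / q) - p + q))

    rng = np.random.default_rng(0)
    w_tilde = rng.random(5) + 0.1                      # unnormalized positive iterate
    w_proj = w_tilde / w_tilde.sum()                   # claimed projection: plain L1 normalization

    # Compare against random points on the simplex: the normalized vector should do at least as well.
    candidates = rng.dirichlet(np.ones(5), size=1000)
    best = min(kl(p, w_tilde) for p in candidates)

    print("KL(normalized, w_tilde) =", round(kl(w_proj, w_tilde), 4))
    print("best of 1000 random simplex points =", round(best, 4))   # should be >= the value above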