# On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems.

ICML 2020, (2021)


Abstract

We consider nonconvex-concave minimax problems, $\min_{x} \max_{y\in\mathcal{Y}} f(x, y)$, where $f$ is nonconvex in $x$ but concave in $y$. The standard algorithm for solving this problem is the celebrated gradient descent ascent (GDA) algorithm, which has been widely used in machine learning, control theory and economics. However, des...

Introduction
• The authors consider the following smooth minimax optimization problem:

$$\min_{x \in \mathbb{R}^m} \max_{y \in \mathcal{Y}} f(x, y), \tag{1.1}$$

where $f : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ is nonconvex in $x$ but concave in $y$ and where $\mathcal{Y}$ is a convex set.
• One of the simplest candidates for solving problem (1.1) is the natural generalization of gradient descent (GD) known as gradient descent ascent (GDA).
• At each iteration, this algorithm performs gradient descent over the variable x with the stepsize ηx and gradient ascent over the variable y with the stepsize ηy.
• On the positive side, when the objective function f is convex-concave in a pair of
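The update described above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the objective f(x, y) = y·sin(x) − y²/2, the set Y = [−1, 1], the step sizes and the iteration count are all assumptions chosen so that the example runs quickly, and the simultaneous update (both gradients taken at the previous iterate) is one common reading of GDA.

```python
import math


def grad_x(x, y):
    """Partial derivative of the toy objective f(x, y) = y*sin(x) - y**2/2 in x."""
    return y * math.cos(x)


def grad_y(x, y):
    """Partial derivative of the toy objective in y."""
    return math.sin(x) - y


def project(y, lo=-1.0, hi=1.0):
    """Euclidean projection onto the assumed convex set Y = [lo, hi]."""
    return max(lo, min(hi, y))


def two_time_scale_gda(x0, y0, eta_x, eta_y, num_iters):
    """Gradient descent in x and projected gradient ascent in y, with eta_x != eta_y."""
    x, y = x0, y0
    for _ in range(num_iters):
        gx, gy = grad_x(x, y), grad_y(x, y)  # both gradients use the previous iterate
        x = x - eta_x * gx                   # descent step on x with stepsize eta_x
        y = project(y + eta_y * gy)          # ascent step on y with stepsize eta_y
    return x, y


def grad_phi(x):
    """Gradient of Phi(x) = max_{y in Y} f(x, y), which equals sin(x)**2 / 2 here."""
    return math.sin(x) * math.cos(x)


# Illustrative step sizes: the y-player moves ten times faster than the x-player.
x_final, y_final = two_time_scale_gda(x0=1.0, y0=0.0, eta_x=0.05, eta_y=0.5, num_iters=2000)
print("|grad Phi(x)| =", abs(grad_phi(x_final)))
```

For this toy objective the inner maximizer is y*(x) = sin(x), so the quality of the returned x can be checked directly through the gradient of Φ. The smaller stepsize on x is what makes the scheme "two-time-scale"; setting eta_x equal to eta_y recovers the classical variant.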
Highlights
• We consider the following smooth minimax optimization problem:

$$\min_{x \in \mathbb{R}^m} \max_{y \in \mathcal{Y}} f(x, y), \tag{1.1}$$

where $f : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ is nonconvex in $x$ but concave in $y$ and where $\mathcal{Y}$ is a convex set
• This paper presents an affirmative answer to this question, providing nonasymptotic complexity results for two-time-scale gradient descent ascent and stochastic GDA in two settings
• We present the complexity results for two-time-scale gradient descent ascent and stochastic GDA in the setting of nonconvex-strongly-concave minimax problems
• We present the complexity results for two-time-scale gradient descent ascent and stochastic GDA in the nonconvex-concave minimax setting
• With $\Delta_\Phi = \Phi_{1/2\ell}(x_0) - \min_x \Phi_{1/2\ell}(x)$ and $\Delta_0 = \Phi(x_0) - f(x_0, y_0)$, we present complexity results for two-time-scale gradient descent ascent and stochastic GDA algorithms
• We have shown that two-time-scale gradient descent ascent and stochastic GDA return an $\epsilon$-stationary point in $O(\kappa^2\epsilon^{-2})$ gradient evaluations and $O(\kappa^3\epsilon^{-4})$ stochastic gradient evaluations in the nonconvex-strongly-concave case, and $O(\epsilon^{-6})$ gradient evaluations and $O(\epsilon^{-8})$ stochastic gradient evaluations in the nonconvex-concave case
Methods
• The authors present several empirical results to show that two-time-scale GDA outperforms GDmax.
• [Figure panels: (a) MNIST, (b) Fashion-MNIST, (c) CIFAR-10; x-axis: number of gradient oracles.]
• As demonstrated in Sinha et al. [2018], the authors often choose $\gamma > 0$ sufficiently large such that $\ell(x, y_i) - \gamma\|y_i - \xi_i\|^2$ is strongly concave.
• The authors mainly follow the setting of Sinha et al. [2018] and consider training a neural network classifier on three datasets: MNIST, Fashion-MNIST, and CIFAR-10, with the default cross validation.
• Two-time-scale GDA is denoted as
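One way to see why a sufficiently large γ yields strong concavity is the short calculation below. It is a sketch under an extra assumption that this summary does not state: the loss is twice differentiable in its second argument, with Hessian bounded above by $L_y I$ (the symbol $L_y$ is introduced here for illustration only).

$$\nabla_{y_i}^2\left[\ell(x, y_i) - \gamma\|y_i - \xi_i\|^2\right] = \nabla_{y_i}^2 \ell(x, y_i) - 2\gamma I \preceq (L_y - 2\gamma) I.$$

Under that assumption the penalized term is strongly concave in $y_i$ with modulus $2\gamma - L_y$ whenever $\gamma > L_y/2$.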
Results
• The authors present complexity results for two-time-scale GDA and SGDA in the setting of nonconvex-strongly-concave and nonconvex-concave minimax problems.

The algorithmic schemes that the authors study are extremely simple and are presented in Algorithms 1 and 2.
• Classical GDA and SGDA assume that ηx = ηy, and the last iterate is only known to converge in strongly convex-concave problems [Liang and Stokes, 2018].
• Two-time-scale GDA and SGDA were shown to be locally convergent and practical in training GANs [Heusel et al., 2017].
Conclusion
• The authors have shown that two-time-scale GDA and SGDA return an $\epsilon$-stationary point in $O(\kappa^2\epsilon^{-2})$ gradient evaluations and $O(\kappa^3\epsilon^{-4})$ stochastic gradient evaluations in the nonconvex-strongly-concave case, and $O(\epsilon^{-6})$ gradient evaluations and $O(\epsilon^{-8})$ stochastic gradient evaluations in the nonconvex-concave case.
• These two algorithms are provably efficient in these settings.
• In future work the authors aim to derive a lower bound for the complexity of first-order algorithms in nonconvex-concave minimax problems.
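To get a feel for how far apart the four bounds are, the leading terms can be evaluated for sample values. The snippet below is only an order-of-magnitude illustration: big-O notation hides constants and other problem-dependent factors, the function name is hypothetical, and the values κ = 10 and ε = 0.1 are arbitrary choices.

```python
def gradient_call_orders(kappa, inv_eps):
    """Evaluate the leading terms of the four bounds, ignoring hidden constants.

    inv_eps is 1/epsilon, passed as an integer so the arithmetic stays exact.
    """
    return {
        "GDA, nonconvex-strongly-concave": kappa**2 * inv_eps**2,
        "SGDA, nonconvex-strongly-concave": kappa**3 * inv_eps**4,
        "GDA, nonconvex-concave": inv_eps**6,
        "SGDA, nonconvex-concave": inv_eps**8,
    }


orders = gradient_call_orders(kappa=10, inv_eps=10)  # epsilon = 0.1
for setting, order in orders.items():
    print(f"{setting}: ~{order:.0e}")
```

With these sample values the leading terms are 10^4, 10^7, 10^6 and 10^8 respectively, so the stochastic variants and the loss of strong concavity each cost several orders of magnitude in this rough accounting.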
Objectives
• The authors start by defining a local surrogate for the global minimum of $\Phi$.
• A common surrogate in nonconvex optimization is the notion of stationarity, which is appropriate if $\Phi$ is differentiable.
• Definition 3.3: A point $x$ is an $\epsilon$-stationary point ($\epsilon \geq 0$) of a differentiable function $\Phi$ if $\|\nabla\Phi(x)\| \leq \epsilon$.
• Definition 3.3 is sufficient for the nonconvex-strongly-concave minimax problem, since $\Phi(\cdot) = \max_{y\in\mathcal{Y}} f(\cdot, y)$ is differentiable in that setting.
• A function $\Phi$ is not necessarily differentiable for the general nonconvex-concave minimax problem, even if $f$ is Lipschitz and smooth.
• A weaker condition that the authors make use of is the following
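The weaker condition itself is not reproduced in this summary. As a hedged pointer, the notation $\Phi_{1/2\ell}$ used in the Highlights matches the textbook Moreau envelope with parameter $1/2\ell$; the definitions below are the standard ones, and reading the missing condition as a bound on the envelope's gradient is an assumption rather than something this page states.

$$\Phi_{1/2\ell}(x) = \min_{w}\left\{\Phi(w) + \ell\|w - x\|^2\right\}, \qquad \mathrm{prox}_{\Phi/2\ell}(x) = \arg\min_{w}\left\{\Phi(w) + \ell\|w - x\|^2\right\},$$

$$\nabla\Phi_{1/2\ell}(x) = 2\ell\left(x - \mathrm{prox}_{\Phi/2\ell}(x)\right).$$

Under that reading, a point $x$ would count as $\epsilon$-stationary when $\|\nabla\Phi_{1/2\ell}(x)\| \leq \epsilon$, which remains meaningful even when $\Phi$ itself is not differentiable.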
Tables
• Table 1: The gradient complexity of all algorithms for nonconvex-(strongly)-concave minimax problems. $\epsilon$ is a tolerance and $\kappa > 0$ is a condition number. The result denoted by ⋆ refers to the complexity bound after translating from $\epsilon$-stationary point of $f$ to our optimality measure; see Propositions 4.11 and 4.12. The result denoted by ◦ is not presented explicitly but easily derived by standard arguments.
Related work
• Convex-concave setting. Historically, an early concrete instantiation of problem (1.1) involved computing a pair of probability vectors $(x, y)$, or equivalently solving $\min_{x\in\Delta^m} \max_{y\in\Delta^n} x^\top A y$ for a matrix $A \in \mathbb{R}^{m\times n}$ and probability simplices $\Delta^m$ and $\Delta^n$. This bilinear minimax problem together with von Neumann's minimax theorem [Neumann, 1928] was a cornerstone in the development of game theory. A general algorithm scheme was developed for solving this problem in which the min and max players each run a simple learning procedure in tandem [Robinson, 1951]. Sion [1958] generalized von Neumann's result from bilinear games to general convex-concave games, $\min_x \max_y f(x, y) = \max_y \min_x f(x, y)$, and triggered a line of algorithmic research on convex-concave minimax optimization in both continuous time [Kose, 1956, Cherukuri et al., 2017] and discrete time [Uzawa, 1958, Golshtein, 1974, Korpelevich, 1976, Nemirovski, 2004, Nedic and Ozdaglar, 2009, Mokhtari et al., 2019b,a, Azizian et al., 2019]. It is well known that GDA finds an $\epsilon$-approximate stationary point within $O(\kappa^2 \log(1/\epsilon))$ iterations for strongly-convex-strongly-concave problems, and $O(\epsilon^{-2})$ iterations with decaying stepsize for convex-concave games [Nedic and Ozdaglar, 2009, Nemirovski, 2004].
Funding
• This work was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764
References
• S. S. Abadeh, P. M. M. Esfahani, and D. Kuhn. Distributionally robust logistic regression. In NeurIPS, pages 1576–1584, 2015.
• L. Adolphs, H. Daneshmand, A. Lucchi, and T. Hofmann. Local saddle point optimization: A curvature exploitation approach. ArXiv Preprint: 1805.05751, 2018.
• W. Azizian, I. Mitliagkas, S. Lacoste-Julien, and G. Gidel. A tight and unified analysis of extragradient for a whole spectrum of differentiable games. ArXiv Preprint: 1906.05945, 2019.
• D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-player differentiable games. ArXiv Preprint: 1802.05642, 2018.
• T. Basar and G. J. Olsder. Dynamic Noncooperative Game Theory, volume 23. SIAM, 1999.
• M. Benaïm and M. W. Hirsch. Mixed equilibria and dynamical systems arising from fictitious play in perturbed games. Games and Economic Behavior, 29(1-2):36–72, 1999.
• N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
• G. H. G. Chen and R. T. Rockafellar. Convergence rates in forward–backward splitting. SIAM Journal on Optimization, 7(2):421–444, 1997.
• A. Cherukuri, B. Gharesifard, and J. Cortes. Saddle-point dynamics: conditions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization, 55(1):486–511, 2017.
• C. Daskalakis and I. Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. ArXiv Preprint: 1807.04252, 2018a.
• C. Daskalakis and I. Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In NeurIPS, pages 9236–9246, 2018b.
• C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. ArXiv Preprint: 1711.00141, 2017.
• D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.
• D. Drusvyatskiy and A. S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43(3):919–948, 2018.
• S. S. Du and W. Hu. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. ArXiv Preprint: 1802.01504, 2018.
• E. G. Golshtein. Generalized gradient method for finding saddle points. Matekon, 10(3):36–52, 1974.
• I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.
• P. Grnarova, K. Y. Levy, A. Lucchi, T. Hofmann, and A. Krause. An online learning approach to generative adversarial networks. In ICLR.
• M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, pages 6626–6637, 2017.
• C. H. Hommes and M. I. Ochea. Multiple equilibria and limit cycles in evolutionary games with logit dynamics. Games and Economic Behavior, 74(1):434–441, 2012.
• C. Jin, P. Netrapalli, and M. I. Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. ArXiv Preprint: 1902.00618, 2019.
• M. I. Jordan. Artificial intelligence: the revolution hasn't happened yet. Medium. Vgl. Ders. (2018): Perspectives and Challenges. Presentation SysML, 2018.
• A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
• W. Kong and R. D. C. Monteiro. An accelerated inexact proximal point method for solving nonconvex-concave min-max problems. ArXiv Preprint: 1905.13433, 2019.
• G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
• T. Kose. Solutions of saddle value problems by differential equations. Econometrica, Journal of the Econometric Society, pages 59–70, 1956.
• T. Liang and J. Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. ArXiv Preprint: 1802.06132, 2018.
• Q. Lin, M. Liu, H. Rafique, and T. Yang. Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. ArXiv Preprint: 1810.10207, 2018.
• S. Lu, I. Tsaknakis, M. Hong, and Y. Chen. Hybrid block successive approximation for one-sided nonconvex min-max problems: Algorithms and applications. ArXiv Preprint: 1902.08294, 2019.
• A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. ArXiv Preprint: 1706.06083, 2017.
• G. Mateos, J. A. Bazerque, and G. B. Giannakis. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10):5262–5276, 2010.
• E. V. Mazumdar, M. I. Jordan, and S. S. Sastry. On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. ArXiv Preprint: 1901.00838, 2019.
• P. Mertikopoulos, C. Papadimitriou, and G. Piliouras. Cycles in adversarial regularized learning. In SODA, pages 2703–2717. SIAM, 2018.
• P. Mertikopoulos, B. Lecouat, H. Zenati, C-S Foo, V. Chandrasekhar, and G. Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra(-gradient) mile. In ICLR, 2019.
• A. Mokhtari, A. Ozdaglar, and S. Pattathil. Proximal point approximations achieving a convergence rate of O(1/k) for smooth convex-concave saddle point problems: Optimistic gradient and extra-gradient methods. ArXiv Preprint: 1906.01115, 2019a.
• A. Mokhtari, A. Ozdaglar, and S. Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. ArXiv Preprint: 1901.08511, 2019b.
• H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In NeurIPS, pages 2208–2216, 2016.
• A. Nedic and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205–228, 2009.
• A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
• Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
• J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
• N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani. Algorithmic Game Theory. Cambridge University Press, 2007.
• M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In NeurIPS, pages 14905–14916, 2019.
• H. Rafique, M. Liu, Q. Lin, and T. Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. ArXiv Preprint: 1810.02060, 2018.
• J. Robinson. An iterative method of solving a game. Annals of Mathematics, pages 296–301, 1951.
• R. T. Rockafellar. Convex Analysis. Princeton University Press, 2015.
• M. Sanjabi, J. Ba, M. Razaviyayn, and J. D. Lee. On the convergence and robustness of training GANs with regularized optimal transport. In NeurIPS, pages 7091–7101, 2018.
• J. Shamma. Cooperative Control of Distributed Multi-agent Systems. John Wiley & Sons, 2008.
• A. Sinha, H. Namkoong, and J. Duchi. Certifiable distributional robustness with principled adversarial training. In ICLR, 2018.
• M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.
• K. K. Thekumparampil, P. Jain, P. Netrapalli, and S. Oh. Efficient algorithms for smooth minimax optimization. In NeurIPS, pages 12659–12670, 2019.
• H. Uzawa. Iterative methods for concave programming. Studies in Linear and Nonlinear Programming, 6:154–165, 1958.
• J. Von Neumann and O. Morgenstern. Theory of Games and Economic Behavior (Commemorative Edition). Princeton University Press, 2007.
• H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485–1510, 2009.
• Since $f$ is $\ell$-smooth, $f(x, y) + (\ell/2)\|x\|^2$ is convex in $x$ for any $y \in \mathcal{Y}$. Since $\mathcal{Y}$ is bounded, Danskin's theorem [Rockafellar, 2015] implies that $\Psi(x)$ is convex. Putting these pieces together yields that $\Phi(w) + \ell\|w - x\|^2$ is $(\ell/2)$-strongly convex. This implies that $\Phi_{1/2\ell}(x)$ and $\mathrm{prox}_{\Phi/2\ell}(x)$ are well-defined. Furthermore, by the definition of $\mathrm{prox}_{\Phi/2\ell}(x)$, we have
• Since $y^*(x)$ is unique and $\mathcal{Y}$ is convex and bounded, we conclude from Danskin's theorem [Rockafellar, 2015] that $\Phi$ is differentiable with $\nabla\Phi(x) = \nabla_x f(x, y^*(x))$. Since $\nabla\Phi(x) = \nabla_x f(x, y^*(x))$, we have
• Since $f(x, \cdot)$ is $\mu$-strongly-concave over $\mathcal{Y}$, the global error bound condition holds [Drusvyatskiy and Lewis, 2018] and $\mu\|y - y^*(x)\| \leq \ell\|P_{\mathcal{Y}}(y + (1/\ell)\nabla_y f(x, y)) - y\| \leq \epsilon/\kappa$.
• The required number of gradient evaluations is $O(\epsilon^{-2})$ [Mokhtari et al., 2019a]. This argument also holds when applying the stochastic mirror-prox algorithm, and the required number of stochastic gradient evaluations is $O(\epsilon^{-4})$ [Juditsky et al., 2011].
• 2. Taking an expectation of both sides of the above equality, conditioned on $(x_{t-1}, y_{t-1})$, together with Lemma A.2 yields that
• For the sake of completeness, we present GDmax and SGDmax in Algorithms 3 and 4. For any given $x_t \in \mathbb{R}^m$, the max-oracle approximately solves $\max_{y\in\mathcal{Y}} f(x_t, y)$ at each iteration. Although GDmax and SGDmax are easier to understand, they have two disadvantages compared with two-time-scale GDA and SGDA: 1) Both GDmax and SGDmax are nested-loop algorithms. Since it is difficult to pre-determine the number of iterations for the inner loop, these algorithms are not favorable in practice; 2) In the general setting where $f(x, \cdot)$ is nonconcave, GDmax and SGDmax are inapplicable as we cannot efficiently solve the maximization problem to a global optimum. Nonetheless, we present the complexity bound for GDmax and SGDmax for the sake of completeness. Note that a portion of the results have been derived before [Jin et al., 2019, Nouiehed et al., 2019] and our proof depends on the same techniques.
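The nested-loop structure attributed to GDmax can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 3: the toy objective f(x, y) = y·sin(x) − y²/2 on Y = [−1, 1], the use of projected gradient ascent as the max-oracle, the warm start of the inner loop, and all step sizes and iteration counts are assumptions made for the example.

```python
import math


def grad_x(x, y):
    """Partial derivative of the toy objective f(x, y) = y*sin(x) - y**2/2 in x."""
    return y * math.cos(x)


def grad_y(x, y):
    """Partial derivative of the toy objective in y."""
    return math.sin(x) - y


def project(y, lo=-1.0, hi=1.0):
    """Euclidean projection onto the assumed set Y = [lo, hi]."""
    return max(lo, min(hi, y))


def max_oracle(x, y_init, eta_y, inner_iters):
    """Approximate max over y of f(x, y) with a fixed number of projected ascent steps."""
    y = y_init
    for _ in range(inner_iters):
        y = project(y + eta_y * grad_y(x, y))
    return y


def gdmax(x0, y0, eta_x, eta_y, outer_iters, inner_iters):
    """Nested loop: call the max-oracle for the current x, then take one descent step."""
    x, y = x0, y0
    for _ in range(outer_iters):
        y = max_oracle(x, y, eta_y, inner_iters)  # inner loop, warm-started at previous y
        x = x - eta_x * grad_x(x, y)              # outer descent step on x
    return x, y


x_final, y_final = gdmax(x0=1.0, y0=0.0, eta_x=0.2, eta_y=0.5, outer_iters=200, inner_iters=20)
# For this toy f, Phi(x) = sin(x)**2 / 2, so its gradient is sin(x) * cos(x).
print("|grad Phi(x)| =", abs(math.sin(x_final) * math.cos(x_final)))
```

In this sketch every outer step costs inner_iters + 1 gradient evaluations, and inner_iters has to be fixed in advance, which is the practical difficulty with nested loops noted above. A single-loop scheme avoids that choice by taking one ascent step per descent step.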
Author