# Near-Optimal Algorithms for Minimax Optimization

COLT, pp. 2738-2779, 2020.

Abstract:

This paper resolves a longstanding open question pertaining to the design of near-optimal first-order algorithms for smooth and strongly-convex-strongly-concave minimax problems. Current state-of-the-art first-order algorithms find an approximate Nash equilibrium using $\tilde{O}(\kappa_{\mathbf x}+\kappa_{\mathbf y})$ or $\tilde{O}(\mi…

Introduction
• Let Rm and Rn be finite-dimensional Euclidean spaces and let the function f : Rm × Rn → R be smooth.
• The best known convergence rate in a general convex-concave setting is O(1/ε) in terms of the duality gap, which can be achieved by Nemirovski's mirror-prox algorithm [Nemirovski, 2004], Nesterov's dual extrapolation algorithm [Nesterov, 2007], or Tseng's accelerated proximal gradient algorithm [Tseng, 2008].
• This rate is known to be optimal for the class of smooth convex-concave problems [Ouyang and Xu, 2019].
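The O(1/ε)-type convex-concave methods cited above can be illustrated with a minimal Euclidean extragradient sketch (mirror-prox in the Euclidean case) on a toy bilinear problem; the step size and iteration count here are illustrative choices, not parameters from the paper.

```python
# Euclidean extragradient on the toy bilinear problem
# min_x max_y f(x, y) = x * y, whose unique saddle point is (0, 0).
# Plain gradient descent-ascent cycles on this problem; the extra
# extrapolation step restores convergence in the convex-concave setting.
def extragradient(x0, y0, eta=0.1, iters=2000):
    x, y = x0, y0
    for _ in range(iters):
        xh = x - eta * y      # extrapolation step: grad_x f(x, y) = y
        yh = y + eta * x      #                     grad_y f(x, y) = x
        x = x - eta * yh      # full step using gradients at (xh, yh)
        y = y + eta * xh
    return x, y

x, y = extragradient(1.0, 1.0)
print(abs(x) + abs(y) < 1e-3)  # converged to the saddle point (0, 0)
```

For this bilinear example the extragradient iteration matrix has spectral radius strictly below 1, while plain gradient descent-ascent has spectral radius above 1 for any positive step size.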
Highlights
• Let Rm and Rn be finite-dimensional Euclidean spaces and let the function f : Rm × Rn → R be smooth
• The theoretical study of solutions of problem (1) has been a focus of several decades of research in mathematics, statistics, economics and computer science [Basar and Olsder, 1999, Nisan et al., 2007, Von Neumann and Morgenstern, 2007, Facchinei and Pang, 2007, Berger, 2013]. This line of research has become increasingly relevant to algorithmic machine learning, with applications including robustness in adversarial learning [Goodfellow et al., 2014, Sinha et al., 2018], prediction and regression problems [Cesa-Bianchi and Lugosi, 2006, Xu et al., 2009] and distributed computing [Shamma, 2008, Mateos et al., 2010]
• Our algorithm extends to the general convex-concave setting, achieving a gradient complexity of O(1/ε), which matches the lower bound of Ouyang and Xu [2019] as well as the best existing upper bounds [Nemirovski, 2004, Nesterov, 2007, Tseng, 2008] up to logarithmic factors.
• We conclude that APPA has a unique advantage over Accelerated Gradient Descent in settings where g does not have a smoothness property but the proximal step (4) is easy to solve. These settings include LASSO [Beck and Teboulle, 2009], as well as minimax optimization problems
• This paper has provided the first set of near-optimal algorithms for strongly-convex-concave minimax optimization problems and the state-of-the-art algorithms for nonconvex-concave minimax optimization problems.
• For the former class of problems, our algorithms match the lower complexity bound for first-order algorithms [Ouyang and Xu, 2019, Ibrahim et al, 2019, Zhang et al, 2019] up to logarithmic factors
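The advantage noted above, that APPA needs only an easily solvable proximal step rather than smoothness of g, can be shown with a minimal 1-D sketch. This is a hypothetical illustration, not the paper's APPA: for the LASSO-style penalty g(x) = |x|, each proximal step has a closed-form soft-thresholding solution.

```python
# Minimal 1-D proximal point sketch for a nonsmooth objective g(x) = |x|:
# the proximal step is solved exactly in closed form (soft-thresholding),
# so the scheme runs even though g has no smoothness property.
# The step size lam is an arbitrary illustrative choice.
def soft_threshold(v, t):
    # argmin_w |w| + (1/(2t)) * (w - v)^2
    m = max(abs(v) - t, 0.0)
    return m if v >= 0 else -m

def proximal_point(x0, lam=0.25, iters=20):
    x = x0
    for _ in range(iters):
        x = soft_threshold(x, lam)   # each proximal step solved exactly
    return x

print(proximal_point(3.0))  # → 0.0, the minimizer of |x|
```

Each iteration moves the iterate a fixed distance lam toward the minimizer and then stays there, which is the hallmark of an exact proximal step on this penalty.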
Conclusion
• This paper has provided the first set of near-optimal algorithms for strongly-convex-concave minimax optimization problems and the state-of-the-art algorithms for nonconvex-concave minimax optimization problems.
• For the former class of problems, the algorithms match the lower complexity bound for first-order algorithms [Ouyang and Xu, 2019, Ibrahim et al., 2019, Zhang et al., 2019] up to logarithmic factors.
• Despite several striking results on lower complexity bounds for nonconvex smooth problems [Carmon et al., 2019a,b], this problem remains challenging, as solving it requires a new construction of “chain-style” functions and resisting oracles.
Tables
• Table 1: Comparison of gradient complexities to find an ε-saddle point (Definition 3.4) in the convex-concave setting. This table highlights only the dependence on the error tolerance ε and the strong-convexity and strong-concavity condition numbers κx and κy
• Table 2: Comparison of gradient complexities to find an ε-stationary point of f (Definition 3.5) or an ε-stationary point of Φ(·) := max_{y∈Y} f(·, y) (Definitions A.1, A.5) in the nonconvex-concave settings. This table highlights only the dependence on the tolerance ε and the condition number κy
Related work
Funding
• This work was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764.
Reference
• To the best of our knowledge, the earliest algorithmic schemes for solving the bilinear minimax problem, minx∈∆m maxy∈∆n x⊤Ay, date back to Brown’s fictitious play [Brown, 1951] and Dantzig’s simplex method [Dantzig, 1998]. This problem can also be solved by Korpelevich’s extragradient (EG) algorithm [Korpelevich, 1976], which can be shown to be linearly convergent when A is square and full rank [Tseng, 1995]. There are also several recent papers studying the convergence of EG and its variants; see Chambolle and Pock [2011], Malitsky [2015], Yadav et al. [2018] for reflected gradient descent ascent, Daskalakis et al. [2018], Mokhtari et al. [2019b,a] for optimistic gradient descent ascent (OGDA) and Rakhlin and Sridharan [2013a,b], Mertikopoulos et al. [2019], Chavdarova et al. [2019], Hsieh et al. [2019], Mishchenko et al. [2019] for other variants. In the bilinear setting, Daskalakis et al. [2018] es-
• He and Monteiro [2016] and Kolossoski and Monteiro [2017] proved that such results also hold when X, Y are unbounded or the space is non-Euclidean. Chen et al. [2014, 2017] generalized Nesterov’s technique to develop optimal algorithms for solving a class of stochastic saddle point problems and stochastic monotone variational inequalities. For a class of purely bilinear games where g and h are zero functions, Azizian et al. [2020] demonstrated that linear convergence is possible for several algorithms and that their new algorithm achieves the tight bound. The second case is the so-called affinely constrained smooth convex problem, i.e., min_{x∈X} g(x), s.t. Ax = u. Esser et al. [2010] proposed an O(1/ε) primal-dual algorithm while Lan and Monteiro [2016] provided a first-order augmented Lagrangian method with the same O(1/ε) rate. By exploiting the structure, Ouyang et al. [2015] proposed a near-optimal algorithm in this setting. For strongly-convex-concave minimax problems, the best known general lower bound for first-order algorithms is Ω(√(κx/ε)), as shown by Ouyang and Xu [2019]. Several papers have studied strongly-convex-concave minimax problems with additional structure. This includes optimizing a strongly convex function with linear constraints [Goldstein et al., 2014, Xu and Zhang, 2018, Xu, 2019], the case where x and y are connected only through a bilinear term x⊤Ay [Nesterov, 2005, Chambolle and Pock, 2016, Xie and Shi, 2019], and the case where f(x, ·) is linear for each x ∈ R^m [Juditsky and Nemirovski, 2011, Hamedani and Aybat, 2018, Zhao, 2019]. The algorithms developed in these works were all guaranteed to return an ε-saddle point with a gradient complexity of O(1/√ε), and some of them even achieve a near-optimal gradient complexity of O(√(κx/ε)) [Nesterov, 2005, Chambolle and Pock, 2016]. However, the best known upper complexity bound for general strongly-convex-concave minimax problems is O(κx/√ε), which was shown using the dual implicit accelerated gradient algorithm [Thekumparampil et al., 2019].
• Convex-concave setting: we assume that f(·, y) is convex for each y ∈ Y and f(x, ·) is concave for each x ∈ X. Here X and Y are both convex and bounded. Under these conditions, Sion’s minimax theorem [Sion, 1958] guarantees that max_{y∈Y} min_{x∈X} f(x, y) = min_{x∈X} max_{y∈Y} f(x, y).
• Nonconvex-concave setting: we only assume that f(x, ·) is concave for each x ∈ Rm. The function f(·, y) may be nonconvex for some y ∈ Y. Here X is convex but possibly unbounded while Y is convex and bounded. In general, finding a global Nash equilibrium of f is intractable since in the special case where Y has only a single element, the problem reduces to nonconvex optimization, in which finding a global minimum is already NP-hard [Murty and Kabadi, 1987]. Similar to the literature on nonconvex constrained optimization, we opt to find local surrogates, namely stationary points, whose gradient mappings are zero. Formally, we define our optimality criterion as follows.
• Nesterov’s Accelerated Gradient Descent (AGD) dates back to Nesterov’s seminal paper [Nesterov, 1983], where it is shown to be optimal among all first-order algorithms for smooth convex functions [Nesterov, 2018]. We present a version of AGD in Algorithm 1 which is frequently used to minimize an l-smooth and μ-strongly convex function g over a convex set X. The key steps of the AGD algorithm are Lines 5-6, where Line 5 performs a projected gradient descent step, while Line 6 performs a momentum step, which “overshoots” the iterate in the direction of momentum (xt − xt−1). Line 7 is the stopping condition to ensure that the output achieves the desired optimality.
• We conclude that APPA has a unique advantage over AGD in settings where g does not have a smoothness property but the proximal step (4) is easy to solve. These settings include LASSO [Beck and Teboulle, 2009], as well as minimax optimization problems (as we show in later sections).
• Theorem 4.2 claims that Algorithm 3 finds ε-saddle points of strongly-convex-strongly-concave functions with a gradient complexity of O(κx√κy). This rate does not match the lower bound Ω(√(κxκy)) [Ibrahim et al., 2019, Zhang et al., 2019]. At a high level, it takes AGD O(√κx) steps to solve the inner minimization problem and compute Ψ(y) := min_{x∈X} g(x, y). Despite the fact that the function g is l-smooth, the function Ψ has a much larger smoothness constant, which is the source of the extra √κx factor.
• Theorem 5.1 asserts that Algorithm 4 finds ε-saddle points in O(√(κxκy)) gradient evaluations, matching the lower bound [Ibrahim et al., 2019, Zhang et al., 2019] up to logarithmic factors. At a high level, despite the function Φ having undesirable smoothness properties, APPA minimizes Φ using O(√κx) iterations according to Theorem 4.1; following the discussion in Section 4.2, Maximin-AG2 solves the proximal step in the inner loop.
• Our algorithm for nonconvex-strongly-concave optimization is described in Algorithm 5. Similar to Algorithm 4, we still use our accelerated solver Maximin-AG2 for the same proximal subproblem in the inner loop. The only minor difference is that, in the outer loop, Algorithm 5 only uses the Proximal Point Algorithm (PPA) on the function Φ(·) := max_{y∈Y} f(·, y) without acceleration (or momentum steps). This is due to the fact that gradient descent is already optimal among all first-order algorithms for finding stationary points of smooth nonconvex functions [Carmon et al., 2019a]. Standard acceleration techniques will not help for smooth nonconvex functions. We present the theoretical guarantees for Algorithm 5 in the following theorem.
• This paper has provided the first set of near-optimal algorithms for strongly-convex-(strongly)-concave minimax optimization problems and the state-of-the-art algorithms for nonconvex-(strongly)-concave minimax optimization problems. For the former class of problems, our algorithms match the lower complexity bound for first-order algorithms [Ouyang and Xu, 2019, Ibrahim et al., 2019, Zhang et al., 2019] up to logarithmic factors. For the latter class of problems, our algorithms achieve the best known upper bound. In future research, one important direction is to investigate the lower complexity bound of first-order algorithms for nonconvex-(strongly)-concave minimax problems. Despite several striking results on lower complexity bounds for nonconvex smooth problems [Carmon et al., 2019a,b], this problem remains challenging, as solving it requires a new construction of “chain-style” functions and resisting oracles.
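The two key steps of Algorithm 1 described above, a gradient step followed by a momentum “overshoot”, can be sketched as follows. This is an unconstrained illustration (the projection is the identity, so Line 5 reduces to a plain gradient step) with the classical momentum weight (√κ − 1)/(√κ + 1); it is not the paper's exact pseudocode, which includes projection and a stopping test.

```python
import math

# Sketch of Nesterov's AGD for an l-smooth, mu-strongly convex g on R^n.
# Each iteration: a gradient step (Line 5 of Algorithm 1, without
# projection) followed by a momentum step (Line 6). Names are illustrative.
def agd(grad, x0, l, mu, iters=300):
    kappa = l / mu
    beta = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
    x_prev = list(x0)
    z = list(x0)                                    # extrapolated point
    for _ in range(iters):
        g = grad(z)
        x = [zi - gi / l for zi, gi in zip(z, g)]   # gradient step
        z = [xi + beta * (xi - pi)                  # momentum "overshoot"
             for xi, pi in zip(x, x_prev)]
        x_prev = x
    return x_prev

# g(v) = 5*v[0]^2 + 0.5*v[1]^2 has l = 10, mu = 1 and minimizer (0, 0)
sol = agd(lambda v: [10.0 * v[0], 1.0 * v[1]], [1.0, 1.0], l=10.0, mu=1.0)
print(max(abs(c) for c in sol) < 1e-6)  # converged near the minimizer
```

On this quadratic the contraction factor per iteration is roughly 1 − 1/√κ, which is the accelerated rate that the paper's complexity bounds build on.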
• J. Abernethy, K. A. Lai, and A. Wibisono. Last-iterate convergence rates for min-max optimization. ArXiv Preprint: 1906.02027, 2019. (Cited on page 4.)
• M. Alkousa, D. Dvinskikh, F. Stonyakin, and A. Gasnikov. Accelerated methods for composite nonbilinear saddle point problem. ArXiv Preprint: 1906.03620, 2019. (Cited on pages 1, 2, 3, and 5.)
• A. Auslender and M. Teboulle. Interior projection-like methods for monotone variational inequalities. Mathematical programming, 104(1):39–68, 2005. (Cited on page 4.)
• W. Azizian, D. Scieur, I. Mitliagkas, S. Lacoste-Julien, and G. Gidel. Accelerating smooth games by manipulating spectral shapes. ArXiv Preprint: 2001.00602, 2020. (Cited on page 5.)
• T. Basar and G. J. Olsder. Dynamic Noncooperative Game Theory, volume 23. SIAM, 1999. (Cited on page 1.)
• A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Science, 2(1):183–202, 2009. (Cited on page 9.)
• J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, 2013. (Cited on page 1.)
• G. W. Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374–376, 1951. (Cited on page 3.)
• Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. Mathematical Programming, Jun 2019a. ISSN 1436-4646. doi: 10.1007/s10107-019-01406-y. URL https://doi.org/10.1007/s10107-019-01406-y. (Cited on pages 13 and 14.)
• Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points II: first-order methods. Mathematical Programming, Sep 2019b. ISSN 1436-4646. doi: 10.1007/s10107-019-01431-x. URL https://doi.org/10.1007/s10107-019-01431-x. (Cited on page 14.)
• N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
• A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011. (Cited on pages 3 and 4.)
• A. Chambolle and T. Pock. On the ergodic convergence rates of a first-order primal–dual algorithm. Mathematical Programming, 159(1-2):253–287, 2016. (Cited on pages 2 and 5.)
• T. Chavdarova, G. Gidel, F. Fleuret, and S. Lacoste-Julien. Reducing noise in GAN training with variance reduced extragradient. ArXiv Preprint: 1904.08598, 2019. (Cited on page 3.)
• Y. Chen, G. Lan, and Y. Ouyang. Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779–1814, 2014. (Cited on pages 2 and 5.)
• Y. Chen, G. Lan, and Y. Ouyang. Accelerated schemes for a class of variational inequalities. Mathematical Programming, 165(1):113–149, 2017. (Cited on page 5.)
• G. B. Dantzig. Linear Programming and Extensions. Princeton University Press, 1998. (Cited on page 3.)
• C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In ICLR, 2018.
• D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019. (Cited on pages 21, 36, and 40.)
• E. Esser, X. Zhang, and T. F. Chan. A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM Journal on Imaging Sciences, 3(4):1015–1046, 2010.
• F. Facchinei and J-S. Pang. Finite-dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media, 2007. (Cited on page 1.)
• G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial networks. In ICLR, 2019. (Cited on pages 2, 3, and 5.)
• T. Goldstein, B. O’Donoghue, S. Setzer, and R. Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588–1623, 2014. (Cited on page 5.)
• I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014. (Cited on page 1.)
• P. Grnarova, K. Y. Levy, A. Lucchi, T. Hofmann, and A. Krause. An online learning approach to generative adversarial networks. In ICLR, 2018. (Cited on page 6.)
• E. Y. Hamedani and N. S. Aybat. A primal-dual algorithm for general convex-concave saddle point problems. ArXiv Preprint: 1803.01401, 2018. (Cited on pages 2, 3, and 5.)
• Y. He and R. D. C. Monteiro. An accelerated hpe-type algorithm for a class of composite convex-concave saddle-point problems. SIAM Journal on Optimization, 26(1):29–56, 2016. (Cited on page 5.)
• Y-G. Hsieh, F. Iutzeler, J. Malick, and P. Mertikopoulos. On the convergence of single-call stochastic extra-gradient methods. In NeurIPS, pages 6936–6946, 2019. (Cited on page 3.)
• A. Ibrahim, W. Azizian, G. Gidel, and I. Mitliagkas. Lower bounds and conditioning of differentiable games. ArXiv Preprint: 1906.07300, 2019. (Cited on pages 1, 2, 3, 5, 10, 11, and 14.)
• C. Jin, P. Netrapalli, and M. I. Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. ArXiv Preprint: 1902.00618, 2019. (Cited on pages 2, 4, and 5.)
• M. I. Jordan. Artificial intelligence: the revolution hasn’t happened yet. Medium, 2018. (Cited on page 1.)
• A. Juditsky and A. Nemirovski. First order methods for nonsmooth convex large-scale optimization, ii: utilizing problems structure. Optimization for Machine Learning, 30(9):149–183, 2011. (Cited on pages 2, 3, and 5.)
• O. Kolossoski and R. D. C. Monteiro. An accelerated non-euclidean hybrid proximal extragradienttype algorithm for convex-concave saddle-point problems. Optimization Methods and Software, 32 (6):1244–1272, 2017. (Cited on page 5.)
• W. Kong and R. D. C. Monteiro. An accelerated inexact proximal point method for solving nonconvexconcave min-max problems. ArXiv Preprint: 1905.13433, 2019. (Cited on pages 2, 5, and 6.)
• G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976. (Cited on pages 2 and 3.)
• G. Lan and R. D. C. Monteiro. Iteration-complexity of first-order augmented lagrangian methods for convex programming. Mathematical Programming, 155(1-2):511–547, 2016. (Cited on page 5.)
• T. Liang and J. Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In AISTATS, pages 907–915, 2019. (Cited on page 4.)
• T. Lin, C. Jin, and M. I. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. ArXiv Preprint: 1906.00331, 2019. (Cited on pages 2, 4, and 5.)
• S. Lu, I. Tsaknakis, M. Hong, and Y. Chen. Hybrid block successive approximation for one-sided nonconvex min-max problems: algorithms and applications. ArXiv Preprint: 1902.08294, 2019. (Cited on pages 2, 4, and 6.)
• Y. Malitsky. Projected reflected gradient methods for monotone variational inequalities. SIAM Journal on Optimization, 25(1):502–520, 2015. (Cited on page 3.)
• G. Mateos, J. A. Bazerque, and G. B. Giannakis. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10):5262–5276, 2010. (Cited on page 1.)
• P. Mertikopoulos, B. Lecouat, H. Zenati, C-S. Foo, V. Chandrasekhar, and G. Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra(-gradient) mile. In ICLR, 2019. (Cited on page 3.)
• K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtarik, and Y. Malitsky. Revisiting stochastic extragradient. ArXiv Preprint: 1905.11373, 2019. (Cited on page 3.)
• A. Mokhtari, A. Ozdaglar, and S. Pattathil. Proximal point approximations achieving a convergence rate of o(1/k) for smooth convex-concave saddle point problems: Optimistic gradient and extra-gradient methods. ArXiv Preprint: 1906.01115, 2019a. (Cited on page 3.)
• A. Mokhtari, A. Ozdaglar, and S. Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. ArXiv Preprint: 1901.08511, 2019b.
• R. D. C. Monteiro and B. F. Svaiter. On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean. SIAM Journal on Optimization, 20(6):2755–2787, 2010. (Cited on page 4.)
• R. D. C. Monteiro and B. F. Svaiter. Complexity of variants of tseng’s modified fb splitting and korpelevich’s methods for hemivariational inequalities with applications to saddle-point and convex optimization problems. SIAM Journal on Optimization, 21(4):1688–1720, 2011. (Cited on page 4.)
• K. G. Murty and S. N. Kabadi. Some np-complete problems in quadratic and nonlinear programming. Mathematical Programming: Series A and B, 39(2):117–129, 1987. (Cited on pages 7 and 20.)
• H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In NIPS, pages 2208–2216, 2016. (Cited on page 6.)
• A. Nedic and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205–228, 2009. (Cited on page 4.)
• A. Nemirovski. Prox-method with rate of convergence o (1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004. (Cited on pages 2, 3, and 4.)
• Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical programming, 103(1):127– 152, 2005. (Cited on pages 2 and 5.)
• Y. Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 109(2-3):319–344, 2007. (Cited on pages 2, 3, and 4.)
• Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140 (1):125–161, 2013. (Cited on page 7.)
• Y. Nesterov. Lectures on Convex Optimization. Springer, 2018. (Cited on pages 8, 22, 23, 29, 30, 33, and 36.)
• Y. Nesterov and L. Scrimali. Solving strongly monotone variational and quasi-variational inequalities. Available at SSRN 970903, 2006. (Cited on pages 3 and 5.)
• Y. E. Nesterov. A method for solving the convex programming problem with convergence rate o(1/k2). In Dokl. Akad. Nauk Sssr, volume 269, pages 543–547, 1983. (Cited on page 8.)
• N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani. Algorithmic Game Theory. Cambridge University Press, 2007. (Cited on page 1.)
• M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In NeurIPS, pages 14905–14916, 2019. (Cited on pages 2, 4, and 6.)
• D. M. Ostrovskii, A. Lowy, and M. Razaviyayn. Efficient search of first-order nash equilibria in nonconvex-concave smooth min-max problems. ArXiv Preprint: 2002.07919, 2020. (Cited on pages 4 and 6.)
• Y. Ouyang and Y. Xu. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, Aug 2019. ISSN 1436-4646. doi: 10.1007/ s10107-019-01420-0. URL https://doi.org/10.1007/s10107-019-01420-0. (Cited on pages 2, 3, 5, and 14.)
• Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao Jr. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 8(1):644–681, 2015. (Cited on pages 2 and 5.)
• H. Rafique, M. Liu, Q. Lin, and T. Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. ArXiv Preprint: 1810.02060, 2018. (Cited on pages 2, 4, and 5.)
• A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In COLT, pages 993–1019, 2013a. (Cited on page 3.)
• S. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In NIPS, pages 3066–3074, 2013b. (Cited on page 3.)
• R. T. Rockafellar. Convex Analysis, volume 28. Princeton University Press, 1970. (Cited on pages 41 and 42.)
• M. Sanjabi, M. Razaviyayn, and J. D. Lee. Solving non-convex non-concave min-max games under polyak-lojasiewicz condition. ArXiv Preprint: 1812.02878, 2018. (Cited on page 6.)
• J. Shamma. Cooperative Control of Distributed Multi-agent Systems. John Wiley & Sons, 2008. (Cited on page 1.)
• A. Sinha, H. Namkoong, and J. Duchi. Certifiable distributional robustness with principled adversarial training. In ICLR, 2018. (Cited on pages 1 and 6.)
• M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958. (Cited on page 7.)
• K. K. Thekumparampil, P. Jain, P. Netrapalli, and S. Oh. Efficient algorithms for smooth minimax optimization. In NeurIPS, pages 12659–12670, 2019. (Cited on pages 2, 3, 4, and 5.)
• P. Tseng. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 60(1-2):237–252, 1995. (Cited on pages 1, 2, 3, and 5.)
• P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. submitted to SIAM Journal on Optimization, 2:3, 2008. (Cited on pages 2, 3, and 4.)
• J. Von Neumann and O. Morgenstern. Theory of Games and Economic Behavior (Commemorative Edition). Princeton University Press, 2007. (Cited on page 1.)
• Z. Xie and J. Shi. Accelerated primal dual method for a class of saddle point problem with strongly convex component. ArXiv Preprint: 1906.07691, 2019. (Cited on page 5.)
• H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485–1510, 2009. (Cited on page 1.)
• Y. Xu. Iteration complexity of inexact augmented lagrangian methods for constrained convex programming. Mathematical Programming, Aug 2019. ISSN 1436-4646. doi: 10.1007/s10107-019-01425-9. URL https://doi.org/10.1007/s10107-019-01425-9. (Cited on page 5.)
• Y. Xu and S. Zhang. Accelerated primal-dual proximal block coordinate updating methods for constrained convex optimization. Computational Optimization and Applications, 70(1):91–128, 2018.
• A. Yadav, S. Shah, Z. Xu, D. Jacobs, and T. Goldstein. Stabilizing adversarial nets with prediction methods. In ICLR, 2018. (Cited on page 3.)
• J. Zhang, M. Hong, and S. Zhang. On lower iteration complexity bounds for the saddle point problems. ArXiv Preprint: 1912.07481, 2019. (Cited on pages 1, 2, 3, 5, 10, 11, and 14.)
• R. Zhao. Optimal algorithms for stochastic three-composite convex-concave saddle point problems. ArXiv Preprint: 1903.01687, 2019. (Cited on pages 3 and 5.)
• R. Zhao. A primal dual smoothing framework for max-structured nonconvex optimization. ArXiv Preprint: 2003.04375, 2020. (Cited on pages 4 and 6.)
• We present another optimality notion based on Moreau envelope for nonconvex-concave setting in which f (·, y) is not necessarily convex for each y ∈ Y but f (x, ·) is concave for each x ∈ X. For simplicity, we let X = Rm and Y be convex and bounded. In general, finding a global saddle point of f is intractable since solving the special case with a singleton Y globally is already NP-hard [Murty and Kabadi, 1987] as mentioned in the main text.
• This means that finding a sufficiently accurate solution under such optimality notion is as difficult as solving the minimization exactly. Another popular optimality notion is based on the Moreau envelope of Φ when Φ is weakly convex [Davis and Drusvyatskiy, 2019].
• Lemma A.4 (Properties of Moreau envelopes). If the function Φ(·) is l-weakly convex, its Moreau envelope Φ_{1/2l}(·) is 4l-smooth with gradient ∇Φ_{1/2l}(·) = 2l(· − prox_{Φ/2l}(·)), where prox_{Φ/2l}(x) = argmin_{w∈R^m} {Φ(w) + l‖w − x‖²}.
• Part III. We proceed to derive the gradient complexity of the algorithm using the condition in Eq. (14). Since Algorithm 1 is exactly Nesterov’s accelerated gradient descent, standard arguments based on estimate sequences [Nesterov, 2018] bound g(x_t) − min_{x∈X} g(x).
• Since Φ_g is 2κy·l-smooth and μx-strongly convex, the remaining proof is based on a modification of Nesterov’s estimate-sequence techniques [Nesterov, 2018, Section 2.2.5], which define an estimate sequence starting from Γ_0(y).
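The gradient formula in Lemma A.4 can be sanity-checked numerically. The 1-D sketch below uses the convex function Φ(x) = |x| (trivially l-weakly convex), for which prox_{Φ/2l} is soft-thresholding with threshold 1/(2l); the value l = 2 is an arbitrary illustrative choice.

```python
# 1-D numerical check of the Moreau-envelope gradient formula from
# Lemma A.4 with Phi(x) = |x|: prox_{Phi/2l}(x) minimizes |w| + l*(w-x)^2,
# and grad Phi_{1/2l}(x) = 2l * (x - prox_{Phi/2l}(x)).
l = 2.0

def prox(x):
    m = max(abs(x) - 1.0 / (2.0 * l), 0.0)   # soft-threshold at 1/(2l)
    return m if x >= 0 else -m

def envelope(x):
    w = prox(x)
    return abs(w) + l * (w - x) ** 2          # Phi_{1/2l}(x)

def grad_envelope(x):
    return 2.0 * l * (x - prox(x))            # formula from Lemma A.4

x, h = 0.7, 1e-6
fd = (envelope(x + h) - envelope(x - h)) / (2.0 * h)   # finite difference
print(abs(fd - grad_envelope(x)) < 1e-4)
```

For |x| > 1/(2l) the envelope is the shifted absolute value |x| − 1/(4l), so the exact gradient at x = 0.7 is 1, and the finite-difference check agrees with the closed-form formula.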