Stochastic Optimization for Non-convex Inf-Projection Problems

ICML, pp. 10660-10669, 2020.


Abstract:

In this paper, we study a family of non-convex and possibly non-smooth inf-projection minimization problems, in which the target objective function equals the minimization of a joint function over another variable. This class includes difference-of-convex (DC) functions and a family of bi-convex functions as special cases.

Introduction
  • The authors consider a general family of non-convex and possibly non-smooth problems that can be written as follows (a toy instance appears at the end of this section):

    min_{x∈X} F(x) := g(x) + min_{y∈dom(h)} { h(y) − ⟨y, c(x)⟩ },    (1)

    where X ⊆ R^d is a closed convex set, g : X → R is lower semicontinuous, h : dom(h) → R is uniformly convex, and c : X → R^m is a lower-semicontinuous differentiable mapping.
  • Yan Yan, Yi Xu, Lijun Zhang, Xiaoyu Wang, Tianbao Yang
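To make the inf-projection structure in (1) concrete, here is a minimal sketch. It assumes h(y) = ½‖y‖², which is 2-uniformly (i.e., strongly) convex, so the inner problem min_y h(y) − ⟨y, c(x)⟩ has the closed-form minimizer y∗(x) = c(x) and F(x) = g(x) − ½‖c(x)‖²; the concrete choices of g and c below are illustrative assumptions, not taken from the paper. Note that F is non-convex whenever c is non-affine, even though g and h are convex.

```python
import numpy as np

# Toy instance of the inf-projection problem (1), assuming h(y) = 0.5*||y||^2,
# so the inner minimizer is y*(x) = c(x) and F(x) = g(x) - 0.5*||c(x)||^2.
# g and c are illustrative choices, not the paper's.

def g(x):
    return 0.5 * np.sum(x ** 2)              # a simple convex g

def c(x):
    return np.tanh(x)                        # a smooth non-affine mapping c

def F(x):
    y_star = c(x)                            # closed-form inner minimizer
    return g(x) + 0.5 * np.sum(y_star ** 2) - y_star @ c(x)

x = np.array([0.3, -1.2])
print(np.isclose(F(x), g(x) - 0.5 * np.sum(c(x) ** 2)))  # True
```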
Highlights
  • Under the condition that c is Lipschitz continuous, we prove the convergence of a mini-batch stochastic proximal gradient method (MSPG) with increasing mini-batch size that employs parallel stochastic gradient updates for x and y, and we establish its convergence rate (see the sketch after this list)
  • We develop an algorithmic framework that employs a suitable stochastic algorithm for solving strongly convex subproblems in a stagewise manner
  • The novelty and significance of our results are: (i) this is the first work that comprehensively studies the stochastic optimization of a non-smooth non-convex inf-projection problem; (ii) the application to variance-based regularization demonstrates much faster convergence of our algorithms compared with existing algorithms based on a min-max formulation
  • In order to obtain a provable convergence guarantee, we consider a strongly convex problem, for some γ > 0 and x0 ∈ X, whose objective function is an upper bound of the function in (5) at x0
  • Namkoong and Duchi (2016) proposed a min-max formulation based on distributionally robust optimization, given below, and proposed stochastic algorithms for solving it when the loss function ℓ is convex:

    min_θ max_P { Σ_{i=1}^n p_i ℓ(θ, X_i) : D_φ(P ‖ P_n) ≤ ρ },    (18)

    where P = (p_1, …, p_n) is a distribution over the n examples and P_n is the empirical distribution
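Below is a minimal sketch of the MSPG scheme described above, under the assumption that unbiased mini-batch gradient oracles and Euclidean projections are available; all names here (sample_grad_x, proj_x, eta, m0, ...) are hypothetical placeholders rather than the paper's notation.

```python
import numpy as np

def mspg(x0, y0, sample_grad_x, sample_grad_y, proj_x, proj_y,
         eta=0.1, T=100, m0=4):
    """Sketch of mini-batch stochastic proximal gradient (MSPG).

    Performs parallel stochastic gradient updates for x and y on the joint
    objective g(x) + h(y) - <y, c(x)>, with a mini-batch size that increases
    with the iteration counter. sample_grad_x(x, y, m) should return a
    mini-batch stochastic gradient w.r.t. x averaged over m samples
    (similarly for sample_grad_y).
    """
    x, y = np.asarray(x0, float), np.asarray(y0, float)
    for t in range(T):
        m = m0 * (t + 1)                       # increasing mini-batch size
        gx = sample_grad_x(x, y, m)            # stochastic gradient in x
        gy = sample_grad_y(x, y, m)            # stochastic gradient in y
        # parallel (simultaneous) projected steps: both use the iterate at t
        x, y = proj_x(x - eta * gx), proj_y(y - eta * gy)
    return x, y
```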
Results
  • The novelty and significance of the results are: (i) this is the first work that comprehensively studies the stochastic optimization of a non-smooth non-convex inf-projection problem; (ii) the application to variance-based regularization demonstrates much faster convergence of the algorithms compared with existing algorithms based on a min-max formulation.
  • In order to obtain a provable convergence guarantee, the authors consider a strongly convex problem, for some γ > 0 and x0 ∈ X, whose objective function is an upper bound of the function in (5) at x0.
  • The authors first consider the convergence of SPG for solving H(z) = f(z) + R(z), where f(z) is a convex function and R(z) is a strongly convex function (a sketch of SPG follows this list).
  • Remark: The authors can apply the above convergence guarantee of SPG to f_{x_k}(x) and f_{y_k}(y), employed by St-SPG at each stage, under favorable conditions regarding their stochastic gradients.
  • Both of these are well-defined and unique due to the strong convexity of H_k. Recall that a stochastic gradient of f_{x_k}(x) can be computed as ∂g(x; ξ_g) − ∇c(x; ξ_c)^⊤ y_k, for dom(h) ⊆ R^m_+ or dom(h) ⊆ R^m_−.
  • The authors note that the SPG algorithm for solving the subproblems in Algorithm 2 can be replaced by other suitable stochastic optimization algorithms for solving a strongly convex problem, similar to the developments in Xu et al. (2018a) for minimizing DC functions.
  • One can use adaptive stochastic gradient methods to enjoy adaptive convergence, and one can use variance-reduction methods, if the involved functions are smooth and have a finite-sum structure, to achieve improved convergence.
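As referenced above, here is a minimal sketch of the SPG inner solver for a strongly convex subproblem H(z) = f(z) + R(z). It assumes R is the stagewise proximal term γ/2·‖z − z0‖², whose proximal mapping has a closed form, and that stoch_grad is a hypothetical unbiased oracle for a (sub)gradient of f; for f_{x_k} it would return something like ∂g(x; ξ_g) − ∇c(x; ξ_c)^⊤ y_k. This is an assumption-laden illustration, not the paper's exact algorithm.

```python
import numpy as np

def spg(z0, stoch_grad, gamma=1.0, T=200):
    """Sketch of SPG for H(z) = f(z) + R(z) with convex f and
    gamma-strongly convex R(z) = gamma/2 * ||z - z0||^2.

    Uses the classic O(1/(gamma*t)) step size for strongly convex problems
    and returns a weighted average of the iterates.
    """
    z0 = np.asarray(z0, float)
    z, z_bar = z0.copy(), np.zeros_like(z0)
    for t in range(1, T + 1):
        eta = 2.0 / (gamma * (t + 1))          # step size for strong convexity
        u = z - eta * stoch_grad(z)            # stochastic (sub)gradient step on f
        z = (u + eta * gamma * z0) / (1.0 + eta * gamma)  # closed-form prox of R
        z_bar += 2.0 * t / (T * (T + 1)) * z   # weights sum to 1 over t = 1..T
    return z_bar
```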
Conclusion
  • Since the loss is non-convex, the authors compare MSPG with the proximally guided stochastic mirror descent (PGSMD) of Rafique et al. (2018) and its efficient variant for solving the min-max formulation, which is non-convex and concave; the efficient variant is implemented with the same modified constraint on P and BST as BMD-eff.
  • The proposed stochastic algorithms yield significant improvements in the convergence of training/testing errors, especially on the large datasets covtype, RCV1, and URL, which can be verified by comparing the convergence of training/testing errors against CPU time.
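One way to see why an inf-projection view can sidestep the min-max formulation (18) in the variance-based regularization application: the elementary identity √v = min_{y>0} (v/(2y) + y/2), attained at y = √v, turns the square-root regularizer into a joint minimization over (θ, y). The weight λ below is a generic regularization parameter, and L̂_n and Var̂_n denote the empirical loss and its empirical variance; this is a schematic of the reformulation idea rather than the paper's exact derivation.

```latex
% For v > 0:  sqrt(v) = min_{y > 0} ( v/(2y) + y/2 ),  attained at y = sqrt(v).
\[
\min_{\theta}\ \widehat{L}_n(\theta) + \lambda\sqrt{\widehat{\mathrm{Var}}_n(\theta)}
\;=\;
\min_{\theta}\ \min_{y>0}\ \widehat{L}_n(\theta)
  + \lambda\left(\frac{\widehat{\mathrm{Var}}_n(\theta)}{2y} + \frac{y}{2}\right).
\]
```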
Tables
  • Table 1: Summary of results for finding a (nearly) ε-stationary solution in this work under different conditions on g, h, and c. SM means smooth, Lip. means Lipschitz continuous, Diff means differentiable, MO means monotonically increasing or decreasing (for h∗), CVX means convex, and UC means p-uniformly convex (p ≥ 2); ν = 1/(p − 1).
  • Table 2: Data statistics (#Examples, #Features, #pos:#neg).
References
  • Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  • Zaiyi Chen, Tianbao Yang, Jinfeng Yi, Bowen Zhou, and Enhong Chen. Universal stagewise learning for non-convex problems with convergence on averaged solutions. CoRR, abs/1808.06296, 2018.
  • Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. CoRR, abs/1803.06523, 2018a.
  • Damek Davis and Dmitriy Drusvyatskiy. Stochastic subgradient method converges at the rate O(k^{−1/4}) on weakly convex functions. CoRR, abs/1802.02988, 2018b.
  • Damek Davis and Benjamin Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. arXiv preprint arXiv:1707.03505, 2017.
  • Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
  • Jochen Gorski, Frank Pfeuffer, and Kathrin Klamroth. Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research, 66(3):373–407, 2007.
  • Ryuichi Kiryo, Gang Niu, Marthinus C. du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems 30, pages 1675–1685. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6765-positive-unlabeled-learning-with-non-negative-risk-estimator.pdf.
  • M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems 23, pages 1189–1197, 2010.
  • Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
  • Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems (NIPS), pages 2208–2216, 2016.
  • Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems (NIPS), pages 2975–2984, 2017.
  • Yu Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1):381–404, 2015. URL https://doi.org/10.1007/s10107-014-0790-0.
  • Atsushi Nitanda and Taiji Suzuki. Stochastic difference of convex algorithm and its application to training deep Boltzmann machines. In Artificial Intelligence and Statistics, pages 470–478, 2017.
  • Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. CoRR, abs/1810.02060, 2018.
  • R. Tyrrell Rockafellar and Roger J.-B. Wets. Variational Analysis. Springer Verlag, Heidelberg, Berlin, New York, 1998.
  • R. Tyrrell Rockafellar and Roger J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.
  • Shai Shalev-Shwartz and Yoram Singer. On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. Machine Learning, 80(2-3):141–163, 2010.
  • Hoai An Le Thi, Hoai Minh Le, Duy Nhat Phan, and Bach Tran. Stochastic DCA for the large-sum of non-convex functions problem and its application to group variable selection in classification. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3394–3403, 2017. URL http://proceedings.mlr.press/v70/thi17a.html.
  • Yi Xu, Qi Qi, Qihang Lin, Rong Jin, and Tianbao Yang. Stochastic optimization for DC functions and non-smooth non-convex regularizers with non-asymptotic convergence. arXiv preprint arXiv:1811.11829, 2018a.
  • Yi Xu, Shenghuo Zhu, Sen Yang, Chi Zhang, Rong Jin, and Tianbao Yang. Learning with non-convex truncated losses by SGD. CoRR, abs/1805.07880, 2018b. URL http://arxiv.org/abs/1805.07880.
  • Yi Xu, Rong Jin, and Tianbao Yang. Stochastic proximal gradient methods for non-smooth non-convex regularized problems. CoRR, abs/1902.07672, 2019.
  • Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1–9, 2015.
Proofs
  • Proof. We prove the first part; the second part was proved in Nesterov (2015). Recall that f(x_1) − f(x_2) ≤ …
  • Due to Lemma 19 of Shalev-Shwartz and Singer (2010), if φ(y) ≤ ψ(y) then φ∗(u) ≥ ψ∗(u), and thus for all u and x, f∗(u + ∇f(x)) − f∗(∇f(x)) − ⟨x, u⟩ ≥ (1 − 1/(1 + ν)) …
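The first claim in the excerpt above follows in one line from the definition of the convex conjugate; a short derivation for completeness:

```latex
% If phi(y) <= psi(y) for all y, then the conjugates satisfy phi^* >= psi^*:
\[
\psi^*(u) \;=\; \sup_{y}\,\big\{\langle u, y\rangle - \psi(y)\big\}
\;\le\; \sup_{y}\,\big\{\langle u, y\rangle - \phi(y)\big\} \;=\; \phi^*(u).
\]
```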
  • Proof. This analysis is borrowed from the proof of Theorem 2 in Xu et al. (2019); for completeness, we include it here. Let w = (x, y), ∇_x f_0(t) = ∇_x f_0(x_t, y_t), ∇_y f_0(t) = ∇_y f_0(x_t, y_t), and ∇f_0(t) = (∇_x f_0(t), ∇_y f_0(t)), with ∇̂f_0(t) = (∇̂_x f_0(t), ∇̂_y f_0(t)) denoting the stochastic counterparts. By the update x_{t+1} = Π_X[x_t − η∇̂_x f_0(t)], we know x_{t+1} = arg min_{x∈R^d} …
  • Next, by Exercise 8.8 and Theorem 10.1 of Rockafellar and Wets (1998), we know from the updates of x_{t+1} and y_{t+1} that …
  • … = ∇_x f(x, y) + ∇c(x)^⊤(y − y∗(x)), where ∇c(x) is the Jacobian matrix of c at x, and y∗(x) = arg min_{y∈dom(h)} h(y) − ⟨y, c(x)⟩ = arg min_{y∈dom(h)} f(x, y). Here y∗(x) is unique given x, since uniform convexity ensures a unique solution (∇h∗ is Hölder continuous, so h is uniformly convex). Equality 1 above is due to Theorem 10.58 of Rockafellar and Wets (2009) and the uniqueness of y∗(x).
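The display above rests on the envelope-type identity ∇F(x) = ∇_x f(x, y∗(x)) = ∇g(x) − ∇c(x)^⊤ y∗(x). A quick finite-difference sanity check on the toy instance from the Introduction (quadratic h, illustrative g and c that are not the paper's) confirms the formula:

```python
import numpy as np

# With h(y) = 0.5*||y||^2 we have y*(x) = c(x) and F(x) = g(x) - 0.5*||c(x)||^2.
g = lambda x: 0.5 * np.sum(x ** 2)
c = lambda x: np.tanh(x)
F = lambda x: g(x) - 0.5 * np.sum(c(x) ** 2)

def grad_F(x):
    y_star = c(x)
    jac_c = np.diag(1.0 - np.tanh(x) ** 2)     # Jacobian of elementwise tanh
    return x - jac_c.T @ y_star                # grad g(x) - Jac c(x)^T y*(x)

x = np.array([0.3, -1.2])
eps = 1e-6
fd = np.array([(F(x + eps * e) - F(x - eps * e)) / (2 * eps)
               for e in np.eye(x.size)])
print(np.allclose(grad_F(x), fd, atol=1e-5))   # True: the formula checks out
```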
  • Proof. This proof is similar to that of Proposition 2 in Xu et al. (2018a); for completeness, we include it here.
  • Smooth case. When f(z) is L-smooth and R(z) is γ-strongly convex, we first have the following lemma from Zhao and Zhang (2015); its proof can be found in the analysis of Lemma 7 in Xu et al. (2018a).
  • By the optimality condition for z_{t+1} and the strong convexity of the above objective function, we know for any z ∈ Ω that …
  • Taking expectations on both sides of the above inequality and using the convexity of f(z), we get …
  • Now consider the first term; we have … = ∇F(x_k) + γ(v_k − x_k) + ∇g(v_k) − ∇g(x_k) + ∇c(x_k)^⊤(y∗(x_k) − y_k), where y∗(x_k) = arg min_{y∈dom(h)} h(y) − ⟨c(x_k), y⟩. Let f_{y_k}(y) = h(y) − ⟨c(x_k), y⟩. The second equality is due to Theorem 10.13 of Rockafellar and Wets (2009) and the uniqueness of y∗(x_k) (h is uniformly convex).
  • The second equality is due to Theorem 10.13 of Rockafellar and Wets (2009) and the uniqueness of y∗(v_k) (h is uniformly convex).