
# Can Stochastic Zeroth-Order Frank-Wolfe Method Converge Faster for Non-Convex Problems?

ICML, pp. 3377–3386, 2020


Abstract

The Frank-Wolfe algorithm is an efficient method for optimizing non-convex constrained problems. However, most existing methods focus on the first-order case. In real-world applications, the gradient is not always available. To address the lack of gradients in many applications, we propose two new stochastic zeroth-order Frank-Wol...

Introduction
• The authors consider the following constrained finite-sum minimization problem:

$$\min_{x \in \Omega} \frac{1}{n} \sum_{i=1}^{n} f_i(x), \quad (1)$$

where $\Omega \subset \mathbb{R}^d$ denotes a closed convex feasible set, each component function $f_i$ is smooth and non-convex, and $n$ represents the number of component functions.
• A representative example is the robust low-rank matrix completion problem, which minimizes a correntropy-induced loss over the observed entries subject to a nuclear-norm constraint.
• Here $O$ denotes the set of observed elements, $\sigma$ is a hyperparameter, and $\|X\|_* \le R$ stands for the low-rank constraint.
• Compared with the unconstrained finite-sum minimization problem, optimizing Eq. (1) has to deal with the constraint, which introduces new challenges.
• A straightforward method for optimizing the large-scale Eq. (1) is projected gradient descent, which first takes a step along the negative gradient direction and then performs a projection to satisfy the constraint.
• The Frank-Wolfe method has been widely used for optimizing Eq. (1).
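Unlike projected gradient descent, a Frank-Wolfe step replaces the projection with a call to a linear minimization oracle (LMO) over the feasible set. The following is a minimal sketch of a single step over an $\ell_1$-ball (the constraint appearing in the MCCR model later in this summary); the gradient oracle, radius, and step size here are illustrative assumptions, not the authors' exact algorithm:

```python
import numpy as np

def lmo_l1_ball(grad, radius):
    """Linear minimization oracle for {x : ||x||_1 <= radius}:
    argmin_s <grad, s> is a signed vertex along the coordinate
    with the largest absolute gradient entry."""
    s = np.zeros_like(grad)
    i = np.argmax(np.abs(grad))
    s[i] = -radius * np.sign(grad[i])
    return s

def frank_wolfe_step(x, grad, radius, step):
    """One Frank-Wolfe update: move toward the LMO vertex
    instead of projecting back onto the constraint set."""
    s = lmo_l1_ball(grad, radius)
    return x + step * (s - x)
```

Because the iterate is a convex combination of feasible points, it stays feasible without any projection, which is why the method is attractive for constraints (like the nuclear-norm ball) whose projection is expensive.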
Highlights
• In this paper, we consider the following constrained finite-sum minimization problem:

$$\min_{x \in \Omega} \frac{1}{n} \sum_{i=1}^{n} f_i(x), \quad (1)$$

where $\Omega \subset \mathbb{R}^d$ denotes a closed convex feasible set, each component function $f_i$ is smooth and non-convex, and $n$ represents the number of component functions
• The component function is a non-convex function that is less sensitive to the residual than the least-squares loss
• A straightforward method for optimizing the large-scale Eq. (1) is projected gradient descent, which first takes a step along the negative gradient direction and then performs a projection to satisfy the constraint
• Unlike projected gradient descent, the Frank-Wolfe method (Frank & Wolfe, 1956) is more efficient when dealing with the constraint
• We propose a new faster conditional gradient sliding (FCGS) method in Algorithm 4
• We focus on the non-convex maximum correntropy criterion induced regression (MCCR) (Feng et al., 2015) model given in Eq. (27)
Methods
• The authors focus on the non-convex maximum correntropy criterion induced regression (MCCR) (Feng et al., 2015) model as follows:

$$\min_{\|x\|_1 \le s} \frac{1}{n} \sum_{i=1}^{n} \frac{\sigma^2}{2} \left(1 - \exp\left(-\frac{(a_i^\top x - b_i)^2}{\sigma^2}\right)\right), \quad (27)$$

• where $(a_i, b_i)$ denote the training samples, and $\sigma$ and $s$ are hyper-parameters.
• In the experiments for zeroth-order methods, the authors treat the loss function as a black-box function, meaning that only function values are available.
• In the experiments for first-order methods, both function values and gradients are available.
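Since the zeroth-order setting treats the loss as a black box, the gradient must be estimated from function values alone. Below is a sketch of an MCCR-style loss together with a coordinate-wise finite-difference estimator of the kind used in zeroth-order methods; the data `(A, b)` and the smoothing parameter `mu` are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def mccr_loss(x, A, b, sigma):
    """Correntropy-induced loss: each term is bounded by sigma^2/2,
    so large residuals are penalized less than under least squares."""
    r = A @ x - b
    return np.mean(sigma**2 / 2.0 * (1.0 - np.exp(-r**2 / sigma**2)))

def zo_gradient(f, x, mu=1e-5):
    """Coordinate-wise zeroth-order gradient estimate: 2d function
    queries per estimate, no gradient oracle needed."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = mu
        g[j] = (f(x + e) - f(x - e)) / (2.0 * mu)
    return g
```

For a smooth function, the central difference matches the true partial derivative up to $O(\mu^2)$, which is why the function-query complexity (FQO) of zeroth-order methods scales with the dimension $d$.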
Results
• Zeroth-Order Methods: The convergence results of the zeroth-order methods are reported in Figures 1(a) and 1(b).
• The proposed methods outperform the baseline method significantly.
• FZFW converges faster than ZSCG because FZFW utilizes a variance-reduced gradient estimator while ZSCG does not.
• The proposed FZCSG outperforms FZFW because FZCSG incorporates the acceleration technique.
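The variance-reduced estimator mentioned above can be illustrated with a SPIDER-style recursive update (Fang et al., 2018), the family of estimators this line of work builds on; the minibatch-gradient callback here is an assumed generic interface, not the authors' exact update:

```python
import numpy as np

def spider_estimator(v_prev, x, x_prev, minibatch_grad):
    """SPIDER-style recursive variance-reduced estimate:
        v_k = g_B(x_k) - g_B(x_{k-1}) + v_{k-1},
    where g_B is a (possibly zeroth-order) minibatch gradient
    evaluated on the same batch at both points. The correction
    term cancels much of the minibatch noise when consecutive
    iterates are close."""
    return minibatch_grad(x) - minibatch_grad(x_prev) + v_prev
```

With a deterministic gradient and an exact previous estimate, the recursion reproduces the true gradient exactly; with stochastic minibatches, its variance shrinks with the distance between consecutive iterates, which is what yields the improved query complexity.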
Conclusion
• The authors improved the convergence rate of the stochastic zeroth-order Frank-Wolfe method.
• The authors proposed two algorithms for zeroth-order Frank-Wolfe optimization.
• Both of them significantly improve the function-query oracle complexity over existing methods.
• The authors also improved the accelerated stochastic zeroth-order Frank-Wolfe method to achieve a better IFO complexity.
• Experimental results confirm the effectiveness of the proposed methods.
Tables
• Table1: Convergence rate of different zeroth-order algorithms
• Table2: Convergence rate of different first-order conditional gradient sliding algorithms
Funding
• This work was partially supported by U.S. NSF grants IIS 1836945, IIS 1836938, IIS 1845666, IIS 1852606, IIS 1838627, and IIS 1837956
Reference
• Clarkson, K. L. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):63, 2010.
• Duchi, J. C., Jordan, M. I., Wainwright, M. J., and Wibisono, A. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
• Dvurechensky, P., Gasnikov, A., and Gorbunov, E. An accelerated method for derivative-free smooth stochastic convex optimization. arXiv preprint arXiv:1802.09022, 2018.
• Fang, C., Li, C. J., Lin, Z., and Zhang, T. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pp. 689–699, 2018.
• Feng, Y., Huang, X., Shi, L., Yang, Y., and Suykens, J. A. Learning with the maximum correntropy criterion induced losses for regression. Journal of Machine Learning Research, 16:993–1034, 2015.
• Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2): 95–110, 1956.
• Gao, X., Jiang, B., and Zhang, S. On the information-adaptive variants of the ADMM: an iteration complexity perspective. Journal of Scientific Computing, 76(1):327–363, 2018.
• Ghadimi, S. and Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
• Hajinezhad, D., Hong, M., and Garcia, A. Zeroth order nonconvex multi-agent optimization over networks. arXiv preprint arXiv:1710.09997, 2017.
• Hassani, H., Karbasi, A., Mokhtari, A., and Shen, Z. Stochastic conditional gradient++. arXiv preprint arXiv:1902.06992, 2019.
• Hazan, E. and Luo, H. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, pp. 1263–1271, 2016.
• Ji, K., Wang, Z., Zhou, Y., and Liang, Y. Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization. arXiv preprint arXiv:1910.12166, 2019.
• Lacoste-Julien, S. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.
• Lacoste-Julien, S. and Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pp. 496–504, 2015.
• Lan, G. and Zhou, Y. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 26(2):1379–1409, 2016.
• Lei, L., Ju, C., Chen, J., and Jordan, M. I. Non-convex finite-sum optimization via scsg methods. In Advances in Neural Information Processing Systems, pp. 2348–2358, 2017.
• Lian, X., Zhang, H., Hsieh, C.-J., Huang, Y., and Liu, J. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in Neural Information Processing Systems, pp. 3054–3062, 2016.
• Liu, S., Kailkhura, B., Chen, P.-Y., Ting, P., Chang, S., and Amini, L. Zeroth-order stochastic variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 3727–3737, 2018.
• Mokhtari, A., Hassani, H., and Karbasi, A. Stochastic conditional gradient methods: From convex minimization to submodular maximization. arXiv preprint arXiv:1804.09554, 2018.
• Nesterov, Y. and Spokoiny, V. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
• Nguyen, L. M., Liu, J., Scheinberg, K., and Takac, M. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2613–2621. JMLR.org, 2017.
• Qu, C., Li, Y., and Xu, H. Non-convex conditional gradient sliding. arXiv preprint arXiv:1708.04783, 2017.
• Reddi, S. J., Sra, S., Poczos, B., and Smola, A. Stochastic Frank-Wolfe methods for nonconvex optimization. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1244–1251. IEEE, 2016.
• Sahu, A. K., Zaheer, M., and Kar, S. Towards gradient free and projection free stochastic optimization. arXiv preprint arXiv:1810.03233, 2018.
• Shamir, O. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18(52):1–11, 2017.
• Shen, Z., Fang, C., Zhao, P., Huang, J., and Qian, H. Complexities in projection-free stochastic non-convex minimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2868–2876, 2019.
• Wang, Y., Du, S., Balakrishnan, S., and Singh, A. Stochastic zeroth-order optimization in high dimensions. arXiv preprint arXiv:1710.10551, 2017.
• Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. Spiderboost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690, 2018.
• Yurtsever, A., Sra, S., and Cevher, V. Conditional gradient methods via stochastic path-integrated differential estimator. In Proceedings of the International Conference on Machine Learning (ICML), 2019.
• Zhang, M., Shen, Z., Mokhtari, A., Hassani, H., and Karbasi, A. One sample stochastic frank-wolfe. arXiv preprint arXiv:1910.04322, 2019.