# A Continuous-Time Mirror Descent Approach to Sparse Phase Retrieval

NeurIPS 2020

Abstract

We analyze continuous-time mirror descent applied to sparse phase retrieval, which is the problem of recovering sparse signals from a set of magnitude-only measurements. We apply mirror descent to the unconstrained empirical risk minimization problem (batch setting), using the square loss and square measurements. We provide a convergence …

Introduction

- Mirror descent [39] is becoming increasingly popular in a variety of settings in optimization and machine learning.
- Mirror descent has previously been analyzed in a range of non-convex and stochastic settings [20, 18, 23, 24, 32, 33, 37, 61, 63, 64]; the authors contribute to this literature by analyzing continuous-time mirror descent in the non-convex problem of sparse phase retrieval.
- In Theorem 2, the initial stage of linear convergence of the Bregman divergence corresponds to the variables X_i(t) on the support being fitted; to establish linear convergence, the authors crucially use the bound (8) of Lemma 1 along with the fact that the second term ‖X_{Sᶜ}(t)‖₁ is negligibly small compared to ‖X_S(t) − x⋆_S‖₂².
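The Bregman divergence that drives this argument is the one induced by the hypentropy mirror map. As a minimal sketch (the closed forms below are standard for the hypentropy potential; the function names are my own), the potential, its gradient, and the resulting Bregman divergence can be written as:

```python
import numpy as np

def hypentropy(x, beta):
    # Hypentropy potential: Phi_beta(x) = sum_i [ x_i * arcsinh(x_i / beta) - sqrt(x_i^2 + beta^2) ]
    return np.sum(x * np.arcsinh(x / beta) - np.sqrt(x**2 + beta**2))

def grad_hypentropy(x, beta):
    # Gradient: (grad Phi_beta(x))_i = arcsinh(x_i / beta)
    return np.arcsinh(x / beta)

def bregman(x, y, beta):
    # Bregman divergence: D_Phi(x, y) = Phi(x) - Phi(y) - <grad Phi(y), x - y>
    return hypentropy(x, beta) - hypentropy(y, beta) - grad_hypentropy(y, beta) @ (x - y)
```

Since Φ_β is strictly convex (its Hessian is diag(1/√(x_i² + β²)) ≻ 0), D_Φ(x, y) ≥ 0 with equality only at x = y, which is what makes it a natural progress measure for the two-stage convergence argument.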

Highlights

- Mirror descent [39] is becoming increasingly popular in a variety of settings in optimization and machine learning
- While the variational coherence property as defined in [63, 64] precludes the existence of saddle points and is not satisfied in the sparse phase retrieval problem, we show that the defining inequality holds along the trajectory of mirror descent, which is what allows us to establish convergence
- We provided a convergence analysis of continuous-time mirror descent applied to sparse phase retrieval
- We proved that, equipped with the hypentropy mirror map, mirror descent recovers any k-sparse signal x⋆ ∈ ℝⁿ with x_min = Ω(1/√k) from O(k²) Gaussian measurements
- As Hadamard Wirtinger flow (HWF) can be recovered as a discrete-time first-order approximation to the mirror descent algorithm we analyzed, our results provide a principled theoretical understanding of HWF
- Our continuous-time analysis suggests how the initialization size in HWF affects convergence, and that choosing the initialization size sufficiently small can result in far fewer iterations being necessary to reach any given precision ε > 0
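To make these highlights concrete, here is a hedged sketch of the discretized dynamics on a toy instance: forward-Euler steps of mirror descent under the hypentropy mirror map, applied to the empirical risk with square loss and square measurements. The problem sizes, the HWF-style single-coordinate initialization, the step size, and the iteration count are all illustrative assumptions, not the paper's tuned choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse phase retrieval instance (sizes are assumptions for illustration).
n, m = 20, 120
x_star = np.zeros(n)
x_star[[3, 11]] = [1.5, -1.0]        # a k-sparse signal (k = 2), recoverable only up to global sign
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2                # magnitude-only (squared) measurements

def loss(X):
    # Empirical risk: F(X) = (1/4m) * sum_j ((a_j^T X)^2 - y_j)^2
    return np.sum(((A @ X) ** 2 - y) ** 2) / (4 * m)

def grad(X):
    r = A @ X
    return A.T @ ((r**2 - y) * r) / m

# HWF-style initialization: place a small mass beta on the coordinate most
# correlated with the measurements; beta doubles as the mirror-map parameter.
beta, eta, T = 1e-3, 0.03, 15000     # assumed (untuned) initialization size, step size, iterations
X0 = np.zeros(n)
X0[np.argmax((A**2).T @ y)] = beta

# Forward-Euler discretization of hypentropy mirror descent:
#   d/dt arcsinh(X/beta) = -grad F(X)   ==>   z <- z - eta * grad(beta * sinh(z))
z = np.arcsinh(X0 / beta)
for _ in range(T):
    z -= eta * grad(beta * np.sinh(z))
X = beta * np.sinh(z)

# Recovery error up to the global sign ambiguity of phase retrieval.
err = min(np.linalg.norm(X - x_star), np.linalg.norm(X + x_star))
```

Shrinking beta lengthens the initial search stage but tightens the precision of the final estimate, matching the role of the initialization size discussed above.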

Results

- Scaling with signal magnitude: when analyzing the convergence speed of continuous-time mirror descent equipped with the hypentropy mirror map for sparse phase retrieval, time enters the bounds only through the product t‖x⋆‖₂², so signals of larger magnitude are recovered proportionally faster.
- When considering the algorithm in discrete time, this suggests that the step size should scale like 1/‖x⋆‖₂²; similar observations have been made in the case of gradient descent for phase retrieval [36], where the step size also scales as 1/‖x⋆‖₂².
- Similar to the discrete case [25], a brief computation shows that the exponentiated gradient algorithm EG± (17) with initialization (18) is equivalent to mirror descent (15) with initialization
- Theorem 2 implies that the precision up to which convergence is linear is controlled by the mirror map parameter β or, equivalently, by the initialization size in HWF.
- The authors provided a convergence analysis of continuous-time mirror descent applied to sparse phase retrieval.
- The authors' continuous-time analysis suggests how the initialization size in HWF affects convergence, and that choosing the initialization size sufficiently small can result in far fewer iterations being necessary to reach any given precision ε > 0.
- In Appendix A, the authors show the equivalence between continuous-time mirror descent equipped with the hypentropy mirror map and the exponentiated gradient algorithm described in Section 5.
- The authors provide three supporting lemmas characterizing the behavior of mirror descent, which will be useful to prove Theorem 2.
- Guided by the analysis of the population dynamics and the fact that Lemma 1 plays a central role in bounding (29) in terms of the Bregman divergence D_Φ(x⋆, X(t)), the authors divide the analysis of the convergence of mirror descent into two stages.
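The equivalence between EG± and hypentropy mirror descent noted above can be checked numerically: with initialization u₀ = v₀ = β/2 the product u·v is invariant under the EG± updates, and X = u − v traces exactly the hypentropy mirror-descent iterates. The following sketch (the step functions and constants are my own naming, not the paper's equations (15)–(18)) verifies this on an arbitrary gradient sequence.

```python
import numpy as np

def md_hypentropy_step(X, g, beta, eta):
    # Mirror-descent step under the hypentropy mirror map:
    #   arcsinh(X/beta) <- arcsinh(X/beta) - eta * g
    return beta * np.sinh(np.arcsinh(X / beta) - eta * g)

def eg_pm_step(u, v, g, eta):
    # EG± step: multiplicative updates on the positive and negative parts, X = u - v.
    return u * np.exp(-eta * g), v * np.exp(eta * g)

# With u0 = v0 = beta/2, the product u*v stays (beta/2)^2 forever, since the
# factors exp(-eta*g) and exp(+eta*g) cancel; X = u - v then follows exactly
# the hypentropy mirror-descent trajectory.
beta, eta = 0.01, 0.1
u = v = np.full(3, beta / 2)
X = u - v
for g in [np.array([1.0, -2.0, 0.5]), np.array([-0.3, 0.7, 1.2])]:
    u, v = eg_pm_step(u, v, g, eta)
    X = md_hypentropy_step(X, g, beta, eta)
assert np.allclose(X, u - v)
```

The identity follows from β sinh(z − ηg) = (u − v)cosh(ηg) − (u + v)sinh(ηg) = u·e^{−ηg} − v·e^{ηg}, using β sinh(z) = u − v and β cosh(z) = u + v.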

Conclusion

- The first sum is bounded by Lemma 10: recalling m ≥ c₁(γ)k² log² n, the authors have, with probability at least 1 − c₄n⁻¹³, a uniform bound over all l ∉ S.
- The remaining terms, which involve products of the form (Aᵀ_{j,S} x⋆_S)³(Aᵀ_{j,Sᶜ} X_{Sᶜ}), can all be bounded analogously.
- The authors leave a full theoretical investigation of HWF, with a proper discussion of step-size tuning, for future work.

Funding

- Fan Wu is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1)

References

- [1] A. Ali, J. Z. Kolter, and R. J. Tibshirani. A continuous-time view of early stopping for least squares. In International Conference on Artificial Intelligence and Statistics, pages 1370–1378, 2019.
- [2] A. Ali, E. Dobriban, and R. J. Tibshirani. The implicit regularization of stochastic gradient flow for least squares. arXiv preprint arXiv:2003.07802, 2020.
- [3] E. Amid and M. K. Warmuth. Interpolating between gradient descent and exponentiated gradient using reparametrized gradient descent. arXiv preprint arXiv:2002.10487, 2020.
- [4] S. Arora, N. Cohen, W. Hu, and Y. Luo. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems, pages 7411–7422, 2019.
- [5] J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(94):2785–2836, 2010.
- [6] J.-Y. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2013.
- [7] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
- [8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
- [9] S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8:231–358, 2015.
- [10] O. Bunk, A. Diaz, F. Pfeiffer, C. David, B. Schmitt, D. K. Satapathy, and J. F. Veen. Diffractive imaging for periodic samples: Retrieving one-dimensional concentration profiles across microfluidic channels. Acta Crystallographica Section A: Foundations of Crystallography, 63(4):306–314, 2007.
- [11] T. Cai, X. Li, and Z. Ma. Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. Annals of Statistics, 44(5):2221–2251, 2016.
- [12] E. J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.
- [13] G. Chen and M. Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, 1993.
- [14] Y. Chen and E. J. Candès. Solving random quadratic systems of equations is nearly as easy as solving linear systems. In Advances in Neural Information Processing Systems, pages 739–747, 2015.
- [15] L. Chizat and F. Bach. On the global convergence of gradient descent for over-parametrized models using optimal transport. In Advances in Neural Information Processing Systems, pages 3036–3046, 2018.
- [16] F. Chung and L. Lu. Concentration inequalities and martingale inequalities: A survey. Internet Mathematics, 3(1):79–127, 2006.
- [17] J. V. Corbett. The Pauli problem, state reconstruction and quantum real numbers. Reports on Mathematical Physics, 57(1):53–68, 2006.
- [18] C. D. Dang and G. Lan. Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM Journal on Optimization, 25(2):856–881, 2015.
- [19] … on weakly convex functions. arXiv preprint arXiv:1802.02988, 2018.
- [20] D. Davis and B. Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM Journal on Optimization, 29(3):1908–1930, 2019.
- [21] J. R. Fienup. Phase retrieval algorithms: A comparison. Applied Optics, 21(15):2758–2769, 1982.
- [22] D. J. Fresen. Variations and extensions of the gaussian concentration inequality. arXiv preprint arXiv:1812.10938, 2018.
- [23] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- [24] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1–2):267–305, 2016.
- [25] U. Ghai, E. Hazan, and Y. Singer. Exponentiated gradient meets gradient descent. In International Conference on Algorithmic Learning Theory, pages 386–407, 2020.
- [26] S. Gunasekar, B. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.
- [27] S. Gunasekar, B. Woodworth, and N. Srebro. Mirrorless mirror descent: a more natural discretization of riemannian gradient flow. arXiv preprint arXiv:2004.01025, 2020.
- [28] P. Hand and V. Voroninski. Compressed sensing from phaseless Gaussian measurements via linear programming in the natural parameter spaces. arXiv preprint arXiv:1611.05985, 2016.
- [29] P. D. Hoff. Lasso, fractional norm and structured sparse estimation using a Hadamard product parametrization. Computational Statistics & Data Analysis, 115:186–198, 2017.
- [30] K. Jaganathan, Y. C. Eldar, and B. Hassibi. Phase retrieval: An overview of recent developments. In A. Stern, editor, Optical Compressive Imaging, chapter 13, pages 263–296. Taylor & Francis Group, Boca Raton, FL, 2016.
- [31] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
- [32] W. Kotłowski and G. Neu. Bandit principal component analysis. arXiv preprint arXiv:1902.03035, 2019.
- [33] W. Krichene, M. Balandat, C. Tomlin, and A. Bayen. The hedge algorithm on a continuum. In International Conference on Machine Learning, pages 824–832, 2015.
- [34] X. Li and V. Voroninski. Sparse signal recovery from quadratic measurements via convex programming. SIAM Journal on Mathematical Analysis, 45(5):3019–3033, 2013.
- [35] Y. Li, T. Ma, and H. Zhang. Algorithmic regularization in over-parametrized matrix sensing and neural networks with quadratic activation. In Conference on Learning Theory, pages 2–47, 2018.
- [36] C. Ma, K. Wang, Y. Chi, and Y. Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In International Conference on Machine Learning, pages 3345–3354, 2018.
- [37] O.-A. Maillard and R. Munos. Online learning in adversarial lipschitz environments. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 305–320, 2010.
- [38] S. Mei, T. Misiakiewicz, and A. Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory, pages 1–77, 2019.
- [39] A. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.
- [40] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- [41] P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. IEEE Transactions on Signal Processing, 63(18):4814–4826, 2015.
- [42] H. Ohlsson, A. Y. Yang, R. Dong, and S. S. Sastry. CPRL–an extension of compressive sensing to the phase retrieval problem. In Advances in Neural Information Processing Systems, pages 1367–1375, 2012.
- [43] D. W. Peterson. A review of constraint qualifications in finite-dimensional spaces. SIAM Review, 15(3):639–654, 1973.
- [44] M. Raginsky and J. Bouvrie. Continuous-time stochastic mirror descent on a network: variance reduction, consensus, convergence. In IEEE Conference on Decision and Control, pages 6793–6800, 2012.
- [45] G. M. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of neural networks: an interacting particle system approach. arXiv preprint arXiv:1805.00915, 2018.
- [46] Y. Shechtman, A. Beck, and Y. C. Eldar. GESPAR: Efficient phase retrieval of sparse signals. IEEE Transactions on Signal Processing, 62(4):928–938, 2014.
- [47] P. Schniter and S. Rangan. Compressive phase retrieval via generalized approximate message passing. IEEE Transactions on Signal Processing, 63(4):1043–1055, 2015.
- [48] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4:107–194, 2015.
- [49] J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.
- [50] A. Suggala, A. Prasad, and P. K. Ravikumar. Connecting optimization and regularization paths. In Advances in Neural Information Processing Systems, pages 10608–10619, 2018.
- [51] T. Vaškevičius, V. Kanade, and P. Rebeschini. Implicit regularization for optimal sparse recovery. In Advances in Neural Information Processing Systems, pages 2968–2979, 2019.
- [52] T. Vaškevičius, V. Kanade, and P. Rebeschini. The statistical complexity of early stopped mirror descent. arXiv preprint arXiv:2002.00189, 2020.
- [53] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, Theory and Applications, chapter 5, pages 210–268. Cambridge University Press, Cambridge, 2012.
- [54] A. Walther. The question of phase retrieval in optics. Optica Acta, 10(1):41–49, 1963.
- [55] G. Wang, G. B. Giannakis, and Y. C. Eldar. Solving systems of random quadratic equations via truncated amplitude flow. IEEE Transactions on Information Theory, 64(2):773–794, 2017.
- [56] G. Wang, L. Zhang, G. B. Giannakis, M. Akçakaya, and J. Chen. Sparse phase retrieval via truncated amplitude flow. IEEE Transactions on Signal Processing, 66(2):479–491, 2018.
- [57] M. K. Warmuth and A. Jagota. Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence. In Electronic Proceedings of Fifth International Symposium on Artificial Intelligence and Mathematics, 1998.
- [58] F. Wu and P. Rebeschini. Hadamard Wirtinger flow for sparse phase retrieval. arXiv preprint arXiv:2006.01065, 2020.
- [59] Z. Yuan, H. Wang, and Q. Wang. Phase retrieval via sparse Wirtinger flow. Journal of Computational and Applied Mathematics, 355:162–173, 2019.
- [60] L. Zhang, G. Wang, G. B. Giannakis, and J. Chen. Compressive phase retrieval via reweighted amplitude flow. IEEE Transactions on Signal Processing, 66(19):5029–5040, 2018.
- [61] S. Zhang and N. He. On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization. arXiv preprint arXiv:1806.04781, 2018.
- [62] P. Zhao, Y. Yang, and Q.-C. He. Implicit regularization via Hadamard product overparametrization in high-dimensional linear regression. arXiv preprint arXiv:1903.09367, 2019.
- [63] Z. Zhou, P. Mertikopoulos, N. Bambos, S. P. Boyd, and P. W. Glynn. Stochastic mirror descent in variationally coherent optimization problems. In Advances in Neural Information Processing Systems, pages 7040–7049, 2017.
- [64] Z. Zhou, P. Mertikopoulos, N. Bambos, S. P. Boyd, and P. W. Glynn. On the convergence of mirror descent beyond stochastic convex programming. SIAM Journal on Optimization, 30(1):687–716, 2020.
