Gradientless Descent: High-Dimensional Zeroth-Order Optimization

ICLR, 2020.


Abstract:

Zeroth-order optimization is the process of minimizing an objective $f(x)$, given oracle access to evaluations at adaptively chosen inputs $x$. In this paper, we present two simple yet powerful GradientLess Descent (GLD) algorithms that do not rely on an underlying gradient estimate and are numerically stable. We analyze our algorithm fro...

Introduction
  • The authors consider the problem of zeroth-order optimization, where the goal is to minimize an objective function $f : \mathbb{R}^n \to \mathbb{R}$ with as few evaluations of $f(x)$ as possible.
  • First-order techniques such as variance reduction (Liu et al, 2018), conditional gradients (Balasubramanian & Ghadimi, 2018), and diagonal preconditioning (Mania et al, 2018) have been successfully adopted in this setting.
  • This class of algorithms is known as stochastic search, random search, or evolutionary strategies, and has been augmented with a variety of heuristics, such as the popular
Highlights
  • We consider the problem of zeroth-order optimization, where our goal is to minimize an objective function $f : \mathbb{R}^n \to \mathbb{R}$ with as few evaluations of $f(x)$ as possible.
  • We present GradientLess Descent (GLD), a class of truly gradient-free algorithms that are parameter free and provably fast.
  • We introduced GradientLess Descent, a robust zeroth-order optimization algorithm that is simple and efficient, and we show strong theoretical convergence bounds via our novel geometric analysis.
  • GLD could use momentum terms to keep moving in the same direction that improved the objective, or sample from adaptively chosen ellipsoids, similarly to adaptive gradient methods (Duchi et al, 2011; McMahan & Streeter, 2010).
  • GradientLess Descent could be combined with random restarts or other restart policies developed for gradient descent.
  • By analogy to adaptive per-coordinate learning rates (Duchi et al, 2011; McMahan & Streeter, 2010), one could adaptively change the shape of the balls being sampled into ellipsoids with various length-scale factors.
Methods
  • The authors tested the GLD algorithms on a simple class of objective functions and compared them to Accelerated Random Search (ARS) by Nesterov & Spokoiny (2011), which has linear convergence guarantees on strongly convex and strongly smooth functions.
  • The authors' main conclusion is that GLD-Fast is comparable to ARS and tends to achieve a reasonably low error much faster than ARS in high dimensions (≥ 50).
  • GLD-Search is competitive with GLD-Fast and ARS even though it requires no information about the function.
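
To make the comparison-based idea concrete, the following is a minimal sketch of a ball-sampling descent step in the spirit of GLD-Search, not the authors' implementation. It assumes that each step tries one uniformly sampled point per radius on a halving schedule and keeps the best point seen; the function names, the test objective `quadratic`, and the radius bounds are all hypothetical choices made for illustration.

```python
import math
import random

def sample_in_ball(n, radius, rng):
    """Draw a point uniformly from the n-dimensional ball of the given radius."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    # A random direction scaled by U^(1/n) is uniform inside the ball.
    scale = radius * rng.random() ** (1.0 / n) / norm
    return [v * scale for v in g]

def ball_search_step(f, x, fx, max_radius, min_radius, rng):
    """Try one candidate per radius scale; keep the best point (assumed schedule)."""
    num_scales = int(math.ceil(math.log2(max_radius / min_radius))) + 1
    best_x, best_f = x, fx
    for k in range(num_scales):
        step = sample_in_ball(len(x), max_radius * 2.0 ** (-k), rng)
        cand = [xi + si for xi, si in zip(x, step)]
        fc = f(cand)
        # Only comparisons of function values are used, never a gradient estimate.
        if fc < best_f:
            best_x, best_f = cand, fc
    return best_x, best_f

def quadratic(x):
    """Hypothetical test objective: a mildly ill-conditioned convex quadratic."""
    return sum((i + 1) * xi * xi for i, xi in enumerate(x))

rng = random.Random(0)
x = [1.0] * 10
fx = quadratic(x)
history = [fx]
for _ in range(200):
    x, fx = ball_search_step(quadratic, x, fx, 2.0, 1e-3, rng)
    history.append(fx)
```

Because a candidate is accepted only when it improves the objective, `history` is non-increasing by construction. The sketch needs no smoothness or curvature constants, only a range of radii to search over.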
Conclusion
  • The authors introduced GLD, a robust zeroth-order optimization algorithm that is simple and efficient, and showed strong theoretical convergence bounds via a novel geometric analysis.
  • Just as one may decay or adaptively vary learning rates for gradient descent, one might similarly change the distribution from which the ball-sampling radii are chosen, perhaps shrinking the minimum radius as the algorithm progresses, or concentrating more probability mass on smaller radii.
  • By analogy to adaptive per-coordinate learning rates (Duchi et al, 2011; McMahan & Streeter, 2010), one could adaptively change the shape of the balls being sampled into ellipsoids with various length-scale factors.
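
The ellipsoid idea can be illustrated with a short sketch. This is an assumption about how such sampling might be done, not something the paper specifies: a point is drawn uniformly from a ball and each coordinate is then stretched by its own length-scale factor, which yields a uniform sample from an axis-aligned ellipsoid. The name `sample_in_ellipsoid` and the particular `length_scales` are hypothetical.

```python
import math
import random

def sample_in_ellipsoid(radius, length_scales, rng):
    """Uniform sample from the axis-aligned ellipsoid with semi-axes radius * s_i."""
    n = len(length_scales)
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    r = radius * rng.random() ** (1.0 / n)  # uniform radius law inside a ball
    # A linear map of a uniform ball sample is uniform on the image ellipsoid.
    return [r * gi / norm * s for gi, s in zip(g, length_scales)]

rng = random.Random(0)
length_scales = [1.0, 0.5, 0.1, 2.0]  # hypothetical per-coordinate factors
points = [sample_in_ellipsoid(0.5, length_scales, rng) for _ in range(1000)]
```

How the length scales would be adapted over time (for example from the history of accepted steps) is left open here, as it is in the text above.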
Tables
  • Table 1: Comparison of zeroth-order optimization for well-conditioned convex functions, where $R = \|x_0 - x^*\|$ and $F = f(x_0) - f(x^*)$. The ‘Monotone’ column indicates invariance under monotone transformations (Definition 4). The ‘k-Sparse’ and ‘k-Affine’ columns indicate that the iteration complexity is $\mathrm{poly}(k, \log(n))$ when $f(x)$ depends only on a k-sparse subset of coordinates or on a rank-k affine subspace.
  • Table 2: Final rewards by GLD with linear (L) and deep (H41) policies on Mujoco benchmarks show that GLD is competitive. We apply an affine projection on HalfCheetah to test affine invariance. We use the reward threshold found from (Mania et al, 2018) with Reacher's threshold (Schulman et al, 2017) for a reasonable baseline.
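
The ‘Monotone’ column can be illustrated with a small sketch. It assumes that monotone invariance means the algorithm behaves identically when the objective is replaced by a strictly increasing transformation of itself; a method that only compares function values has this property because the ordering of candidates is unchanged. The functions `f`, `transformed`, and `best_candidate` are hypothetical names for this example.

```python
import math
import random

def best_candidate(objective, candidates):
    """Index of the candidate with the smallest objective value (comparisons only)."""
    return min(range(len(candidates)), key=lambda i: objective(candidates[i]))

def f(x):
    """Hypothetical objective: squared Euclidean norm."""
    return sum(xi * xi for xi in x)

def transformed(x):
    """Strictly increasing transformation of f: exp(f(x)) - 1."""
    return math.expm1(f(x))

rng = random.Random(1)
candidates = [[rng.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(20)]
same_choice = best_candidate(f, candidates) == best_candidate(transformed, candidates)
```

A method built on finite-difference gradient estimates would generally not pick the same step under both objectives, since the transformation rescales differences of function values.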
Funding
  • Presents two simple yet powerful GradientLess Descent algorithms that do not rely on an underlying gradient estimate and are numerically stable
  • Presents GradientLess Descent, a class of truly gradient-free algorithms
  • Presents a novel analysis that relies on facts in high-dimensional geometry and can be viewed as a geometric analysis of gradient-free algorithms, recovering the standard convergence rates and step sizes
  • Shows that the fast convergence rates are robust and hold even under the more realistic assumption that $f(x) = g(P_A x) + h(x)$ with $h(x)$ sufficiently small
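
As an illustration of the structure $f(x) = g(P_A x) + h(x)$, the sketch below builds one such objective. It is an assumed instance, not taken from the paper: $P_A$ is taken to be the orthogonal projection onto the span of two orthonormal rows, `g` is a squared norm, and `h` is a small bounded perturbation. Moving `x` only along directions orthogonal to those rows should then change `f` by at most twice the bound on `h`.

```python
import math

def project(rows, x):
    """P_A x for a matrix A with orthonormal rows: sum of <a, x> a over the rows."""
    out = [0.0] * len(x)
    for a in rows:
        c = sum(ai * xi for ai, xi in zip(a, x))
        out = [o + c * ai for o, ai in zip(out, a)]
    return out

def g(y):
    """Hypothetical low-dimensional part: squared norm of the projected point."""
    return sum(v * v for v in y)

def h(x, eps=1e-3):
    """Hypothetical small perturbation, bounded in absolute value by eps."""
    return eps * math.sin(sum(x))

def f(x, rows):
    return g(project(rows, x)) + h(x)

s = 1.0 / math.sqrt(2.0)
rows = [[s, s, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]]      # rank-2 subspace of R^6
x = [0.3, -0.2, 0.5, 1.0, -1.0, 2.0]
x_shifted = [0.3, -0.2, 0.5, 4.0, 2.0, -3.0]  # differs only outside the subspace
gap = abs(f(x, rows) - f(x_shifted, rows))
```

Here `gap` is governed entirely by `h`, so the objective is effectively two-dimensional even though it is defined on six coordinates.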
Reference
  • Kenneth J Arrow and Alain C Enthoven. Quasi-concave programming. Econometrica: Journal of the Econometric Society, pp. 779–800, 1961.
  • Anne Auger and Nikolaus Hansen. A restart cma evolution strategy with increasing population size. In Evolutionary Computation, 2005. The 2005 IEEE Congress on, volume 2, pp. 1769–1776. IEEE, 2005.
  • Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. In Advances in Neural Information Processing Systems, pp. 3455–3464, 2018.
  • Samuel H Brooks. A discussion of random methods for seeking maxima. Operations research, 6 (2):244–251, 1958.
  • Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26. ACM, 2017.
  • Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. arXiv preprint arXiv:1804.02395, 2018.
  • Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional gaussian process bandits. In Advances in Neural Information Processing Systems, pp. 1025–1033, 2013.
  • Mahdi Dodangeh and Luís N Vicente. Worst case complexity of direct search under convexity. Mathematical Programming, 155(1-2):307–332, 2016.
  • John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
  • Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 385–394. Society for Industrial and Applied Mathematics, 2005.
  • Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • Eduard Gorbunov, Adel Bibi, Ozan Sener, El Houcine Bergou, and Peter Richtárik. A stochastic derivative free optimization method with momentum. arXiv preprint arXiv:1905.13278, 2019.
  • Serge Gratton, Clément W Royer, Luís Nunes Vicente, and Zaikun Zhang. Direct search based on probabilistic descent. SIAM Journal on Optimization, 25(3):1515–1541, 2015.
  • Nikolaus Hansen, Steffen Finck, Raymond Ros, and Anne Auger. Real-Parameter Black-Box Optimization Benchmarking 2009: Noiseless Functions Definitions. Research Report RR-6829, INRIA, 2009.
  • Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter optimization: A spectral approach. arXiv preprint arXiv:1706.00764, 2017.
  • Shengqiao Li. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics, 4(1):66–70, 2011.
  • Sijia Liu, Jie Chen, Pin-Yu Chen, and Alfred O Hero. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. arXiv preprint arXiv:1710.07804, 2017.
  • Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 3727–3737, 2018.
  • Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
  • H. Brendan McMahan and Matthew J. Streeter. Adaptive bound optimization for online convex optimization. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pp. 244–256, 2010.
  • Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Technical report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2011.
  • Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pp. 506–519. ACM, 2017.
  • Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
  • John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.
  • Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18(52):1–11, 2017.
  • Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
  • Sebastian U Stich, Christian L Müller, and Bernd Gärtner. Optimization of convex functions with random pursuit. SIAM Journal on Optimization, 23(2):1284–1309, 2013.
  • Yining Wang, Simon Du, Sivaraman Balakrishnan, and Aarti Singh. Stochastic zeroth-order optimization in high dimensions. arXiv preprint arXiv:1710.10551, 2017.
  • Ziyu Wang, Masrour Zoghi, Frank Hutter, David Matheson, and Nando De Freitas. Bayesian optimization in high dimensions via random embeddings. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.