EigenGame: PCA as a Nash Equilibrium

Brian McWilliams
Claire Vernade

International Conference on Learning Representations, 2020.

TL;DR: We formulate the solution to PCA as the Nash equilibrium of a suitable game, with an accompanying algorithm that we demonstrate on a 200TB dataset.

Abstract:

We present a novel view on principal component analysis (PCA) as a competitive game in which each approximate eigenvector is controlled by a player whose goal is to maximize their own utility function. We analyze the properties of this PCA game and the behavior of its gradient-based updates. The resulting algorithm, which combines elements of Oja's rule with a generalized Gram-Schmidt step, is naturally decentralized and scales to large datasets.
Introduction
  • The principal components of data are the vectors that align with the directions of maximum variance.
  • Recent methods for principal component analysis (PCA) focus on the latter, explicitly stating objectives to find the k-dimensional subspace that captures maximum variance (e.g., (Tang, 2019)), and leaving the problem of rotating within this subspace to, for example, a more efficient downstream singular value decomposition (SVD) step.
  • This point is subtle, yet critical; a small numerical sketch at the end of this list illustrates it.
  • Sampling or sketching methods scale well but, again, focus on the top-k subspace (Sarlos, 2006; Cohen et al., 2017; Feldman et al., 2020).
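To make the subspace-versus-components distinction concrete, here is a minimal NumPy sketch on synthetic data (the data, dimensions, and variable names are illustrative assumptions, not taken from the paper): any rotation of the top-2 principal subspace captures exactly the same variance, yet the rotated basis vectors are no longer the principal components.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
X -= X.mean(axis=0)

# True top-2 principal components: the top right singular vectors of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:2].T                      # columns are the top-2 principal components

# Rotate by 45 degrees within the spanned subspace.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
W = V @ R                         # orthonormal basis of the same 2-D subspace

def captured_variance(B):
    # Total variance of the data projected onto the columns of B.
    return np.trace(B.T @ (X.T @ X) @ B)

print(np.isclose(captured_variance(V), captured_variance(W)))   # True: same subspace variance
print(np.allclose(np.abs(V.T @ W), np.eye(2)))                   # False: W is not the set of PCs
```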
Highlights
  • The principal components of data are the vectors that align with the directions of maximum variance
  • Any pair of two-dimensional, orthogonal vectors spans all of ℝ² and hence captures maximum variance of any two-dimensional dataset. For these vectors to be principal components, they must, in addition, align with the directions of maximum variance, which depend on the covariance of the data
  • By learning the optimal subspace, rather than the principal components themselves, objectives focused on subspace error ignore the first purpose of principal component analysis (PCA)
  • It is well known that the PCA solution of the d-dimensional dataset X ∈ ℝ^{n×d} is given by the eigenvectors of X⊤X or, equivalently, the right singular vectors of X
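This fact can be checked directly in a few lines of NumPy; the synthetic data below is an illustrative assumption, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))                     # n x d data matrix
M = X.T @ X                                       # d x d Gram / scatter matrix

eigvals, eigvecs = np.linalg.eigh(M)              # eigenvalues in ascending order
eigvecs = eigvecs[:, ::-1]                        # reorder to descending
_, _, Vt = np.linalg.svd(X, full_matrices=False)  # right singular vectors, descending

# The two bases agree column-wise, up to a sign flip per column.
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))
```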
Methods
  • Normalizing the penalty term by ⟨v_j, M v_j⟩ results in the following gradient for player i: ∇_{v_i} u_i(v_i | v_{j<i}) = 2M[v_i − Σ_{j<i} (⟨v_i, M v_j⟩ / ⟨v_j, M v_j⟩) v_j].
  • The resulting gradient with normalized penalty term has an intuitive meaning.
  • It consists of a single generalized Gram-Schmidt step followed by the standard matrix product found in power iteration and Oja’s rule.
  • Notice that the gradient can be applied as a fixed-point operator in sequence, v_i ← ∇_{v_i} u_i(v_i | v_{j<i}).
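Reading the update above as gradient ascent on each player's utility (the utility u_i = ⟨v_i, M v_i⟩ − Σ_{j<i} ⟨v_i, M v_j⟩² / ⟨v_j, M v_j⟩ implied by this gradient), a minimal sequential NumPy sketch looks as follows. The step size, iteration count, and Riemannian projection/retraction details are illustrative assumptions; the paper's decentralized, parallel variant is not reproduced here.

```python
import numpy as np

def eigengame_sketch(M, k, steps=2000, lr=0.1, seed=0):
    """Sequential sketch: each player i ascends its utility gradient on the sphere."""
    d = M.shape[0]
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(d, k))
    V /= np.linalg.norm(V, axis=0)                 # keep every player on the unit sphere
    for _ in range(steps):
        for i in range(k):
            vi = V[:, i]
            # Generalized Gram-Schmidt-style penalty against parents j < i.
            penalty = np.zeros(d)
            for j in range(i):
                vj = V[:, j]
                penalty += (vi @ M @ vj) / (vj @ M @ vj) * (M @ vj)
            grad = 2.0 * (M @ vi - penalty)
            # Project onto the tangent space of the sphere, step, then renormalize.
            grad -= (grad @ vi) * vi
            vi = vi + lr * grad
            V[:, i] = vi / np.linalg.norm(vi)
    return V

# Usage: recover the top-3 eigenvectors of a synthetic covariance matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8)) @ np.diag(np.linspace(2.0, 0.5, 8))
M = X.T @ X / X.shape[0]
V = eigengame_sketch(M, k=3)
top3 = np.linalg.eigh(M)[1][:, ::-1][:, :3]
print(np.round(np.abs(V.T @ top3), 3))             # ≈ identity when converged
```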
Conclusion
  • In this work, the authors motivated PCA from the perspective of a multi-player game.
  • This inspired a decentralized algorithm which enables large-scale principal component estimation.
  • To demonstrate this, the authors used EigenGame to analyze a large neural network through the lens of PCA.
  • To their knowledge, this is the first academic analysis of its type and scale (for reference, (Tang, 2019) computes the top-6 PCs of the d = 2300 outputs of VGG).
Related work
  • PCA is a century-old problem and a massive literature exists (Jolliffe, 2002; Golub and Van Loan, 2012). The standard solution is to compute the SVD, possibly combined with randomized algorithms to recover the top-k components as in (Halko et al., 2011), or with Frequent Directions (Ghashami et al., 2016), which combines sketching with SVD.

    In neuroscience, Hebb's rule (Hebb, 2005) refers to a connectionist rule that solves for the top eigenvector of a matrix M using additive updates of a vector v as v ← v + ηMv. Likewise, Oja's rule (Oja, 1982; Shamir, 2015) refers to a similar update v ← v + η(I − vv⊤)Mv. In machine learning, using a normalization step of v ← v/‖v‖ with Hebb's rule is somewhat confusingly referred to as Oja's algorithm (Shamir, 2015), the reason being that the subtractive term in Oja's rule can be viewed as a regularization term for implicitly enforcing the normalization. In the limit of infinite step size, η → ∞, Oja's algorithm effectively becomes the well-known power method. If a normalization step is added to Oja's rule, this is referred to as Krasulina's algorithm (Krasulina, 1969). In the language of Riemannian manifolds, v/‖v‖ can be recognized as a retraction and (I − vv⊤) as projecting the gradient Mv onto the tangent space of the sphere (Absil et al., 2009).
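For concreteness, the update rules named in this paragraph can be written out in a few lines of NumPy; the synthetic data, step sizes, and iteration counts below are illustrative choices, not taken from any of the cited papers.

```python
import numpy as np

def hebb_step(v, M, eta):
    # Hebb's rule: purely additive update.
    return v + eta * (M @ v)

def oja_rule_step(v, M, eta):
    # Oja's rule: the subtractive term implicitly keeps ||v|| near 1.
    return v + eta * ((np.eye(len(v)) - np.outer(v, v)) @ (M @ v))

def oja_algorithm_step(v, M, eta):
    # "Oja's algorithm" in the ML sense: Hebb's rule plus explicit normalization.
    w = v + eta * (M @ v)
    return w / np.linalg.norm(w)

def power_step(v, M):
    # Limit of Oja's algorithm as eta -> infinity: the power method.
    w = M @ v
    return w / np.linalg.norm(w)

def krasulina_step(v, M, eta):
    # Krasulina's algorithm, as characterized above: Oja's rule plus normalization.
    w = oja_rule_step(v, M, eta)
    return w / np.linalg.norm(w)

# Quick check: iterate one of the rules on a synthetic covariance matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
M = X.T @ X / X.shape[0]
top = np.linalg.eigh(M)[1][:, -1]          # true top eigenvector

v = rng.normal(size=5)
v /= np.linalg.norm(v)
for _ in range(500):
    v = oja_algorithm_step(v, M, eta=0.1)
print(abs(v @ top))                        # ≈ 1 once converged
```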
Study subjects and analysis
samples per mini-batch: 1024
The data form a 60,000 × 784-dimensional matrix. EigenGame is competitive with Oja's algorithm in the high batch-size regime (1024 samples per mini-batch). The performance gap between EigenGame and the other methods shrinks as the mini-batch size is reduced (see Appendix I), as expected due to biased gradients.
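A brief sketch of the mini-batch regime referred to above, assuming the usual estimate of Mv from a batch X_b as X_b⊤(X_b v)/b; the dimensions are scaled down and the pipeline is illustrative, not the paper's implementation.

```python
import numpy as np

def minibatch_matvec(Xb, v):
    # Unbiased estimate of (X^T X / n) v from a mini-batch Xb of shape (b, d).
    return Xb.T @ (Xb @ v) / Xb.shape[0]

rng = np.random.default_rng(0)
n, d, b = 10_000, 128, 1024
X = rng.normal(size=(n, d)) @ np.diag(np.linspace(2.0, 0.5, d))

v = rng.normal(size=d)
v /= np.linalg.norm(v)
for _ in range(200):                       # streaming, power-method-style sweep
    idx = rng.integers(0, n, size=b)
    w = minibatch_matvec(X[idx], v)
    v = w / np.linalg.norm(w)

# Note: each estimate of M v above is unbiased, but the EigenGame penalty uses
# ratios like <v_i, M v_j> / <v_j, M v_j>; plugging a single mini-batch into both
# numerator and denominator yields a biased estimate, which is the effect the
# biased-gradient remark above refers to at small batch sizes.
```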

Reference
  • P-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
  • Zeyuan Allen-Zhu and Yuanzhi Li. First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 487–492. IEEE, 2017.
  • Ehsan Amid and Manfred K Warmuth. An implicit form of Krasulina’s k-PCA update without the orthonormality constraint. arXiv preprint arXiv:1909.04803, 2019.
  • Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
  • Anthony J Bell and Terrence J Sejnowski. The “independent components” of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
  • Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konecny, Stefano Mazzocchi, H Brendan McMahan, et al. Towards federated learning at scale: system design. arXiv preprint arXiv:1902.01046, 2019.
  • Nicolas Boumal, Pierre-Antoine Absil, and Coralia Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 39(1):1–33, 2019.
  • Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988.
  • James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
  • Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
  • Michael B Cohen, Cameron Musco, and Christopher Musco. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1758–1777. SIAM, 2017.
  • Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. SIAM Journal on Computing, 39(1):195–259, 2009.
  • Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079, 2015.
  • Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA, and projective clustering. SIAM Journal on Computing, 49(3):601–657, 2020.
  • Arpita Gang, Haroon Raja, and Waheed U Bajwa. Fast and communication-efficient distributed pca. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7450–7454. IEEE, 2019.
  • Mina Ghashami, Edo Liberty, Jeff M Phillips, and David P Woodruff. Frequent directions: simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016.
  • Benyamin Ghojogh, Fakhri Karray, and Mark Crowley. Eigenvalue and generalized eigenvalue problems: Tutorial. arXiv preprint arXiv:1903.11240, 2019.
  • Itzhak Gilboa and Eitan Zemel. Nash and correlated equilibria: some complexity considerations. Games and Economic Behavior, 1(1):80–93, 1989.
  • Gene H Golub and Henk A Van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 123(1-2):35–65, 2000.
  • Gene H Golub and Charles F Van Loan. Matrix Computations, volume 3. JHU press, 2012.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
  • Donald Olding Hebb. The Organization of Behavior: A Neuropsychological Theory. Psychology Press, 2005.
  • Christina Heinze, Brian McWilliams, and Nicolai Meinshausen. Dual-loco: distributing statistical estimation using random projections. In Artificial Intelligence and Statistics, pages 875–883, 2016.
  • Christina Heinze-Deml, Brian McWilliams, and Nicolai Meinshausen. Preserving privacy between features in distributed estimation. Stat, 7(1):e189, 2018.
  • Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. Beta-VAE: learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations, 2(5):6, 2017.
  • Terence D Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2(6):459–473, 1989.
  • Mhd Hasan Sarhan, Abouzar Eslami, Nassir Navab, and Shadi Albarqouni. Learning interpretable disentangled representations using adversarial VAEs. In Domain Adaptation and Representation Transfer and Medical Image Learning with Less Labels and Imperfect Data, pages 37–44.