# EigenGame: PCA as a Nash Equilibrium

International Conference on Learning Representations, 2020.

Abstract:

We present a novel view on principal component analysis (PCA) as a competitive game in which each approximate eigenvector is controlled by a player whose goal is to maximize their own utility function. We analyze the properties of this PCA game and the behavior of its gradient-based updates. The resulting algorithm, which combines elements from Oja's rule with a generalized Gram-Schmidt orthogonalization, is naturally decentralized.

Introduction

- The principal components of data are the vectors that align with the directions of maximum variance.
- Recent methods for principal component analysis (PCA) focus on the latter goal, finding the top-k subspace rather than the components themselves: they explicitly state objectives to find the k-dimensional subspace that captures maximum variance (e.g., Tang (2019)), and leave the problem of rotating within this subspace to, for example, a more efficient downstream singular value decomposition (SVD) step.
- This point is subtle, yet critical.
- Sampling or sketching methods scale well but, again, focus on the top-k subspace (Sarlos, 2006; Cohen et al., 2017; Feldman et al., 2020).

Highlights

- The principal components of data are the vectors that align with the directions of maximum variance
- Any pair of orthogonal two-dimensional vectors spans all of ℝ² and therefore captures the maximum variance of any two-dimensional dataset. For these vectors to be principal components, they must, in addition, align with the directions of maximum variance, which depend on the covariance of the data.
- By learning the optimal subspace, rather than the principal components themselves, objectives focused on subspace error ignore the first purpose of principal component analysis (PCA)
- It is well known that the PCA solution for a d-dimensional dataset X ∈ ℝ^{n×d} is given by the eigenvectors of XᵀX or, equivalently, the right singular vectors of X.
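
This equivalence is easy to verify numerically. The snippet below (NumPy; the data and dimensions are our own illustrative choices, not from the paper) checks that the eigenvectors of XᵀX coincide, up to sign, with the right singular vectors of X:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # n = 100 samples, d = 4 dimensions

# Right singular vectors of X (rows of Vt), ordered by descending singular value
_, _, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvectors of X^T X, reordered by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]

# The two bases agree component-wise, up to a sign flip per vector
for i in range(4):
    assert np.allclose(np.abs(Vt[i]), np.abs(eigvecs[:, i]))
```

The sign ambiguity is inherent: if v is an eigenvector, so is −v, so only the spanned directions are compared.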

Methods

- Normalizing the penalty term ⟨v_i, v_j⟩²_M by ⟨v_j, v_j⟩_M results in the following gradient for player i: ∇_{v_i} u_i(v_i | v_{j<i}) = 2M [v_i − Σ_{j<i} (v_iᵀM v_j / v_jᵀM v_j) v_j].
- The resulting gradient with normalized penalty term has an intuitive meaning.
- It consists of a single generalized Gram-Schmidt step followed by the standard matrix product found in power iteration and Oja’s rule.
- Notice that applying the gradient as a fixed-point operator in sequence, v_i ← ∇_{v_i} u_i(v_i | v_{j<i}) followed by normalization, recovers a deflation-style procedure: each player runs power-iteration-like updates after orthogonalizing against its parents j < i.
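
Concretely, one simultaneous ascent step of this update can be sketched in NumPy. This is a full-batch illustration under our own naming (the function `eigengame_step` and the step size are assumptions), not the authors' implementation:

```python
import numpy as np

def eigengame_step(X, V, lr=0.01):
    """One gradient-ascent step of the PCA game (full-batch sketch).

    X: (n, d) data matrix; V: (d, k) matrix whose column i is player i's
    current eigenvector estimate. Players update simultaneously, each
    penalized only by its parents j < i.
    """
    M = X.T @ X / X.shape[0]  # sample covariance, shared by all players
    V_next = np.empty_like(V)
    for i in range(V.shape[1]):
        vi = V[:, i]
        # generalized Gram-Schmidt step against parents j < i
        penalty = sum((vi @ M @ V[:, j]) / (V[:, j] @ M @ V[:, j]) * (M @ V[:, j])
                      for j in range(i))
        grad = 2.0 * (M @ vi - penalty)
        # project the gradient onto the tangent space of the unit sphere,
        # then retract back to the sphere (Riemannian ascent)
        grad_r = grad - (grad @ vi) * vi
        v = vi + lr * grad_r
        V_next[:, i] = v / np.linalg.norm(v)
    return V_next
```

Iterating this step from random unit vectors drives the columns of V toward the top-k eigenvectors of XᵀX, with each player performing power-iteration-like updates after deflating against its parents.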

Conclusion

- In this work, the authors motivated PCA from the perspective of a multi-player game.
- This inspired a decentralized algorithm which enables large-scale principal component estimation.
- To demonstrate this, the authors used EigenGame to analyze a large neural network through the lens of PCA.
- To the authors' knowledge, this is the first academic analysis of its type and scale (for reference, Tang (2019) computes the top-6 PCs of the d = 2300 outputs of VGG).

Related work

- PCA is a century-old problem and a massive literature exists (Jolliffe, 2002; Golub and Van Loan, 2012). The standard solution to this problem is to compute the SVD, possibly combined with randomized algorithms to recover the top-k components as in (Halko et al., 2011), or with Frequent Directions (Ghashami et al., 2016), which combines sketching with SVD.

In neuroscience, Hebb's rule (Hebb, 2005) refers to a connectionist rule that solves for the top eigenvector of a matrix M using additive updates of a vector v as v ← v + ηMv. Likewise, Oja's rule (Oja, 1982; Shamir, 2015) refers to a similar update v ← v + η(I − vvᵀ)Mv. In machine learning, using a normalization step v ← v/‖v‖ with Hebb's rule is somewhat confusingly referred to as Oja's algorithm (Shamir, 2015), the reason being that the subtractive term in Oja's rule can be viewed as a regularization term that implicitly enforces the normalization. In the limit of infinite step size, η → ∞, Oja's algorithm effectively becomes the well-known power method. If a normalization step is added to Oja's rule, this is referred to as Krasulina's algorithm (Krasulina, 1969). In the language of Riemannian manifolds, v/‖v‖ can be recognized as a retraction and (I − vvᵀ) as projecting the gradient Mv onto the tangent space of the sphere (Absil et al., 2009).
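
These classical updates are simple to write down. The following NumPy sketch (our own function names and step sizes, for illustration only) contrasts Hebb's rule with explicit normalization, i.e. "Oja's algorithm", against Oja's rule with its implicit normalization:

```python
import numpy as np

def hebb_normalized_step(M, v, lr=0.1):
    """Hebb's rule v <- v + lr * M v, followed by explicit normalization
    ("Oja's algorithm" in the machine-learning literature)."""
    v = v + lr * (M @ v)
    return v / np.linalg.norm(v)

def oja_rule_step(M, v, lr=0.05):
    """Oja's rule: the subtractive term in (I - v v^T) M v acts as an
    implicit normalizer, keeping ||v|| close to 1 without a division."""
    return v + lr * ((np.eye(len(v)) - np.outer(v, v)) @ (M @ v))
```

Iterating either update from a unit vector drives v toward the top eigenvector of M; adding the explicit normalization step to Oja's rule yields Krasulina's algorithm.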

Study subjects and analysis

The data form a 60,000 × 784-dimensional matrix. EigenGame is competitive with Oja's algorithm in the high-batch-size regime (1,024 samples per mini-batch). The performance gap between EigenGame and the other methods shrinks as the mini-batch size is reduced (see Appendix I); this is expected, due to EigenGame's biased gradient estimates.

Reference

- P-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
- Zeyuan Allen-Zhu and Yuanzhi Li. First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2017.
- Ehsan Amid and Manfred K Warmuth. An implicit form of Krasulina’s k-PCA update without the orthonormality constraint. arXiv preprint arXiv:1909.04803, 2019.
- Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
- Anthony J Bell and Terrence J Sejnowski. The “independent components” of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
- Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konecny, Stefano Mazzocchi, H Brendan McMahan, et al. Towards federated learning at scale: system design. arXiv preprint arXiv:1902.01046, 2019.
- Nicolas Boumal, Pierre-Antoine Absil, and Coralia Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 39(1):1–33, 2019.
- Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988.
- James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
- Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
- Michael B Cohen, Cameron Musco, and Christopher Musco. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1758–1777. SIAM, 2017.
- Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. SIAM Journal on Computing, 39(1):195–259, 2009.
- Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079, 2015.
- Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA, and projective clustering. SIAM Journal on Computing, 49(3):601–657, 2020.
- Arpita Gang, Haroon Raja, and Waheed U Bajwa. Fast and communication-efficient distributed PCA. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7450–7454. IEEE, 2019.
- Mina Ghashami, Edo Liberty, Jeff M Phillips, and David P Woodruff. Frequent directions: simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016.
- Benyamin Ghojogh, Fakhri Karray, and Mark Crowley. Eigenvalue and generalized eigenvalue problems: Tutorial. arXiv preprint arXiv:1903.11240, 2019.
- Itzhak Gilboa and Eitan Zemel. Nash and correlated equilibria: some complexity considerations. Games and Economic Behavior, 1(1):80–93, 1989.
- Gene H Golub and Henk A Van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 123(1-2):35–65, 2000.
- Gene H Golub and Charles F Van Loan. Matrix Computations, volume 3. JHU press, 2012.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
- Donald Olding Hebb. The Organization of Behavior: A Neuropsychological Theory. Psychology Press, 2005.
- Christina Heinze, Brian McWilliams, and Nicolai Meinshausen. Dual-loco: distributing statistical estimation using random projections. In Artificial Intelligence and Statistics, pages 875–883, 2016.
- Christina Heinze-Deml, Brian McWilliams, and Nicolai Meinshausen. Preserving privacy between features in distributed estimation. Stat, 7(1):e189, 2018.
- Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. Beta-VAE: learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations, 2(5):6, 2017.
- Terence D Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2(6):459–473, 1989.
- Mhd Hasan Sarhan, Abouzar Eslami, Nassir Navab, and Shadi Albarqouni. Learning interpretable disentangled representations using adversarial VAEs. In Domain Adaptation and Representation Transfer and Medical Image Learning with Less Labels and Imperfect Data, pages 37–44.
