
# Tensor decompositions for learning latent variable models

Journal of Machine Learning Research, 15 (2014): 2773–2832


Abstract

This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models--including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation--which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). …


Introduction

- The method of moments is a classical parameter estimation technique (Pearson, 1894) from statistics which has proved invaluable in a number of application domains.
- The primary difficulty in learning latent variable models is that the latent state of the data is not directly observed; only observable variables correlated with the hidden state are available.
- As such, it is not evident that the method of moments should fare any better than maximum likelihood in terms of computational performance: matching the model parameters to the observed moments may involve solving computationally intractable systems of multivariate polynomial equations.
- Moreover, these decomposition problems are often amenable to simple and efficient iterative methods, such as gradient descent and the power iteration method.
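
The power iteration mentioned above generalizes to symmetric tensors. The following is a minimal NumPy sketch, not the paper's full robust algorithm; the toy tensor, weights, and iteration count are assumptions for illustration. For an orthogonally decomposable tensor T = Σᵢ wᵢ vᵢ⊗vᵢ⊗vᵢ, repeatedly applying u ← T(I, u, u)/‖T(I, u, u)‖ from a random start converges to one of the components vᵢ, with weight recovered as λ = T(u, u, u).

```python
import numpy as np

def tensor_power_iteration(T, n_iter=100, seed=0):
    """One run of the symmetric tensor power method on a k x k x k tensor T."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(T.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        u = np.einsum('ijk,j,k->i', T, u, u)  # multilinear map T(I, u, u)
        u /= np.linalg.norm(u)
    lam = np.einsum('ijk,i,j,k->', T, u, u, u)  # eigenvalue T(u, u, u)
    return lam, u

# Toy orthogonally decomposable tensor T = sum_i w_i e_i^(x)3:
w = np.array([3.0, 2.0, 1.0])
T = np.zeros((3, 3, 3))
for i in range(3):
    e = np.zeros(3)
    e[i] = 1.0
    T += w[i] * np.einsum('i,j,k->ijk', e, e, e)

lam, u = tensor_power_iteration(T)  # lam matches one of the w_i
```

Convergence is quadratic near a component, so a modest iteration budget suffices in this noiseless toy setting.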

Highlights

- In a number of cases, the method of moments leads to consistent estimators that can be computed efficiently; this is especially relevant for latent variable models, where standard maximum likelihood approaches are typically computationally prohibitive, and heuristic methods can be unreliable and difficult to validate with high-dimensional data.
- The method of moments can be viewed as complementary to the maximum likelihood approach; taking a single step of Newton-Raphson on the likelihood function, starting from the moment-based estimator (Le Cam, 1986), often yields the best of both worlds: a computationally efficient estimator that is also statistically optimal.
- We discuss some practical and application-oriented issues related to the tensor decomposition approach to learning latent variable models.
- A number of practical concerns arise when dealing with moment matrices and tensors.
- The estimators obtained via Theorem 3.1 and Theorem 3.5 (LDA) use only up to third-order moments, which suggests that each document only needs to have three words.
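
The one-step Newton-Raphson idea can be illustrated on a toy model; the model (a Gamma shape parameter with unit scale), sample size, and finite-difference digamma are assumptions for illustration, not from the paper. Start from a moment-based estimate, then take a single Newton step on the log-likelihood:

```python
import math
import numpy as np

def digamma(x, h=1e-5):
    """Numerical derivative of log-gamma (avoids a SciPy dependency)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def trigamma(x, h=1e-4):
    return (digamma(x + h) - digamma(x - h)) / (2 * h)

rng = np.random.default_rng(0)
k_true = 2.5
x = rng.gamma(shape=k_true, scale=1.0, size=50_000)

# Method of moments: for Gamma(k, scale=1), E[X] = k.
k_mom = x.mean()

# One Newton-Raphson step on the log-likelihood l(k) = (k-1)*sum(log x) - sum(x) - n*lgamma(k):
score = np.log(x).sum() - len(x) * digamma(k_mom)   # dl/dk at k_mom
fisher = len(x) * trigamma(k_mom)                   # -d^2 l/dk^2 at k_mom
k_newton = k_mom + score / fisher                   # one-step estimator
```

The moment estimate is root-n consistent, so the single Newton step lands (asymptotically) at the efficiency of the full MLE without iterating to convergence.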

Conclusion

- The authors discuss some practical and application-oriented issues related to the tensor decomposition approach to learning latent variable models.

6.1 Practical Implementation Considerations

A number of practical concerns arise when dealing with moment matrices and tensors:

- In the LDA model, the words in a document are conditionally i.i.d. given the topic h. This allows one to estimate p-th order moments using just p words per document.
- For statistically efficient estimation of the moments, however, one should use all of the words in a document, averaging over the ordered triples of words in a document of a given length.
- At first blush this seems computationally expensive, but as it turns out, the averaging can be done implicitly, as shown by Zou et al. (2013).
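
The implicit averaging can be sketched as follows; this is a hypothetical helper capturing the counting identity, not Zou et al.'s actual code. From a document's word-count vector c, the sum of e_{x_a}⊗e_{x_b}⊗e_{x_c} over ordered triples of *distinct* word positions equals c⊗c⊗c minus corrections for coincident positions, so no enumeration of triples is needed:

```python
import numpy as np

def third_moment_from_counts(c):
    """Average of e_{x_a} (x) e_{x_b} (x) e_{x_c} over ordered triples of
    distinct word positions (a, b, c), computed implicitly from the
    word-count vector c (vocabulary size d, document length ell = sum(c)).
    Cost is O(d^3) per document rather than O(ell^3) enumeration."""
    c = np.asarray(c, dtype=float)
    ell = c.sum()
    T = np.einsum('i,j,k->ijk', c, c, c)  # all ordered triples, coincident or not
    for i, ci in enumerate(c):
        # Remove triples where two of the three positions coincide...
        T[i, i, :] -= ci * c
        T[i, :, i] -= ci * c
        T[:, i, i] -= ci * c
        # ...adding back triply-coincident ones, removed thrice but counted once.
        T[i, i, i] += 2 * ci
    return T / (ell * (ell - 1) * (ell - 2))

# Toy document "0 0 1 2" over a vocabulary of size 3:
c = np.array([2, 1, 1])
T3 = third_moment_from_counts(c)  # entries sum to 1 over the 4*3*2 = 24 triples
```

On the diagonal the corrections reduce c_i^3 to the falling factorial c_i(c_i - 1)(c_i - 2), exactly the count of ordered distinct-position triples of a repeated word.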

Related Work

- The connection between tensor decompositions and latent variable models has a long history across many scientific and mathematical disciplines. We review some of the key works that are most closely related to ours.

1.2.1 Tensor Decompositions

The role of tensor decompositions in the context of latent variable models dates back to early uses in psychometrics (Cattell, 1944). These ideas later gained popularity in chemometrics, and more recently in numerous science and engineering disciplines, including neuroscience, phylogenetics, signal processing, data mining, and computer vision. A thorough survey of these techniques and applications is given by Kolda and Bader (2009). Below, we discuss a few specific connections to two applications in machine learning and statistics, independent component analysis and latent variable models (between which there is also significant overlap).

Tensor decompositions have been used in signal processing and computational neuroscience for blind source separation and independent component analysis (ICA) (Comon and Jutten, 2010). Here, statistically independent non-Gaussian sources are linearly mixed in the observed signal, and the goal is to recover the mixing matrix (and ultimately, the original source signals). A typical solution is to locate projections of the observed signals that correspond to local extrema of the so-called “contrast functions” which distinguish Gaussian variables from non-Gaussian variables. This method can be effectively implemented using fast descent algorithms (Hyvarinen, 1999). When using the excess kurtosis (i.e., fourth-order cumulant) as the contrast function, this method reduces to a generalization of the power method for symmetric tensors (Lathauwer et al., 2000; Zhang and Golub, 2001; Kofidis and Regalia, 2002). This case is particularly important, since all local extrema of the kurtosis objective correspond to the true sources (under the assumed statistical model) (Delfosse and Loubaton, 1995); the descent methods can therefore be rigorously analyzed, and their computational and statistical complexity can be bounded (Frieze et al., 1996; Nguyen and Regev, 2009; Arora et al., 2012b).
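
A minimal sketch of the kurtosis-based fixed-point iteration on whitened data follows; the orthogonal mixing, uniform sources, and sample size are toy assumptions for illustration, not the cited algorithms themselves. Subtracting 3w from the empirical E[(wᵀx)³ x] implements a power step on the fourth-order cumulant, and the iterate aligns with a column of the mixing matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100_000, 2

# Unit-variance uniform sources (negative excess kurtosis, hence non-Gaussian):
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, d))

theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # orthogonal mixing (data already white)
X = S @ A.T

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for _ in range(50):
    # Power step on the excess-kurtosis contrast: E[(w.x)^3 x] - 3w.
    # In source coordinates u = A^T w this acts as u_i <- kappa_i * u_i^3,
    # so the dominant coordinate wins and w converges to +/- a column of A.
    w_new = (X @ w) ** 3 @ X / n - 3 * w
    w = w_new / np.linalg.norm(w_new)

alignment = np.max(np.abs(A.T @ w))  # close to 1 when a source direction is found
```

Deflation (projecting out the recovered direction and repeating) would recover the remaining columns, as in the deflation approach of Delfosse and Loubaton (1995).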

Funding

- AA is supported in part by NSF Award CCF-1219234, AFOSR Award FA9550-10-1-0310, and ARO Award W911NF-12-1-0404.

References

- D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In Eighteenth Annual Conference on Learning Theory, pages 458–469, 2005.
- E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37(6A):3099–3132, 2009.
- A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y.-K. Liu. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 25, 2012a.
- A. Anandkumar, D. Hsu, F. Huang, and S. M. Kakade. Learning mixtures of tree graphical models. In Advances in Neural Information Processing Systems 25, 2012b.
- A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In Twenty-Fifth Annual Conference on Learning Theory, volume 23, pages 33.1–33.34, 2012c.
- J. Anderson, M. Belkin, N. Goyal, L. Rademacher, and J. Voss. The more, the merrier: the blessing of dimensionality for learning large Gaussian mixtures. In Twenty-Seventh Annual Conference on Learning Theory, 2014.
- S. Arora and R. Kannan. Learning mixtures of separated nonspherical Gaussians. The Annals of Applied Probability, 15(1A):69–92, 2005.
- S. Arora, R. Ge, and A. Moitra. Learning topic models — going beyond SVD. In Fifty-Third IEEE Annual Symposium on Foundations of Computer Science, pages 1–10, 2012a.
- S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, and implications for Gaussian mixtures and autoencoders. In Advances in Neural Information Processing Systems 25, 2012b.
- T. Austin. On exchangeable random variables and the statistics of large graphs and hypergraphs. Probab. Survey, 5:80–145, 2008.
- R. Bailly. Quadratic weighted automata: Spectral algorithm and likelihood maximization. Journal of Machine Learning Research, 2011.
- B. Balle and M. Mohri. Spectral learning of general weighted automata via constrained matrix completion. In Advances in Neural Information Processing Systems 25, 2012.
- B. Balle, A. Quattoni, and X. Carreras. Local loss optimization in operator models: A new insight into spectral learning. In Twenty-Ninth International Conference on Machine Learning, 2012.
- M. Belkin and K. Sinha. Polynomial learning of distribution families. In Fifty-First Annual IEEE Symposium on Foundations of Computer Science, pages 103–112, 2010.
- A. Bhaskara, M. Charikar, A. Moitra, and A. Vijayaraghavan. Smoothed analysis of tensor decompositions. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, 2014.
- B. Boots, S. M. Siddiqi, and G. J. Gordon. Closing the learning-planning loop with predictive state representations. In Proceedings of the Robotics Science and Systems Conference, 2010.
- S. C. Brubaker and S. Vempala. Isotropic PCA and affine-invariant clustering. In Forty-Ninth Annual IEEE Symposium on Foundations of Computer Science, 2008.
- A. Bunse-Gerstner, R. Byers, and V. Mehrmann. Numerical methods for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 14(4):927–949, 1993.
- J.-F. Cardoso. Super-symmetric decomposition of the fourth-order cumulant tensor. Blind identification of more sources than sensors. In Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., 1991 International Conference on, pages 3109–3112. IEEE, 1991.
- J.-F. Cardoso. Perturbation of joint diagonalizers. Technical Report 94D027, Signal Department, Telecom Paris, 1994.
- J.-F. Cardoso and P. Comon. Independent component analysis, a survey of some algebraic methods. In IEEE International Symposium on Circuits and Systems, pages 93–96, 1996.
- J.-F. Cardoso and A. Souloumiac. Blind beamforming for non Gaussian signals. IEE Proceedings-F, 140(6):362–370, 1993.
- D. Cartwright and B. Sturmfels. The number of eigenvalues of a tensor. Linear Algebra Appl., 438(2):942–952, 2013.
- R. B. Cattell. Parallel proportional profiles and other principles for determining the choice of factors by rotation. Psychometrika, 9(4):267–283, 1944.
- J. T. Chang. Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Mathematical Biosciences, 137:51–73, 1996.
- K. Chaudhuri and S. Rao. Learning mixtures of product distributions using correlations and independence. In Twenty-First Annual Conference on Learning Theory, pages 9–20, 2008.
- S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. Spectral learning of latent-variable PCFGs. In Fiftieth Annual Meeting of the Association for Computational Linguistics, 2012.
- P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3): 287–314, 1994.
- P. Comon and C. Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press. Elsevier, 2010.
- P. Comon, G. Golub, L.-H. Lim, and B. Mourrain. Symmetric tensors and symmetric tensor rank. SIAM Journal on Matrix Analysis Appl., 30(3):1254–1279, 2008.
- R. M. Corless, P. M. Gianni, and B. M. Trager. A reordered Schur factorization method for zero-dimensional polynomial systems with multiple roots. In Proceedings of the 1997 International Symposium on Symbolic and Algebraic Computation, pages 133–140. ACM, 1997.
- S. Dasgupta. Learning mixtures of Gaussians. In Fortieth Annual IEEE Symposium on Foundations of Computer Science, pages 634–644, 1999.
- S. Dasgupta and L. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research, 8(Feb):203–226, 2007.
- L. De Lathauwer, J. Castaing, and J.-F. Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. Signal Processing, IEEE Transactions on, 55(6):2965–2973, 2007.
- N. Delfosse and P. Loubaton. Adaptive blind separation of independent sources: a deflation approach. Signal processing, 45(1):59–83, 1995.
- A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B, 39:1–38, 1977.
- P. Dhillon, J. Rodu, M. Collins, D. P. Foster, and L. Ungar. Spectral dependency parsing with latent variables. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012.
- M. Drton, B. Sturmfels, and S. Sullivant. Algebraic factor analysis: tetrads, pentads and beyond. Probability Theory and Related Fields, 138(3):463–493, 2007.
- A. T. Erdogan. On the convergence of ICA algorithms with symmetric orthogonalization. IEEE Transactions on Signal Processing, 57:2209–2221, 2009.
- A. M. Frieze, M. Jerrum, and R. Kannan. Learning linear transformations. In Thirty-Seventh Annual Symposium on Foundations of Computer Science, pages 359–368, 1996.
- G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
- N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 2011.
- R. Harshman. Foundations of the PARAFAC procedure: model and conditions for an ‘explanatory’ multi-mode factor analysis. Technical report, UCLA Working Papers in Phonetics, 1970.
- C. J. Hillar and L.-H. Lim. Most tensor problems are NP-hard. J. ACM, 60(6):45:1–45:39, November 2013. ISSN 0004-5411. doi: 10.1145/2512329.
- F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6:164–189, 1927a.
- F. L. Hitchcock. Multiple invariants and generalized rank of a p-way matrix or tensor. Journal of Mathematics and Physics, 7:39–79, 1927b.
- D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Fourth Innovations in Theoretical Computer Science, 2013.
- D. Hsu, S. M. Kakade, and P. Liang. Identifiability and unmixing of latent parse trees. In Advances in Neural Information Processing Systems 25, 2012a.
- D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012b.
- A. Hyvarinen. Fast and robust fixed-point algorithms for independent component analysis. Neural Networks, IEEE Transactions on, 10(3):626–634, 1999.
- A. Hyvarinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4–5):411–430, 2000.
- H. Jaeger. Observable operator models for discrete stochastic time series. Neural Comput., 12(6), 2000.
- A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In Forty-second ACM Symposium on Theory of Computing, pages 553–562, 2010.
- R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. SIAM Journal on Computing, 38(3):1141–1156, 2008.
- E. Kofidis and P. A. Regalia. On the best rank-1 approximation of higher-order supersymmetric tensors. SIAM Journal on Matrix Analysis and Applications, 23(3):863–884, 2002.
- T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
- T. G. Kolda and J. R. Mayo. Shifted power method for computing tensor eigenpairs. SIAM Journal on Matrix Analysis and Applications, 32(4):1095–1124, October 2011.
- J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and Appl., 18(2): 95–138, 1977.
- L. D. Lathauwer, B. D. Moor, and J. Vandewalle. On the best rank-1 and rank(R1, R2,..., Rn) approximation and applications of higher-order tensors. SIAM J. Matrix Anal. Appl., 21(4):1324–1342, 2000.
- L. Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer, 1986.
- L.-H. Lim. Singular values and eigenvalues of tensors: a variational approach. Proceedings of the IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, 1:129–132, 2005.
- M. Littman, R. Sutton, and S. Singh. Predictive representations of state. In Advances in Neural Information Processing Systems 14, pages 1555–1561, 2001.
- F. M. Luque, A. Quattoni, B. Balle, and X. Carreras. Spectral learning for non-deterministic dependency parsing. In Conference of the European Chapter of the Association for Computational Linguistics, 2012.
- J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.
- P. McCullagh. Tensor Methods in Statistics. Chapman and Hall, 1987.
- A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Fifty-First Annual IEEE Symposium on Foundations of Computer Science, pages 93–102, 2010.
- E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. Annals of Applied Probability, 16(2):583–614, 2006.
- P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. Journal of Cryptology, 22(2):139–160, 2009.
- J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.
- P. V. Overschee and B. D. Moor. Subspace Identification of Linear Systems. Kluwer Academic Publishers, 1996.
- L. Pachter and B. Sturmfels. Algebraic Statistics for Computational Biology, volume 13. Cambridge University Press, 2005.
- A. Parikh, L. Song, and E. P. Xing. A spectral algorithm for latent tree graphical models. In Twenty-Eighth International Conference on Machine Learning, 2011.
- K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society, London, A., page 71, 1894.
- L. Qi. Eigenvalues of a real supersymmetric tensor. Journal of Symbolic Computation, 40 (6):1302–1324, 2005.
- R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.
- P. A. Regalia and E. Kofidis. Monotonic convergence of fixed-point algorithms for ICA. IEEE Transactions on Neural Networks, 14:943–949, 2003.
- S. Roch. A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 3(1), 2006.
- J. Rodu, D. P. Foster, W. Wu, and L. H. Ungar. Using regression for spectral estimation of HMMs. In Statistical Language and Speech Processing, pages 212–223, 2013.
- M. P. Schutzenberger. On the definition of a family of automata. Inf. Control, 4:245–270, 1961.
- S. M. Siddiqi, B. Boots, and G. J. Gordon. Reduced-rank hidden Markov models. In Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
- D. A. Spielman and S. H. Teng. Smoothed analysis: An attempt to explain the behavior of algorithms in practice. Communications of the ACM, pages 76–84, 2009.
- A. Stegeman and P. Comon. Subtracting a best rank-1 approximation may increase tensor rank. Linear Algebra and Its Applications, 433:1276–1300, 2010.
- B. Sturmfels and P. Zwiernik. Binary cumulant varieties. Ann. Comb., (17):229–250, 2013.
- S. Vempala and G. Wang. A spectral algorithm for learning mixtures models. Journal of Computer and System Sciences, 68(4):841–860, 2004.
- P. Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972.
- T. Zhang and G. Golub. Rank-one approximation to high order tensors. SIAM Journal on Matrix Analysis and Applications, 23:534–550, 2001.
- A. Ziehe, P. Laskov, G. Nolte, and K. R. Muller. A fast algorithm for joint diagonalization with non-orthogonal transformations and its application to blind source separation. Journal of Machine Learning Research, 5:777–800, 2004.
- J. Zou, D. Hsu, D. Parkes, and R. P. Adams. Contrastive learning using spectral methods. In Advances in Neural Information Processing Systems 26, 2013.
