Learning VAE LDA Models with Rounded Reparameterization Trick

EMNLP 2020, pp. 1315–1325.


Abstract

The introduction of VAE provides an efficient framework for the learning of generative models, including generative topic models. However, when the topic model is a Latent Dirichlet Allocation (LDA) model, a central technique of VAE, the reparameterization trick, fails to be applicable. This is because no reparameterization form of Dirich…

Introduction
  • Probabilistic generative models are widely used in topic modelling and have achieved great success in many applications (Deerwester et al., 1990; Hofmann, 1999; Blei et al., 2003; Blei and Lafferty, 2006).
  • A landmark of topic models is Latent Dirichlet Allocation (LDA) (Blei et al., 2003), where a document is treated as a bag of words and each word is modelled via a generative process.
  • In this generative process, a topic distribution is first drawn from a Dirichlet prior, a topic is sampled from that distribution, and a word is then drawn from the word distribution corresponding to the sampled topic.
  • The reparameterization trick allows such a model to be trained efficiently using backpropagation.
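The reparameterization trick mentioned above is easiest to see in the Gaussian case used by standard VAEs (Kingma and Welling, 2013): a sample is rewritten as a deterministic function of the variational parameters plus parameter-free noise, so gradients can flow through the parameters. A minimal NumPy sketch of that standard trick (illustrative background only, not the paper's RRT, which targets Dirichlet distributions):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterized_gaussian_sample(mu, log_var):
    """Draw z ~ N(mu, sigma^2) as mu + sigma * eps, where eps is
    parameter-free noise; gradients can then flow through mu and log_var."""
    eps = rng.standard_normal(mu.shape)      # noise independent of parameters
    return mu + np.exp(0.5 * log_var) * eps  # z = mu + sigma * eps

mu = np.zeros(4)
log_var = np.zeros(4)  # sigma = 1
z = reparameterized_gaussian_sample(mu, log_var)
assert z.shape == (4,)
```

The paper's point is that no such closed-form deterministic transform exists for Dirichlet samples, which motivates the rounded variant.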
Highlights
  • When ∆ = 1, the training loss remains at a high value and the corresponding recovery accuracy is lower than 60%, indicating that the Rounded Reparameterization Trick Variational AutoEncoder (RRT-VAE) fails to fit the true data distribution.
  • The applicability of RRT can be generalized beyond Dirichlet distributions.
  • It will be interesting to investigate the performance of RRT in other applications of VAE beyond topic modelling.
Results
  • To quantitatively evaluate RRT-VAE, the authors conduct experiments on synthetic datasets and five real-world datasets.
  • Figure 3 reports how different ∆ settings influence the recovery accuracy of RRT-VAE on three synthetic datasets.
  • When ∆ = 1, the training loss remains at a high value and the corresponding recovery accuracy is lower than 60%, indicating that RRT-VAE fails to fit the true data distribution.
  • When ∆ = 10⁻¹⁰, RRT-VAE fits the data well: the training loss drops rapidly and converges to a much lower value, and the resulting recovery accuracy reaches up to 90%.
Conclusion
  • Concluding Remarks

    In this paper, the rounded reparameterization trick (RRT) is shown to be an effective and efficient reparameterization method for Dirichlet distributions in the context of learning VAE-based LDA models.
  • The applicability of RRT can be generalized beyond Dirichlet distributions, because any distribution can be reparameterized into an “RRT form” as long as a sampling algorithm exists for that distribution.
  • It will be interesting to investigate the performance of RRT in other applications of VAE beyond topic modelling.
  • Successes in these investigations would extend the applicability of VAE to much broader application domains and model families.
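The "sampling algorithm exists" condition is concrete for Dirichlet distributions: a classical sampler draws independent Gamma variates and normalizes them onto the simplex. A minimal sketch of that standard algorithm (illustrative background for the condition above, not the RRT itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dirichlet_via_gammas(alpha):
    """Standard Dirichlet sampler: draw independent Gamma(alpha_k, 1)
    variates and normalize them so they sum to 1."""
    g = rng.gamma(shape=alpha, scale=1.0)
    return g / g.sum()

# A 30-dimensional sample with a sparse prior, as in the paper's synthetic setup.
theta = sample_dirichlet_via_gammas(np.full(30, 0.1))
assert abs(theta.sum() - 1.0) < 1e-9
```

The difficulty the paper addresses is that this Gamma-based sampler has no simple parameter-free noise decomposition, so the plain reparameterization trick does not apply directly.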
Summary
  • Objectives:

    The authors' goal is to use the topic models to recover the ground-truth topic-word matrix.
Tables
  • Table1: Summary of different datasets
  • Table2: Evaluation results on RRT-VAE with different prior settings. Perplexity: lower is better; NPMI: higher is better; Sparsity: higher means sparser
  • Table3: Evaluation results of RRT-VAE with different λ settings
  • Table4: Optimal λ settings of RRT-VAE for different datasets
  • Table5: Perplexity/NPMI of the compared topic models on five datasets. The number of topics is set to 50
  • Table6: Perplexity/NPMI of the compared topic models on five datasets. The number of topics is set to 200
  • Table7: Topic words extracted from the Yelp dataset. From top to bottom, each cell is extracted by NVDM, ProdLDA, DirVAE and RRT-VAE. More examples are exhibited in Appendix B.3
  • Table8: Recovery accuracy of four topic models on synthetic datasets generated by three different αg settings. For RRT-VAE, λ is set to 1 and ∆ is set to 10⁻¹⁰
  • Table9: Topic words recovery accuracy of three neural topic models on synthetic datasets generated with three different αg settings. The models adopt the same prod decoder structure. For RRT-VAE, λ is set to 1 and ∆ is set to 10⁻¹⁰
  • Table10: Left: the ground truth topic word matrix Tg; Right: a matrix TL learned by RRT-VAE. Note that the rows of TL are arbitrarily ordered. For example, the first and second rows of Tg individually correspond to the 11th and 14th rows of TL (as shown in bold)
  • Table11: The standard decoder appears to extract many repetitive words on 20NG
  • Table12: The experimental results of Online LDA on the 20NG dataset
  • Table13: Topic words extracted by RRT-VAE from four different datasets. From top to bottom, each cell is extracted from 20NG, AGNews, RCV1-v2 and DBpedia
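Because the rows of the learned matrix TL are arbitrarily ordered (see the Table 10 note), comparing it with the ground truth Tg first requires aligning rows. The paper's exact matching procedure is not specified in this excerpt; the sketch below is a hypothetical greedy alignment by cosine similarity:

```python
import numpy as np

def match_topics(T_g, T_L):
    """Greedily pair each ground-truth topic row with its most similar
    unused learned row (cosine similarity). Illustrative only; the
    paper's own matching procedure may differ."""
    def normalize(M):
        return M / np.linalg.norm(M, axis=1, keepdims=True)
    sim = normalize(T_g) @ normalize(T_L).T   # pairwise cosine similarities
    assignment, used = [], set()
    for i in np.argsort(-sim.max(axis=1)):    # most confident rows first
        j = max((j for j in range(T_L.shape[0]) if j not in used),
                key=lambda j: sim[i, j])
        assignment.append((int(i), int(j)))
        used.add(j)
    return sorted(assignment)

rng = np.random.default_rng(0)
T_g = rng.dirichlet(np.ones(6), size=3)   # 3 toy topics over 6 words
T_L = T_g[[2, 0, 1]]                      # "learned" = permuted ground truth
pairs = match_topics(T_g, T_L)
assert pairs == [(0, 1), (1, 2), (2, 0)]  # permutation recovered
```

For an optimal rather than greedy assignment, the Hungarian algorithm (e.g. SciPy's `linear_sum_assignment`) could be substituted.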
Study subjects and analysis
real-world datasets: 5
5 Experiments and Results. To quantitatively evaluate RRT-VAE, we conduct experiments on synthetic datasets and five real-world datasets. Our model is compared with several existing topic models: Online LDA (Hoffman et al., 2010), NVDM (Miao et al., 2016), ProdLDA (Srivastava and Sutton, 2017) and DirVAE (Joo et al., 2019)

synthetic datasets: 3
Synthetic datasets. We construct three synthetic datasets based on the LDA generative process: a 30 × 500 topic-word probability matrix βg is generated as the ground truth; each dataset is then generated based on βg using different Dirichlet priors αg ·1 ∈ R30, where 1 denotes the all-one vector. We set αg to [0.01, 0.05, 0.1] for the three datasets and the vocabulary size to 500

datasets: 3
We construct three synthetic datasets based on the LDA generative process: a 30 × 500 topic-word probability matrix βg is generated as the ground truth; each dataset is then generated based on βg using different Dirichlet priors αg ·1 ∈ R30, where 1 denotes the all-one vector. We set αg to [0.01, 0.05, 0.1] for the three datasets and the vocabulary size to 500. Each dataset has 20000 training examples
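The generative recipe above can be sketched in NumPy. The per-document length is not stated in this excerpt, so `DOC_LEN` below is an assumed value; the topic count (30), vocabulary size (500), prior settings, and corpus size (20000) follow the text:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N_DOCS, DOC_LEN = 30, 500, 20000, 100  # DOC_LEN is an assumption

# Ground-truth topic-word matrix beta_g: K rows, each a distribution over V words.
beta_g = rng.dirichlet(np.ones(V), size=K)

def generate_corpus(alpha_g, n_docs=N_DOCS, doc_len=DOC_LEN):
    """Generate bag-of-words documents via the LDA generative process."""
    docs = np.zeros((n_docs, V), dtype=int)
    for d in range(n_docs):
        theta = rng.dirichlet(np.full(K, alpha_g))     # per-document topic mix
        topics = rng.choice(K, size=doc_len, p=theta)  # a topic per word slot
        for z in topics:
            w = rng.choice(V, p=beta_g[z])             # a word from topic z
            docs[d, w] += 1
    return docs

# Small demonstration run; the paper uses alpha_g in [0.01, 0.05, 0.1].
corpus = generate_corpus(alpha_g=0.01, n_docs=5, doc_len=50)
assert corpus.shape == (5, V) and corpus.sum() == 5 * 50
```

Each of the three datasets in the paper corresponds to one `alpha_g` value, with the recovery task being to learn `beta_g` back from `corpus`.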

real-world datasets: 5
Real-world datasets. We use five real-world datasets in our experiments: 20NG, RCV1-v2, AGNews, DBPedia (Lehmann et al., 2015), and Yelp review polarity (Zhang et al., 2015). The 20NG and RCV1-v2 datasets are the same as in (Miao et al., 2016)

datasets: 3
The 20NG and RCV1-v2 datasets are the same as in (Miao et al., 2016). The other three datasets are preprocessed through tokenizing, stemming, lemmatizing and the removal of stop words. We keep the most frequent 2000 words in DBPedia and Yelp

documents: 15
We keep the most frequent 2000 words in DBPedia and Yelp. For AGNews, we keep the words which are contained in no more than half the documents and are contained in at least 15 documents. The statistics of the cleaned datasets are summarized in Table 1

training samples: 1000
The sparsity of θ in turn makes it easier for the model to assign a very small probability to some existing words in a document and thus increases the training loss and perplexity. To verify this conjecture, we construct a simple method to measure sparsity: after training, we randomly feed 1000 training samples into the encoder network and obtain 1000 topic distribution vectors {θ_i}_{i=1}^{1000}. For each θ_i, we calculate the difference between its largest and smallest probability value and then average these differences over the 1000 samples

samples: 1000
To verify this conjecture, we construct a simple method to measure sparsity: after training, we randomly feed 1000 training samples into the encoder network and obtain 1000 topic distribution vectors {θ_i}_{i=1}^{1000}. For each θ_i, we calculate the difference between its largest and smallest probability value and then average these differences over the 1000 samples. Clearly, a larger difference indicates a sparser θ; e.g., the maximum difference of 1 is achieved by a one-hot vector
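The max-minus-min sparsity measure described above can be sketched directly:

```python
import numpy as np

def sparsity_score(thetas):
    """Average gap between the largest and smallest entry of each topic
    distribution; a one-hot vector attains the maximum gap of 1, a
    uniform vector the minimum gap of 0."""
    thetas = np.asarray(thetas)
    return float((thetas.max(axis=1) - thetas.min(axis=1)).mean())

one_hot = np.eye(4)[[0, 1]]      # two one-hot topic vectors (sparsest case)
uniform = np.full((2, 4), 0.25)  # two uniform topic vectors (densest case)
assert sparsity_score(one_hot) == 1.0
assert sparsity_score(uniform) == 0.0
```

In the paper's setup, `thetas` would be the 1000 encoder outputs gathered after training.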

synthetic datasets: 3
As shown, all the training losses decrease stably, although a higher ∆ setting hinders the loss converging to a lower value. Figure 3 (right) reports how different ∆ settings influence the recovery accuracy of RRT-VAE on three synthetic datasets. It can be seen that a smaller ∆ achieves a better performance

synthetic datasets: 3
But it is not the case for the other models. We compare RRT-VAE with Online LDA, ProdLDA and DirVAE on three synthetic datasets which are generated by different Dirichlet parameters. The compared three neural topic models adopt the same standard decoder of (4)

datasets: 5
Optimal λ settings of RRT-VAE for different datasets. Perplexity/NPMI of the compared topic models on five datasets. The number of topics is set to 50. Perplexity/NPMI of the compared topic models on five datasets. The number of topics is set to 200

datasets: 5
Perplexity/NPMI of the compared topic models on five datasets. The number of topics is set to 50. Perplexity/NPMI of the compared topic models on five datasets. The number of topics is set to 200. Topic words extracted from the Yelp dataset. From top to bottom, each cell is extracted by NVDM, ProdLDA, DirVAE and RRT-VAE. More examples are exhibited in Appendix B.3

synthetic datasets: 3
(a) Training performance of RRT-VAE with different ∆ settings; (b)–(d) perplexity, NPMI and sparsity of RRT-VAE with different ∆ and prior α settings. In these experiments, λ is set to 0.01 and the number of topics is set to 50. Figure 3 (left) exhibits how different ∆ settings influence the training performance of RRT-VAE when αg = 0.01 (the results for αg = 0.05 and 0.1 are shown in Appendix A.2). As shown, all the training losses decrease stably, although a higher ∆ setting hinders the loss from converging to a lower value. Figure 3 (right) reports how different ∆ settings influence the recovery accuracy of RRT-VAE on three synthetic datasets. It can be seen that a smaller ∆ achieves better performance. Specifically, when ∆ = 1, the training loss remains at a high value and the corresponding recovery accuracy is lower than 60%, indicating that RRT-VAE fails to fit the true data distribution. In contrast, when ∆ = 10⁻¹⁰, RRT-VAE fits the data well: the training loss drops rapidly and converges to a much lower value, and the resulting recovery accuracy reaches up to 90%. Related figure captions: training performance (left) and recovery accuracy (right) of RRT-VAE on a synthetic dataset (αg = 0.01) with different ∆ settings; training performance of RRT-VAE with different ∆ settings (left: αg = 0.05; right: αg = 0.1).

Reference
  • Loulwah AlSumait, Daniel Barbara, and Carlotta Domeniconi. 2008. On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. In 2008 Eighth IEEE International Conference on Data Mining, pages 3–12. IEEE.
  • David M. Blei and John D. Lafferty. 2006. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, pages 113–120.
  • David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
  • Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems, pages 288–296.
  • Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
  • Mikhail Figurnov, Shakir Mohamed, and Andriy Mnih. 2018. Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pages 441–452.
  • Peter W. Glynn. 1990. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84.
  • Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. 2004. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530.
  • Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235.
  • Matthew Hoffman, Francis R. Bach, and David M. Blei. 2010. Online learning for latent dirichlet allocation. In Advances in Neural Information Processing Systems, pages 856–864.
  • Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57.
  • Weonyoung Joo, Wonsung Lee, Sungrae Park, and Il-Chul Moon. 2019. Dirichlet variational autoencoder. arXiv preprint arXiv:1901.02739.
  • Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • David A. Knowles. 2015. Stochastic gradient variational bayes for gamma approximating distributions. arXiv preprint arXiv:1509.01631.
  • Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 530–539.
  • Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195.
  • Rishabh Mehrotra, Scott Sanner, Wray Buntine, and Lexing Xie. 2013. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 889–892.
  • Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In International Conference on Machine Learning, pages 1727–1736.
  • Christian A. Naesseth, Francisco J. R. Ruiz, Scott W. Linderman, and David M. Blei. 2016. Reparameterization gradients through acceptance-rejection sampling algorithms. arXiv preprint arXiv:1610.05683.
  • David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 100–108. Association for Computational Linguistics.
  • Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.
  • Xing Wei and W. Bruce Croft. 2006. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 178–185.
  • Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256.
  • Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.