# Learning VAE LDA Models with Rounded Reparameterization Trick

EMNLP 2020, pp. 1315–1325

Abstract

The introduction of VAE provides an efficient framework for the learning of generative models, including generative topic models. However, when the topic model is a Latent Dirichlet Allocation (LDA) model, a central technique of VAE, the reparameterization trick, fails to be applicable. This is because no reparameterization form of Dirichlet distributions…

Introduction

- Probabilistic generative models are widely used in topic modelling and have achieved great success in many applications (Deerwester et al, 1990)(Hofmann, 1999)(Blei et al, 2003)(Blei and Lafferty, 2006).
- A landmark of topic models is Latent Dirichlet Allocation (LDA) (Blei et al, 2003), where a document is treated as a bag of words and each word is modelled via a generative process
- In this generative process, a topic distribution is first drawn from a Dirichlet prior, a topic is sampled from the topic distribution and a word is drawn subsequently from the word distribution corresponding to the drawn topic.
- The reparameterization trick allows the model to be trained efficiently using back-propagation
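The generative process in the bullets above can be sketched directly with NumPy (the 30 topics and 500-word vocabulary follow the paper's synthetic setup; the document length is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, doc_len = 30, 500, 100  # doc_len chosen for illustration

# Topic-word distributions: one categorical distribution over the
# vocabulary per topic (each row sums to 1).
beta = rng.dirichlet(np.ones(vocab_size), size=n_topics)

# LDA generative process for a single document:
alpha = 0.1 * np.ones(n_topics)   # Dirichlet prior
theta = rng.dirichlet(alpha)      # 1) draw the topic distribution
words = []
for _ in range(doc_len):
    z = rng.choice(n_topics, p=theta)                # 2) draw a topic
    words.append(rng.choice(vocab_size, p=beta[z]))  # 3) draw a word
```

A document is then the bag of the sampled word indices, which is exactly the bag-of-words view the bullets describe.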

Highlights

- When ∆ = 1, the training loss remains at a high value and the corresponding recovery accuracy is lower than 60%, indicating that the Rounded Reparameterization Trick (RRT)-Variational AutoEncoder (VAE) fails to fit the true data distribution
- The applicability of RRT can be generalized beyond Dirichlet distributions
- It will be interesting to investigate the performance of RRT in other applications of VAE beyond topic modelling

Results

- To quantitatively evaluate RRT-VAE, the authors conduct experiments on synthetic datasets and five real-world datasets.
- Figure 3 reports how different ∆ settings influence the recovery accuracy of RRT-VAE on three synthetic datasets.
- When ∆ = 1, the training loss remains at a high value and the corresponding recovery accuracy is lower than 60%, indicating that RRT-VAE fails to fit the true data distribution.
- When ∆ = 10^-10, RRT-VAE fits the data well: the training loss drops rapidly and converges to a much lower value; the resulting recovery accuracy reaches up to 90%

Conclusion

**Concluding Remarks**

In this paper, the rounded reparameterization trick (RRT) is shown to be an effective and efficient reparameterization method for Dirichlet distributions in the context of learning VAE-based LDA models.

- The applicability of RRT can be generalized beyond Dirichlet distributions.
- This is because any distribution can be reparameterized to an “RRT form” as long as a sampling algorithm exists for that distribution.
- It will be interesting to investigate the performance of RRT in other applications of VAE beyond topic modelling.
- Successes in these investigations will certainly extend the applicability of VAE to much broader application domains and model families
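For contrast with RRT, the standard reparameterization trick that works for Gaussians (and, per the paper, has no known counterpart for Dirichlet distributions) writes a sample as a deterministic, differentiable function of the parameters plus parameter-free noise, which is what lets gradients back-propagate through the sampling step. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian reparameterization: z = mu + sigma * eps with eps ~ N(0, 1).
# z depends on (mu, sigma) differentiably, so back-propagation can push
# gradients through the sampling step.
mu, sigma = 2.0, 0.5
eps = rng.standard_normal(10_000)   # parameter-free noise
z = mu + sigma * eps                # reparameterized samples

# Sanity check: the samples have roughly the requested moments.
print(round(float(z.mean()), 1), round(float(z.std()), 1))
```

The paper's point is that no such closed-form differentiable sampler is known for the Dirichlet, which is the gap RRT addresses.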

Summary

## Objectives:

- The authors' goal is to use the topic models to recover the ground-truth topic-word matrix.

- Table1: Summary of different datasets
- Table2: Evaluation results on RRT-VAE with different prior settings. Perplexity: lower is better; NPMI: higher is better; Sparsity: higher means sparser
- Table3: Evaluation results of RRT-VAE with different λ settings
- Table4: Optimal λ settings of RRT-VAE for different datasets
- Table5: Perplexity/NPMI of the compared topic models on five datasets. The number of topics is set to 50
- Table6: Perplexity/NPMI of the compared topic models on five datasets. The number of topics is set to 200
- Table7: Topic words extracted from the Yelp dataset. From top to bottom, each cell is extracted by NVDM, ProdLDA, DirVAE and RRT-VAE. More examples are exhibited in Appendix B.3
- Table8: Recovery accuracy of four topic models on synthetic datasets generated by three different αg settings. For RRT-VAE, λ is set to 1; ∆ is set to 10^-10
- Table9: Topic words recovery accuracy of three neural topic models on synthetic datasets generated with three different αg settings. The models adopt the same prod decoder structure. For RRT-VAE, λ is set to 1; ∆ is set to 10^-10
- Table10: Left: the ground truth topic word matrix Tg; Right: a matrix TL learned by RRT-VAE. Note that the rows of TL are arbitrarily ordered. For example, the first and second rows of Tg individually correspond to the 11th and 14th rows of TL (as shown in bold)
- Table11: The standard decoder appears to extract many repetitive words on 20NG
- Table12: The experimental results of Online LDA on the 20NG dataset
- Table13: Topic words extracted by RRT-VAE from four different datasets. From top to bottom, each cell is extracted from 20NG, AGNews, RCV1-v2 and DBpedia
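Since the learned rows of TL are arbitrarily ordered (Table 10), computing recovery accuracy requires first aligning learned topics with ground-truth ones. The paper's exact matching procedure is not given in this summary; a plausible sketch uses cosine similarity with greedy assignment:

```python
import numpy as np

def match_rows(T_g, T_L):
    """For each ground-truth topic row of T_g, greedily pick the most
    cosine-similar unused row of T_L; returns one T_L index per T_g row."""
    norm = lambda T: T / np.linalg.norm(T, axis=1, keepdims=True)
    sim = norm(T_g) @ norm(T_L).T   # pairwise cosine similarities
    perm, taken = [], set()
    for i in range(len(T_g)):
        j = max((j for j in range(len(T_L)) if j not in taken),
                key=lambda j: sim[i, j])
        perm.append(j)
        taken.add(j)
    return perm

# Toy check: T_L is a row-shuffled copy of T_g, so matching should
# recover the shuffle.
rng = np.random.default_rng(0)
T_g = rng.dirichlet(np.ones(8), size=4)
T_L = T_g[[2, 0, 3, 1]]          # T_L[j] = T_g[p[j]]
print(match_rows(T_g, T_L))      # inverse permutation of [2, 0, 3, 1]
```

For many topics, an optimal assignment (e.g. the Hungarian algorithm) would be the more robust choice than greedy matching.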


Study subjects and analysis

real-world datasets: 5

5 Experiments and Results. To quantitatively evaluate RRT-VAE, we conduct experiments on synthetic datasets and five real-world datasets. Our model is compared with several existing topic models: Online LDA (Hoffman et al, 2010), NVDM (Miao et al, 2016), ProdLDA (Srivastava and Sutton, 2017) and DirVAE (Joo et al, 2019)

synthetic datasets: 3

Synthetic datasets. We construct three synthetic datasets based on the LDA generative process: a 30 × 500 topic-word probability matrix βg is generated as the ground truth; each dataset is then generated based on βg using different Dirichlet priors αg · 1 ∈ R^30, where 1 denotes the all-one vector. We set αg to [0.01, 0.05, 0.1] for the three datasets and the vocabulary size to 500. Each dataset has 20000 training examples.
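The construction above can be sketched end-to-end (a hedged reimplementation: the per-document length is not stated in this summary, so `doc_len` below is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, n_docs = 30, 500, 20000
doc_len = 64  # assumed; not stated in the summary

# Ground-truth topic-word probability matrix beta_g (30 x 500).
beta_g = rng.dirichlet(np.ones(vocab_size), size=n_topics)

def make_dataset(alpha_g):
    """Generate bag-of-words count vectors via the LDA generative
    process with symmetric Dirichlet prior alpha_g * 1."""
    docs = np.zeros((n_docs, vocab_size), dtype=np.int64)
    for d in range(n_docs):
        theta = rng.dirichlet(alpha_g * np.ones(n_topics))
        topic_counts = rng.multinomial(doc_len, theta)  # words per topic
        for t, c in enumerate(topic_counts):
            if c:
                docs[d] += rng.multinomial(c, beta_g[t])
    return docs

# One dataset per prior setting, as in the paper.
datasets = {a: make_dataset(a) for a in [0.01, 0.05, 0.1]}
```

Drawing per-topic word counts with `multinomial` is equivalent to sampling each word's topic and then the word, just vectorized.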

real-world datasets: 5

Real-world datasets. We use five real-world datasets in our experiments: 20NG, RCV1-v2, AGNews, DBPedia (Lehmann et al, 2015), and Yelp review polarity (Zhang et al, 2015). The 20NG and RCV1-v2 datasets are the same as in (Miao et al, 2016).

datasets: 3

The other three datasets are preprocessed through tokenizing, stemming, lemmatizing and the removal of stop words. We keep the most frequent 2000 words in DBPedia and Yelp.

documents: 15

For AGNews, we keep the words which are contained in no more than half the documents and in at least 15 documents. The statistics of the cleaned datasets are summarized in Table 1.
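The AGNews document-frequency rule above (keep words appearing in at least 15 documents and in no more than half of them) can be sketched as a plain Python filter (a stand-in; the authors' actual preprocessing tooling is not specified here):

```python
from collections import Counter

def filter_vocab(docs, min_docs=15, max_frac=0.5):
    """Keep words contained in at least `min_docs` documents and in no
    more than `max_frac` of all documents."""
    df = Counter()                    # document frequency per word
    for doc in docs:
        df.update(set(doc.split()))   # count each word once per doc
    n = len(docs)
    return {w for w, c in df.items() if min_docs <= c <= max_frac * n}

# Toy corpus of 40 docs: "common" appears in every doc (too frequent),
# "rare" in 2 docs (too rare), "mid" in 16 docs (kept).
docs = ["common mid"] * 16 + ["common rare"] * 2 + ["common"] * 22
print(filter_vocab(docs))  # -> {'mid'}
```

The same thresholds map directly onto `min_df=15, max_df=0.5` in tools such as scikit-learn's CountVectorizer.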

training samples: 1000

The sparsity of θ in turn makes it easier for the model to assign a very small probability to some existing words in a document, which increases the training loss and perplexity. To verify this conjecture, we construct a simple method to measure sparsity: after training, we randomly feed 1000 training samples into the encoder network and obtain 1000 topic distribution vectors {θi} (i = 1, …, 1000). For each θi, we calculate the difference between its largest and smallest probability values and then average these differences over the 1000 samples. Clearly, a larger difference indicates a sparser θ, e.g. the maximum difference of 1 is achieved by a one-hot vector.
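The sparsity measure described above (the max-minus-min gap of each θ, averaged over the 1000 samples) is a one-liner; a sketch with Dirichlet stand-ins in place of encoder outputs:

```python
import numpy as np

def sparsity(thetas):
    """Mean over a batch of (max - min) per topic-distribution vector:
    1.0 for one-hot vectors, near 0 for almost-uniform ones."""
    thetas = np.asarray(thetas)
    return float((thetas.max(axis=1) - thetas.min(axis=1)).mean())

rng = np.random.default_rng(0)
# 1000 stand-in topic distributions over 50 topics; in the paper these
# come from feeding training samples through the trained encoder.
spiky = rng.dirichlet(0.01 * np.ones(50), size=1000)   # near one-hot
flat = rng.dirichlet(100.0 * np.ones(50), size=1000)   # near uniform
print(sparsity(spiky) > sparsity(flat))  # -> True
```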

synthetic datasets: 3

As shown, all the training losses decrease stably, although a higher ∆ setting hinders the loss from converging to a lower value. Figure 3 (right) reports how different ∆ settings influence the recovery accuracy of RRT-VAE on three synthetic datasets. It can be seen that a smaller ∆ achieves better performance.

synthetic datasets: 3

But this is not the case for the other models. We compare RRT-VAE with Online LDA, ProdLDA and DirVAE on three synthetic datasets generated with different Dirichlet parameters. The three compared neural topic models adopt the same standard decoder of (4).


synthetic datasets: 3

Figure captions: (a) training performance of RRT-VAE with different ∆ settings; (b)-(d) perplexity, NPMI and sparsity of RRT-VAE with different ∆ and prior α settings (λ = 0.01, 50 topics). Figure 3 (left) exhibits how different ∆ settings influence the training performance of RRT-VAE when αg = 0.01 (the results for αg = 0.05 and 0.1 are shown in Appendix A.2): all the training losses decrease stably, although a higher ∆ setting hinders the loss from converging to a lower value. Figure 3 (right) reports how different ∆ settings influence the recovery accuracy of RRT-VAE on the three synthetic datasets: a smaller ∆ achieves better performance. Specifically, when ∆ = 1 the training loss remains at a high value and recovery accuracy is below 60%, indicating that RRT-VAE fails to fit the true data distribution; when ∆ = 10^-10 the loss drops rapidly and converges to a much lower value, and recovery accuracy reaches up to 90%.

Reference

- Loulwah AlSumait, Daniel Barbara, and Carlotta Domeniconi. 2008. On-line lda: Adaptive topic models for mining text streams with applications to topic detection and tracking. In 2008 eighth IEEE international conference on data mining, pages 3–12. IEEE.
- David M Blei and John D Lafferty. 2006. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pages 113–120.
- David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
- Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L Boyd-Graber, and David M Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems, pages 288–296.
- Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407.
- Mikhail Figurnov, Shakir Mohamed, and Andriy Mnih. 2018. Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pages 441–452.
- Peter W Glynn. 1990. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84.
- Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. 2004. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530.
- Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National academy of Sciences, 101(suppl 1):5228–5235.
- Matthew Hoffman, Francis R Bach, and David M Blei. 2010. Online learning for latent dirichlet allocation. In Advances in Neural Information Processing Systems, pages 856–864.
- Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57.
- Weonyoung Joo, Wonsung Lee, Sungrae Park, and IlChul Moon. 2019. Dirichlet variational autoencoder. arXiv preprint arXiv:1901.02739.
- Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- David A Knowles. 2015. Stochastic gradient variational bayes for gamma approximating distributions. arXiv preprint arXiv:1509.01631.
- Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 530–539.
- Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Soren Auer, et al. 2015. Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web, 6(2):167–195.
- Rishabh Mehrotra, Scott Sanner, Wray Buntine, and Lexing Xie. 2013. Improving lda topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 889–892.
- Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In International conference on machine learning, pages 1727–1736.
- Christian A Naesseth, Francisco JR Ruiz, Scott W Linderman, and David M Blei. 2016. Reparameterization gradients through acceptance-rejection sampling algorithms. arXiv preprint arXiv:1610.05683.
- David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic evaluation of topic coherence. In Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics, pages 100–108. Association for Computational Linguistics.
- Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830.
- Xing Wei and W Bruce Croft. 2006. Lda-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 178–185.
- Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.
- Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657.
