Improving Neural Topic Models using Knowledge Distillation

EMNLP 2020, pp. 1752–1771


Abstract

Topic models are often used to identify human-interpretable topics to help make sense of large document collections. We use knowledge distillation to combine the best attributes of probabilistic topic models and pretrained transformers. Our modular method can be straightforwardly applied with any neural topic model to improve topic quality, which we demonstrate using two neural topic models of disparate architectures, obtaining state-of-the-art topic coherence across three datasets from different domains.

Introduction
  • The core idea behind the predominant pretrain and fine-tune paradigm for transfer learning in NLP is that general language knowledge, gleaned from large quantities of data using unsupervised objectives, can serve as a foundation for more specialized endeavors.
  • Current practice involves taking the full model that has amassed such general knowledge and fine-tuning it with a second objective appropriate to the new task
  • Using these methods, pre-trained transformer-based language models (e.g., BERT; Devlin et al., 2019) have been employed to great effect on a wide variety of NLP problems, thanks, in part, to a fine-grained ability to capture aspects of linguistic context (Clark et al., 2019; Liu et al., 2019; Rogers et al., 2020).
  • During backpropagation (Eq. (3)), the topic parameters will only update to account for observed terms, which can lead to overfitting and topics with suboptimal coherence (a schematic version of this reconstruction term appears directly below).
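To make the word-level reconstruction concrete, here is a schematic version in our own notation (not necessarily identical to the paper's Eq. (3)): the decoder reconstructs document d from its topic proportions θ_d and topic-word logits B, and each vocabulary item contributes to the loss in proportion to its observed count.

    \hat{x}_d = \mathrm{softmax}\!\left(\theta_d^{\top} B\right),
    \qquad
    \mathcal{L}_{\mathrm{recon}}(d) = -\sum_{v \in V} x_{dv} \log \hat{x}_{dv}

Terms with x_dv = 0 drop out of the sum, so, as the bullet above notes, the topic parameters update only to account for observed terms.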
Highlights
  • The core idea behind the predominant pretrain and fine-tune paradigm for transfer learning in NLP is that general language knowledge, gleaned from large quantities of data using unsupervised objectives, can serve as a foundation for more specialized endeavors
  • In addition to its improved performance, BERT-based Autoencoder as Teacher (BAT) can be applied straightforwardly to other models, because it makes very few assumptions about the base model—requiring only that it rely on a word-level reconstruction objective, which is true of the majority of neural topic models proposed to date
  • We illustrate this by using the Wasserstein auto-encoder (W-LDA) as a base neural topic model (NTM), showing in Table 3 that BAT improves on the unaugmented model
  • Consistent with prior work on automatic evaluation of topic models, differences in normalized pointwise mutual information (NPMI) do appear to correspond to recognizable subjective differences in topic quality
  • We are the first to distill a “blackbox” neural network teacher to guide a probabilistic graphical model. We do this in order to combine the expressivity of probabilistic topic models with the precision of pretrained transformers
  • We show that our adaptable framework improves performance in the aggregate over all estimated topics, as is commonly reported, and in head-to-head comparisons of aligned topics
  • Our modular method sits atop any neural topic model (NTM) to improve topic quality, which we demonstrate using two NTMs of highly disparate architectures (VAEs and WAEs), obtaining state-of-the-art topic coherence across three datasets from different domains (a sketch of the distillation target follows these highlights)
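As a rough illustration of how a BAT-style reconstruction target could be assembled from the quantities named in the paper's appendices (the teacher's vocabulary logits, the weight λ on those logits, the softmax temperature T from Eq. (4), and logit clipping from Section 2.3). The function name, the mixture form, and the quantile-based clipping below are our assumptions, a minimal sketch rather than the paper's exact Eq. (4).

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def bat_target(bow_counts, teacher_logits, lam=0.5, T=2.0, clip_quantile=None):
        """Soft reconstruction target mixing the document's empirical word
        distribution with the teacher's temperature-scaled predictions (a sketch).

        bow_counts:     (V,) observed word counts for one document
        teacher_logits: (V,) vocabulary logits from the BERT-based autoencoder teacher
        lam:            weight on the teacher logits (lambda in Eq. (4))
        T:              softmax temperature (T in Eq. (4))
        clip_quantile:  optionally drop teacher logits below this quantile (clipping)
        """
        logits = np.asarray(teacher_logits, dtype=float).copy()
        if clip_quantile is not None:
            cutoff = np.quantile(logits, clip_quantile)
            logits = np.where(logits >= cutoff, logits, -np.inf)
        teacher_dist = softmax(logits / T)
        empirical = bow_counts / max(bow_counts.sum(), 1)
        return (1.0 - lam) * empirical + lam * teacher_dist

Under this reading, the base NTM treats the returned vector as soft labels in its word-level reconstruction loss, so gradients can also flow for words the teacher considers probable but that happen to be absent from the document.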
Methods
  • 2.1 Background on topic models

    Topic modeling is a well-established probabilistic method that aims to summarize large document corpora using a much smaller number of latent topics.
  • Neural topic models have capitalized on the VAE framework (Srivastava and Sutton, 2017; Card et al., 2018; Burkhardt and Kramer, 2019, inter alia) and other deep generative models (Wang et al., 2019; Nan et al., 2019)
  • In addition to their flexibility, the best models yield more coherent topics than LDA (a minimal sketch of the shared VAE-style pattern follows)
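For readers less familiar with the VAE framing, the sketch below shows the generic pattern such models share. It is a minimal PyTorch illustration under our own simplifying assumptions, not the paper's SCHOLAR, DVAE, or W-LDA: an encoder maps a bag-of-words to a latent document-topic vector, and a linear decoder over topic-word logits reconstructs the document.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MiniVAETopicModel(nn.Module):
        """Generic VAE-style neural topic model (illustrative sketch only)."""

        def __init__(self, vocab_size, num_topics, hidden=300):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
            self.mu = nn.Linear(hidden, num_topics)
            self.logvar = nn.Linear(hidden, num_topics)
            # Rows of this weight matrix play the role of topic-word logits.
            self.decoder = nn.Linear(num_topics, vocab_size, bias=False)

        def forward(self, bow):  # bow: (batch, vocab_size) word counts
            h = self.encoder(bow)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
            theta = F.softmax(z, dim=-1)                  # document-topic proportions
            word_logp = F.log_softmax(self.decoder(theta), dim=-1)
            recon = -(bow * word_logp).sum(-1).mean()     # word-level reconstruction
            kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
            return recon + kl                             # negative ELBO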
Results
  • Results and Discussion

    Using the VAE-based SCHOLAR as the base model, topics discovered using BAT are more coherent, as measured via NPMI, than previous state-of-the-art baseline NTMs (Table 2), improving on the DVAE and W-LDA baselines, and the baseline of SCHOLAR without the KD augmentation.
  • In addition to its improved performance, BAT can be applied straightforwardly to other models, because it makes very few assumptions about the base model—requiring only that it rely on a word-level reconstruction objective, which is true of the majority of neural topic models proposed to date
  • The authors illustrate this by using the Wasserstein auto-encoder (W-LDA) as a base NTM, showing in Table 3 that BAT improves on the unaugmented model.
  • The authors report the dev set results in Appendix A—the same pattern of results is obtained for all models
Conclusion
  • Conclusions and Future Work

    To their knowledge, the authors are the first to distill a “blackbox” neural network teacher to guide a probabilistic graphical model.
  • The authors hope to explore the effects of the pretraining corpus (Gururangan et al, 2020) and teachers on the generated topics
  • Another intriguing direction is exploring the connection between the methods and neural network interpretability.
  • As the weight λ on the BERT autoencoder logits goes to one, the topic model begins to describe the teacher more and the corpus less
  • The authors believe mining this connection can open up further research avenues; for instance, by investigating the differences in such teacher-topics conditioned on the pre-training corpus.
  • Because the authors are motivated primarily by the widespread use of topic models for identifying interpretable topics (Boyd-Graber et al., 2017, Ch. 3), they plan to explore the ideas presented here further in the context of downstream applications like document classification
Tables
  • Table1: Corpus statistics, which vary in total number of documents (D), vocabulary size (V), and average document length (Nd)
  • Table2: The NPMI for our baselines (Section 3.2) compared with BAT (explained in Section 2.3) using SCHOLAR as our base neural architecture. We achieve better NPMI than all baselines across three datasets and K = 50, K = 200 topics. We use 5 random restarts and report the standard deviation
  • Table3: Mean NPMI (s.d.) across 5 runs for W-LDA (Nan et al., 2019) and W-LDA+BAT for K = 50, showing improvement on two of three datasets. This demonstrates that our method is modular and can be used with base neural topic models that vary significantly in architecture
  • Table4: External NPMI (s.d.) for the base SCHOLAR and SCHOLAR+BAT. Models selected according to performance on the development set using internal NPMI
  • Table5: Selected examples of SCHOLAR+BAT improving on topics from SCHOLAR. We observe that the improved 20ng topic is more cleanly focused on the NHL (removing european, adding the Toronto Maple Leafs, evoking the Stanley Cup rather than the more generic ice); the improved wiki topic about typhoons is more clearly concentrated on meteorological terms, rather than interspersing specific locations of typhoons (luzon, guam); and the improved IMDb topic more cleanly reflects what we would characterize as “video adaptations” by bringing in terms about that subject (book, books, novels, read) in place of predominant words relating to particular adaptations. Randomly selected examples can be found in Appendix G
  • Table6: The development-set NPMI for our baselines (Section 3.2) compared with BAT (explained in Section 2.3) using SCHOLAR as our base neural architecture. We achieve better NPMI than all baselines across three datasets and K = 50, K = 200 topics. We use 5 random restarts and report the standard deviation
  • Table7: The mean development-set NPMI (std. dev.) across 5 runs for W-LDA and W-LDA+BAT for K = 50, showing improvement on all datasets. This demonstrates that our innovation is modular and can be used with base neural topic models that vary in architecture
  • Table8: Random forest classification accuracy on 20ng and IMDb datasets, using topic estimates from SCHOLAR and SCHOLAR + BAT
  • Table9: Mean development set NPMI (s.d.) across 5 runs for DVAE (Burkhardt and Kramer, 2019) and DVAE+BAT for K = 50
  • Table10: Effect on topic coherence of passing various document representations to the SCHOLAR encoder (using the IMDb data). Each setting describes the document representation provided to the encoder, which is transformed by one feed-forward layer of 300 dimensions followed by a second down to K dimensions. “+ w2v” indicates that we first concatenated with the sum of the 300-dimensional word2vec embeddings for the document. Note that these early findings are based on a different IMDb development set, a 20% split from the training data. They are thus not directly comparable to the results reported elsewhere in the text, which used a separate held-out development set
  • Table11: Hyperparameter ranges and optimal values (as determined by development set NPMI) for SCHOLAR and SCHOLAR+BAT, on the 20NG dataset. lr is the learning rate, α is the hyperparameter for the logistic normal prior, λ is the weight on the teacher model logits from Eq (4), and T is the softmax temperature from Eq (4). Other hyperparameter values (which can be accessed in our code base) which were kept at their default values are not reported here. Values marked with the * are also kept at their default values per the base SCHOLAR model (https://github.com/dallascard/scholar). All different sweeps in the grid search were run for 500 epochs with a batch size = 200
  • Table12: Hyperparameter ranges and optimal values (as determined by development set NPMI) for SCHOLAR and SCHOLAR+BAT, on the Wiki dataset. lr is the learning rate, α is the hyperparameter for the logistic normal prior, anneal controls the annealing (as explained in Appendix B in Card et al (2018)), λ is the weight on the teacher model logits from Eq (4), T is the softmax temperature from Eq (4), and clipping controls how much of the logit distribution to clip (Section 2.3). Other hyperparameter values (which can be accessed in our code base) which were kept at their default values are not reported here. All different sweeps in the grid search were run for 500 epochs with a batch size = 500
  • Table13: Hyperparameter ranges and optimal values (as determined by development set NPMI) for SCHOLAR and SCHOLAR+BAT, on the IMDb dataset. lr is the learning rate, α is the hyperparameter for the logistic normal prior, anneal controls the annealing (as explained in Appendix B in Card et al (2018)), λ is the weight on the teacher model logits from Eq (4), T is the softmax temperature from Eq (4), and clipping controls how much of the logit distribution to clip (Section 2.3). Other hyperparameter values (which can be accessed in our code base) which were kept at their default values are not reported here. Values marked with the * are also kept at their default values per the base SCHOLAR model (https://github.com/dallascard/scholar). All different sweeps in the grid search were run for 500 epochs with a batch size = 200
  • Table14: Hyperparameter ranges and optimal values (as determined by development set NPMI) for W-LDA and W-LDA+BAT, on all three datasets. lr is the learning rate, α is the hyperparameter for the Dirichlet prior, λ is the weight on the teacher model logits from Eq (4), T is the softmax temperature from Eq (4), and clipping controls how much of the logit distribution to clip (Section 2.3). Other hyperparameter values (which can be accessed in our codebase) which were kept at their default values in the original baseline code are not reported here (also see Nan et al. (2019) and https://github.com/awslabs/w-lda/). Values marked with the * are also kept at their default values. All different sweeps in the grid search were run for 500 epochs and noise parameter = 0.5 (see Nan et al. (2019)). For 20NG and IMDb, we used batch size = 200, and for Wiki, we used batch size = 360
  • Table15: For DVAE, we tried four values for the Dirichlet prior (as per the values tried by the authors in Burkhardt and Kramer (2019)) - {0.01, 0.1, 0.2, 0.6} - and report the optimal values corresponding to the test set results (Table 2) and dev set results (Table 6) in this table. Within the model variations available in the codebase for DVAE (https://github.com/sophieburkhardt/dirichlet-vae-topic-models) we use the Dirichlet VAE based on RSVI which is shown to give the highest NPMI scores in Burkhardt and Kramer (2019)
  • Table16: Fifteen aligned topic pairs from the 20NG dataset
  • Table17: Fifteen aligned topic pairs from the Wiki dataset
  • Table18: Fifteen aligned topic pairs from the IMDB dataset
Related work
  • Integrating embeddings into topic models. A key goal in our use of knowledge distillation is to incorporate relationships between words that may not be well supported by the topic model’s input documents alone. Some previous topic models have sought to address this issue by incorporating external word information, including word senses (Ferrugento et al., 2016) and pretrained word embeddings (Hu and Tsujii, 2016; Yang et al., 2017; Xun et al., 2017; Ding et al., 2018). More recently, Bianchi et al. (2020) have incorporated BERT embeddings into the encoder to improve topic coherence. (See Appendix D.1 for our own related experiments, which yielded mixed results.) We refer the reader to Dieng et al. (2020) for an extensive and up-to-date overview.

    Example topics (cf. Table 5):
    20ng SCHOLAR: nhl hockey player coach ice playoff team league stanley european | SCHOLAR+BAT: nhl hockey player team coach playoff cup wings stanley leafs
    Wiki SCHOLAR: jtwc jma typhoon monsoon luzon geophysical pagasa guam cyclone southwestward | SCHOLAR+BAT: jtwc jma typhoon meteorological intensification monsoon dissipating shear outflow trough
    IMDb SCHOLAR: adaptation version novel bbc versions jane kenneth handsome adaptations faithful | SCHOLAR+BAT: adaptation novel book read books faithful bbc version versions novels

    A limitation of these approaches is that they simply import general, non-corpus-specific word-level information. In contrast, representations from a pretrained transformer can benefit from both general language knowledge and corpus-dependent information, by way of the pretraining and fine-tuning regime. By regularizing toward representations conditioned on the document, we remain coherent relative to the topic model data. An additional key advantage for our method is that it involves only a slight change to the underlying topic model, rather than the specialized designs by the above methods.
Funding
  • This material is based upon work supported by the National Science Foundation under Grants 2031736 and 2008761 and by Amazon
Study subjects and analysis
standard English-language topic-modeling datasets: 3
• We introduce a novel coupling of the knowledge distillation technique with generative graphical models.
• We construct knowledge-distilled neural topic models that achieve better topic coherence than their counterparts without distillation on three standard English-language topic-modeling datasets.
• We demonstrate that our method is not only effective but modular, by improving topic coherence in a base state-of-the-art model by modifying only a few lines of code.

readily available datasets: 3
3.1 Data and Metrics. We validate our approach using three readily available datasets that vary widely in domain, corpus and vocabulary size, and document length: 20 Newsgroups (20NG; Lang, 1995), Wikitext-103 (Wiki; Merity et al., 2017), and IMDb movie reviews (IMDb; Maas et al., 2011). These are commonly used in neural topic modeling, with preprocessed versions provided by various authors; see references in Table 1 for details

datasets: 3
Fig. 2 shows the JS-divergences for aligned topic pairs, for our three corpora. Based on visual inspection, we choose the 44 most aligned topic pairs as being meaningful for comparison; beyond this point, the topics do not bear a conceptual relationship (using the same threshold for the three datasets for simplicity). A sketch of such a JS-divergence alignment appears below.
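The following is a minimal sketch assuming topics from the two models are paired one-to-one by lowest Jensen-Shannon divergence between their topic-word distributions; the greedy pairing and the function name are our simplifications, and the paper's exact alignment procedure may differ.

    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def align_topics(beta_a, beta_b):
        """Greedily pair topics from two models by Jensen-Shannon divergence.

        beta_a, beta_b: (K, V) topic-word distributions (rows sum to 1).
        Returns (i, j, jsd) triples, from most to least aligned pair.
        """
        K = beta_a.shape[0]
        # scipy's jensenshannon returns the JS distance; square it for the divergence.
        jsd = np.array([[jensenshannon(beta_a[i], beta_b[j]) ** 2
                         for j in range(K)] for i in range(K)])
        pairs, used_a, used_b = [], set(), set()
        for i, j in sorted(np.ndindex(K, K), key=lambda ij: jsd[ij]):
            if i not in used_a and j not in used_b:
                pairs.append((i, j, float(jsd[i, j])))
                used_a.add(i)
                used_b.add(j)
        return pairs

Pairs with small divergence correspond to the conceptually related topics that are then compared head-to-head by NPMI.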

datasets: 3
When we consider these conceptually related topic pairs, we see that the model augmented with BAT has the topic with the higher NPMI value more often across all three datasets (Fig. 3). This means that BAT is not just producing improvements in the aggregate (Section 4): its effect can be interpreted more specifically as identifying the same space of topics generated by an existing model and, in most cases, improving the coherence of individual topics

aligned pairs: 15
We find that, consistent with prior work on automatic evaluation of topic models, differences in NPMI do appear to correspond to recognizable subjective differences in topic quality. So that readers may form their own judgments, Appendix G presents 15 aligned pairs for each corpus, selected randomly by stratifying across levels of alignment quality to create a fair sample to review.
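For reference, NPMI coherence for a topic is computed from co-occurrence statistics of its top words. The sketch below uses document-level co-occurrence probabilities, a common convention; the paper's exact estimator, reference corpus, and handling of never co-occurring pairs may differ.

    import numpy as np
    from itertools import combinations

    def npmi_coherence(topic_words, docs, eps=1e-12):
        """Average NPMI over all pairs of a topic's top words (illustrative sketch).

        topic_words: list of the topic's top-N words
        docs:        iterable of token lists used as the reference corpus
        """
        doc_sets = [set(d) for d in docs]
        n = len(doc_sets)
        def p(*words):  # fraction of documents containing all the given words
            return sum(all(w in d for w in words) for d in doc_sets) / n
        scores = []
        for w1, w2 in combinations(topic_words, 2):
            p1, p2, p12 = p(w1), p(w2), p(w1, w2)
            if p12 == 0.0:
                scores.append(-1.0)  # one common convention for non-co-occurring pairs
                continue
            pmi = np.log(p12 / (p1 * p2 + eps))
            scores.append(pmi / (-np.log(p12) + eps))
        return float(np.mean(scores))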

Reference
  • Nikolaos Aletras and Mark Stevenson. 2013. Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, pages 13–22, Potsdam, Germany. Association for Computational Linguistics.
  • Federico Bianchi, Silvia Terragni, and Dirk Hovy. 2020. Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv:2004.03974 [cs].
  • David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.
  • Jianhua Lin. 1991. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory, 37(1):145–151.
  • Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Xuan Liu, Xiaoguang Wang, and Stan Matwin. 2018a. Improving the Interpretability of Deep Neural Networks with Knowledge Distillation. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pages 905–912. IEEE.
  • Yongcheng Liu, Lu Sheng, Jing Shao, Junjie Yan, Shiming Xiang, and Chunhong Pan. 2018b. MultiLabel Image Classification via Knowledge Distillation from Weakly-Supervised Detection. 2018 ACM Multimedia Conference on Multimedia Conference MM ’18, pages 700–708.
  • Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
  • I Dan Melamed. 2000. Models of translational equivalence among words. Computational Linguistics, 26(2):221–249.
  • Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer Sentinel Mixture Models. ICLR.
  • Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural Variational Inference for Text Processing. ICML.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings.
  • Feng Nan, Ran Ding, Ramesh Nallapati, and Bing Xiang. 2019. Topic modeling with Wasserstein autoencoders. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6345–6381, Florence, Italy. Association for Computational Linguistics.
  • Viet-An Nguyen, Jordan L Ying, and Philip Resnik. 2013. Lexical and hierarchical topic regression. In Advances in neural information processing systems, pages 1106–1114.
  • Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword Fifth Edition.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research.
  • Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A Primer in BERTology: What we know about how BERT works. arXiv:2002.12327 [cs].
  • Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. NeurIPS EMC^2 Workshop.
  • Akash Srivastava and Charles Sutton. 2017. Autoencoding variational inference for topic models. In ICLR.
  • Haipeng Sun, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, and Tiejun Zhao. 2020. Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation. arXiv:2004.10171 [cs].
  • Jiaxi Tang, Rakesh Shivanna, Zhe Zhao, Dong Lin, Anima Singh, Ed H. Chi, and Sagar Jain. 2020. Understanding and Improving Knowledge Distillation. arXiv:2002.03532 [cs, stat].
  • Raphael Tang, Yao Lu, and Jimmy Lin. 2019a. Natural Language Generation for Effective Knowledge Distillation. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 202–208, Hong Kong, China. Association for Computational Linguistics.
  • Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019b. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. arXiv:1903.12136 [cs].
  • Ilya Tolstikhin, Sylvain Gelly, Olivier Bousquet, and Bernhard Scholkopf. 2018. Wasserstein AutoEncoders. ICLR, page 16.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  • Hanna M. Wallach, David M. Mimno, and Andrew McCallum. 2009. Rethinking LDA: Why Priors Matter. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1973– 1981. Curran Associates, Inc.
  • Rui Wang, Xuemeng Hu, Deyu Zhou, Yulan He, Yuxuan Xiong, Chenchen Ye, and Haiyang Xu. 2020. Neural Topic Modeling with Bidirectional Adversarial Training. ACL, page 11.
  • Rui Wang, Deyu Zhou, and Yulan He. 2019. ATM:Adversarial-neural Topic Model. Information Processing & Management, 56(6):102098.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
  • Andrew KC Wong and Manlai You. 1985. Entropy and distance of random graphs with application to structural pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, (5):599– 609.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144 [cs].
  • Hongteng Xu, Wenlin Wang, Wei Liu, and Lawrence Carin. 2018. Distilled Wasserstein Learning for Word Embedding and Topic Modeling. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1716– 1725. Curran Associates, Inc.
  • Guangxu Xun, Yaliang Li, Wayne Xin Zhao, Jing Gao, and Aidong Zhang. 2017. A correlated topic model using word embeddings. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, pages 4207–4213, Melbourne, Australia. AAAI Press.
  • Weiwei Yang, Jordan Boyd-Graber, and Philip Resnik. 2017. Adapting topic models using lexical associations with tree priors. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1901–1906.
  • Ziqing Yang, Yiming Cui, Zhipeng Chen, Wanxiang Che, Ting Liu, Shijin Wang, and Guoping Hu. 2020. TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural Language Processing. ACL Demo Session.
Author
Alexander Hoyle
Pranav Goel
Philip Resnik