On Importance Sampling-Based Evaluation of Latent Language Models

Robert L Logan IV

ACL, pp. 2171-2176, 2020.


Abstract:

Language models that use additional latent structures (e.g., syntax trees, coreference chains, and knowledge graph links) provide several advantages over traditional language models. However, likelihood-based evaluation of these models is often intractable as it requires marginalizing over the latent space. Existing methods avoid this issue…

Introduction
  • Latent language models are generative models of text that jointly represent the text and the latent structure underlying it, such as: the syntactic parse, coreference chains between entity mentions, or links of entities and relations mentioned in the text to an external knowledge graph.
  • The benefits of modeling such structure include interpretability (Hayashi et al, 2020), better performance on tasks requiring structure (Dyer et al, 2016; Ji et al, 2017), and improved ability to generate consistent mentions of entities (Clark et al, 2018) and factually accurate text (Logan et al, 2019)
  • Demonstrating that these models provide better performance than traditional language models by evaluating their likelihood on benchmark data can be difficult, as exact computation requires marginalizing over all possible latent structures.
  • Existing works instead estimate likelihood using importance sampling (a generic sketch of the basic estimator follows this list). They employ a variety of heuristics, such as sampling from proposal distributions that are conditioned on the future gold tokens the model is being evaluated on, and changing the temperature of the proposal distribution, without measuring the effect these decisions have on estimated perplexity, and they often omit details crucial to replicating their results.
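For readers unfamiliar with the technique, the basic estimator can be sketched in generic notation (the notation below is not quoted from the paper): the intractable sum over latent structures z is replaced by an average over N samples drawn from a tractable proposal distribution q(z|x),

    \hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{p(x, z^{(i)})}{q(z^{(i)} \mid x)}, \qquad z^{(i)} \sim q(z \mid x),

which is an unbiased estimate of the marginal likelihood p(x) (Kahn, 1950; Geweke, 1989). How well a finite-sample estimate behaves depends on how closely q matches the true posterior over z, which is what the heuristics above attempt to improve.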
Highlights
  • Latent language models are generative models of text that jointly represent the text and the latent structure underlying it, such as: the syntactic parse, coreference chains between entity mentions, or links of entities and relations mentioned in the text to an external knowledge graph
  • The benefits of modeling such structure include interpretability (Hayashi et al, 2020), better performance on tasks requiring structure (Dyer et al, 2016; Ji et al, 2017), and improved ability to generate consistent mentions of entities (Clark et al, 2018) and factually accurate text (Logan et al, 2019). Demonstrating that these models provide better performance than traditional language models by evaluating their likelihood on benchmark data can be difficult, as exact computation requires marginalizing over all possible latent structures
  • While convergence of importance-sampled estimates is asymptotically guaranteed, results are typically produced using a small number of samples for which this guarantee does not necessarily apply. These works employ a variety of heuristics, such as sampling from proposal distributions that are conditioned on future gold tokens the model is being evaluated on, and changing the temperature of the proposal distribution, without measuring the effect these decisions have on estimated perplexity, and they often omit details crucial to replicating their results.
  • We investigate the application of importance sampling to evaluating latent language models
  • Our contributions include: (1) showing that importance sampling produces stochastic upper bounds of perplexity, thereby justifying the use of such estimates for comparing language model performance, (2) a concise description of common practices used in applying this technique, (3) a simple direct marginalization-based alternative to importance sampling, and (4) experimental results demonstrating the effect of sample size, sampling distribution, and granularity on estimates
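A short derivation, again in generic notation rather than the paper's own, shows why claim (1) holds: because the importance sampling estimator \hat{p}(x) is unbiased, Jensen's inequality applied to the concave logarithm gives

    \mathbb{E}_q[\log \hat{p}(x)] \le \log \mathbb{E}_q[\hat{p}(x)] = \log p(x),

so the estimated log-likelihood underestimates the true log-likelihood in expectation, and the perplexity computed from it, \exp\!\big(-\tfrac{1}{T} \log \hat{p}(x)\big) for a T-token evaluation set, is a stochastic upper bound on the true perplexity. This is what licenses using such estimates to compare against standard language models, whose perplexity is computed exactly.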
Methods
  • Experiments: For EntityNLM and KGLM, the authors experiment with two kinds of proposal distributions: (1) the standard peeking proposal distribution that conditions on future evaluation data, and (2) a non-peeking variant that is conditioned only on the data the model has already observed.
  • For the peeking proposal distribution, the authors experiment with applying temperatures τ ∈ {0.5, 0.9, 1.0, 1.1, 2.0, 5.0}; the sketch after this list shows one way such a temperature adjustment might be implemented.
  • The authors report both corpus-level and instance-level estimates, as well as bounds produced using a direct, beam marginalization method the authors describe later.
  • None of the curves exhibit any signs of convergence even after drawing orders of magnitude more samples (Figure 3); the estimated model perplexities continue to improve.
  • The performance of these models is therefore likely better than the originally reported estimates suggest.
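To make the setup above concrete, here is a minimal Python sketch of how importance weights from a temperature-adjusted proposal might be aggregated into instance-level and corpus-level perplexity estimates. All function and variable names are hypothetical, and the corpus-level bookkeeping reflects one plausible reading of the estimation granularity rather than the paper's exact procedure.

    import numpy as np
    from scipy.special import logsumexp

    def temper_log_proposal(log_q, tau):
        # Temperature-adjusted proposal: q_tau(z | x) is proportional to q(z | x)^(1/tau).
        # tau > 1 flattens the proposal, tau < 1 sharpens it, tau = 1 recovers q.
        scaled = np.asarray(log_q) / tau
        return scaled - logsumexp(scaled)

    def instance_level_perplexity(log_weights, num_tokens):
        # log_weights[j, k] = log p(x_j, z_jk) - log q(z_jk | x_j) for the k-th
        # latent sample drawn for evaluation instance j. Each instance gets its
        # own estimate: log p_hat(x_j) = logsumexp_k(log_weights[j, :]) - log K.
        num_samples = log_weights.shape[1]
        log_p_hat = logsumexp(log_weights, axis=1) - np.log(num_samples)
        return float(np.exp(-log_p_hat.sum() / num_tokens))

    def corpus_level_perplexity(log_weights, num_tokens):
        # Importance weights are multiplied across instances for each sample
        # index before averaging, i.e. the whole corpus is treated as a single
        # instance when forming the estimate.
        num_samples = log_weights.shape[1]
        log_p_hat = logsumexp(log_weights.sum(axis=0)) - np.log(num_samples)
        return float(np.exp(-log_p_hat / num_tokens))

In practice the log_weights array would be filled by running the latent LM and the proposal over each evaluation instance; the sketch only covers the final aggregation step.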
Conclusion
  • The authors investigate the application of importance sampling to evaluating latent language models.
  • While this work helps clarify and validate existing results, the authors observe that none of the estimates appear to converge even after drawing large numbers of samples.
  • The authors encourage future research into obtaining tighter bounds on latent LM perplexity, possibly by using more powerful proposal distributions that consider entire documents as context, or by considering methods such as annealed importance sampling
Tables
  • Table 1: Final perplexity estimates using different proposal distributions, estimated at both the instance and corpus level. τ is the temperature, and No Peeking refers to proposal distributions that are not conditioned on future outputs.
  • Table 2: Strict perplexity upper bounds obtained by marginalizing over the top-k states predicted by q(z|x) using beam search.
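The reason the bounds in Table 2 are strict rather than merely stochastic can be stated in one line (generic notation again): every term of the marginal likelihood p(x) = \sum_z p(x, z) is non-negative, so keeping only the top-k latent states B_k found by beam search under q(z|x) gives

    \sum_{z \in B_k} p(x, z) \le \sum_{z} p(x, z) = p(x),

and the perplexity computed from the truncated sum is therefore a deterministic upper bound on the true perplexity, in contrast to the in-expectation bound obtained from importance sampling.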
Funding
  • This work was funded in part by the Allen Institute for Artificial Intelligence, in part by NSF award #IIS-1817183, and in part by the DARPA MCS program under contract No. N660011924033 with the United States Office of Naval Research.
Reference
  • Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, and Yoshua Bengio. 2016. A neural knowledge language model. arXiv preprint arXiv:1608.00318.
  • Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018. Neural text generation in stories using entity representations as context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2250–2260, New Orleans, Louisiana. Association for Computational Linguistics.
  • Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California. Association for Computational Linguistics.
  • John Geweke. 1989. Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6):1317–1339.
  • Hiroaki Hayashi, Zecong Hu, Chenyan Xiong, and Graham Neubig. 2020. Latent relation language models. In Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), New York, USA.
  • Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A. Smith. 2017. Dynamic entity representations in neural language models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1830– 1839, Copenhagen, Denmark. Association for Computational Linguistics.
  • Herman Kahn. 1950. Random sampling (Monte Carlo) techniques in neutron attenuation problems–I. Nucleonics, 6(5):27–passim.
  • Yoon Kim, Alexander Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gábor Melis. 2019. Unsupervised recurrent neural network grammars. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1105–1117, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Robert Logan, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh. 2019. Barack's wife Hillary: Using knowledge graphs for fact-aware language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5962–5971, Florence, Italy. Association for Computational Linguistics.
  • Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
  • Art B. Owen. 2013. Monte Carlo theory, methods and examples.
  • Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Eduard Hovy, Vincent Ng, and Michael Strube. 2014. Scoring coreference partitions of predicted mentions: A reference implementation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 30–35, Baltimore, Maryland. Association for Computational Linguistics.
  • Aad W. van der Vaart. 2000. Asymptotic Statistics, volume 3. Cambridge University Press.