# Scaling Hidden Markov Language Models

EMNLP 2020, pp. 1341-1349, 2020.


Abstract:

The hidden Markov model (HMM) is a fundamental tool for sequence modeling that cleanly separates the hidden state from the emission structure. However, this separation makes it difficult to fit HMMs to large datasets in modern NLP, and they have fallen out of use due to very poor performance compared to fully observed models. This work re...

Introduction

- Hidden Markov models (HMMs) are a fundamental latent-variable model for sequential data, with a rich history in NLP.
- They have been used extensively in tasks such as tagging (Merialdo, 1994), alignment (Vogel et al, 1996), and even, in a few cases, language modeling (Kuhn et al, 1994; Huang, 2011).
- HMMs specify a joint distribution over observed tokens x and discrete latent states z = z1, ..., zT.
- As the authors scale to large state spaces, they take advantage of compact neural parameterizations.
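
To make the joint distribution over tokens and latent states concrete, here is a minimal sketch of the standard forward algorithm, which sums the joint over all state sequences to obtain the marginal likelihood of the tokens. It assumes a plain tabular HMM with a start distribution, a transition matrix, and an emission matrix, all given as log-probabilities. The names `hmm_log_marginal`, `log_start`, `log_trans`, and `log_emit` are illustrative and not taken from the paper.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def hmm_log_marginal(tokens, log_start, log_trans, log_emit):
    """Forward algorithm: returns log p(x) = log sum_z p(x, z).

    log_start[z]      = log p(z_1 = z)
    log_trans[zp][z]  = log p(z_t = z | z_{t-1} = zp)
    log_emit[z][x]    = log p(x_t = x | z_t = z)
    """
    n_states = len(log_start)
    # alpha[z] = log p(x_1..x_t, z_t = z), initialised for t = 1
    alpha = [log_start[z] + log_emit[z][tokens[0]] for z in range(n_states)]
    for x in tokens[1:]:
        # Each step costs O(n_states^2), which is what makes
        # very large state spaces expensive without extra structure.
        alpha = [
            logsumexp([alpha[zp] + log_trans[zp][z] for zp in range(n_states)])
            + log_emit[z][x]
            for z in range(n_states)
        ]
    return logsumexp(alpha)
```

The quadratic cost per time step in the number of states is the reason a naive implementation does not scale to tens of thousands of states.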

Highlights

- Hidden Markov models (HMMs) are a fundamental latent-variable model for sequential data, with a rich history in NLP.
- The VL-HMM is still outperformed by LSTMs, which have been extensively studied for this task.
- This trend persists in WIKITEXT-2, with the very large neural HMM (VL-HMM) outperforming the FF model but underperforming an LSTM.
- This work demonstrates methods for effectively scaling HMMs to large state spaces on parallel hardware, and shows that this approach results in accuracy gains compared to other HMM models.
- We introduce three techniques: a blocked emission constraint, a neural parameterization, and state dropout, which lead to an HMM that outperforms n-gram models and prior HMMs.
- HMMs are a useful class of probabilistic models with different inductive biases, performance characteristics, and conditional independence structure than recurrent neural networks (RNNs).
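
The blocked emission constraint named in the highlights can be illustrated with a small sketch. It assumes that the states are partitioned into blocks and that each token type can be emitted only by the states of a single block, in the spirit of the single-class constraint of Brown clustering described later in this document. Under that assumption the forward recursion only needs to visit the states of the block belonging to each observed token. The names `blocked_log_marginal`, `word_block`, and `block_states` are hypothetical, and this is not the authors' implementation.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def blocked_log_marginal(tokens, word_block, block_states,
                         log_start, log_trans, log_emit):
    """log p(x) when every token type is emitted by one block of states only.

    word_block[x]     = index of the block that may emit token x (assumed given)
    block_states[b]   = list of state ids in block b
    log_emit[z][x]    = log p(x | z), defined only for tokens in z's block
    """
    active = block_states[word_block[tokens[0]]]
    # alpha holds only the states that can emit the current token.
    alpha = {z: log_start[z] + log_emit[z][tokens[0]] for z in active}
    for x in tokens[1:]:
        active = block_states[word_block[x]]
        # Per-step cost is k * k for block size k, not |Z| * |Z|.
        alpha = {
            z: logsumexp([a + log_trans[zp][z] for zp, a in alpha.items()])
            + log_emit[z][x]
            for z in active
        }
    return logsumexp(list(alpha.values()))
```

Because states outside the active block have zero emission probability for the current token, skipping them leaves the marginal unchanged while shrinking each step to the block size.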

Methods

- The authors evaluate HMMs on two language modeling datasets, PENN TREEBANK (PTB) and WIKITEXT-2.
- The authors' approach allows them to train an HMM with tens of thousands of states while maintaining efficiency and significantly outperforming past HMMs as well as n-gram models.
- On PTB the FF model takes 3s per epoch, the LSTM 23s, and the VL-HMM with $2^{15}$ states 433s.
- The inference for the VL-HMM was not heavily optimized, and uses a kernel produced by TVM (Chen et al, 2018) for computing gradients through marginal inference.
- The authors use the following residual network as the MLP: $f_i(E) = g_i(\mathrm{ReLU}(E W_{i1}))$.

Results

- The VL-HMM outperforms the HMM+RNN extension of Buys et al (2018) (142.3).
- These results indicate that HMMs are a much stronger model on this benchmark than previously claimed.
- The VL-HMM is still outperformed by LSTMs, which have been extensively studied for this task.
- This trend persists in WIKITEXT-2, with the VL-HMM outperforming the FF model but underperforming an LSTM.
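
The comparisons above are stated in terms of perplexity (the table captions report perplexities on PTB and WIKITEXT-2). As a reminder of how that metric relates to a model's log-likelihood, the sketch below uses the standard definition: the exponential of the negative average per-token log-likelihood. The function name and argument layout are illustrative assumptions, not the authors' evaluation code.

```python
import math


def perplexity(log_likelihoods, token_counts):
    """Corpus perplexity from per-sequence log-likelihoods (natural log).

    log_likelihoods[i] = log p(x^(i)) for sequence i
    token_counts[i]    = number of tokens in sequence i
    """
    total_log_likelihood = sum(log_likelihoods)
    total_tokens = sum(token_counts)
    # Lower is better; a uniform model over V tokens scores exactly V.
    return math.exp(-total_log_likelihood / total_tokens)
```

For example, a model that assigns probability 1/4 to every token has a perplexity of 4, regardless of sequence length.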

Conclusion

- This work demonstrates methods for effectively scaling HMMs to large state spaces on parallel hardware, and shows that this approach results in accuracy gains compared to other HMM models.
- Future work includes using these approaches to induce model structure, developing accurate models with better interpretability, and applying these approaches in lower data regimes.


- Table 1: Perplexities on PTB / WIKITEXT-2. The HMM+RNN and HMM of Buys et al (2018) reported validation perplexity only for PTB.
- Table 2: Ablations on PTB (λ = 0.5 and M = 128) with a smaller model, |Z| = $2^{14}$. Time is ms per eval batch (run on an RTX 2080). Ablations were performed independently, removing a single component per row. Removing the neural parameterization results in a scalar parameterization.
- Table 3: Emission constraint ablations on PENN TREEBANK. |Z| is the size of the hidden space, k is the number of hidden states in each block, and M is the number of blocks.
- Table 4: Ablations on PTB (λ = 0.5 and M = 128). Param is the number of parameters, while train and val give the corresponding perplexities. Time is ms per eval batch (run on an RTX 2080).
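
The ablation captions above refer to a dropout rate λ and to M blocks of k states each. As a rough illustration of how state dropout could interact with blocks, the sketch below assumes that state dropout removes a random λ fraction of the states in every block for a given training batch, so that the forward pass only touches the surviving states. This reading, along with the names `sample_state_dropout` and `block_states`, is an assumption for illustration and not the authors' code.

```python
import random


def sample_state_dropout(block_states, dropout_rate, rng):
    """Sample, per block, the subset of states kept for one training batch.

    block_states[b] = list of state ids in block b
    dropout_rate    = fraction of states dropped in each block (lambda)
    rng             = a random.Random instance, for reproducibility
    """
    kept = {}
    for block, states in block_states.items():
        # Keep at least one state so every block can still emit its tokens.
        n_keep = max(1, int(len(states) * (1.0 - dropout_rate)))
        kept[block] = sorted(rng.sample(states, n_keep))
    return kept


# Usage: two blocks of four states each, dropping half of every block.
blocks = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
kept_states = sample_state_dropout(blocks, 0.5, random.Random(0))
```

Sampling a fixed number of survivors per block, rather than dropping each state independently, keeps the per-batch computation at a predictable size.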

Related work

- In order to improve the performance of HMMs on language modeling, several recent papers have combined HMMs with neural networks. Buys et al (2018) develop an approach to relax HMMs, but their models either perform poorly or alter the probabilistic structure to resemble an RNN. Krakovna and Doshi-Velez (2016) utilize model combination with an RNN to connect both approaches in a small state-space model. Our method instead focuses on scaling pure HMMs to a large number of states.

Prior work has also considered neural parameterizations of HMMs. Tran et al (2016) demonstrate improvements in POS induction with a neural parameterization of an HMM. They consider small state spaces, as the goal is tag induction rather than language modeling.

Funding

- This work is supported by CAREER 2037519 and NSF III 1901030.

Reference

- Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.
- James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2016. Quasi-recurrent neural networks. CoRR, abs/1611.01576.
- Viktoriya Krakovna and Finale Doshi-Velez. 2016. Increasing the interpretability of recurrent neural networks using hidden Markov models.
- Thomas Kuhn, Heinrich Niemann, and Ernst Gunter Schukat-Talamazzini. 1994. Ergodic hidden Markov models and polygrams for language modeling. pages 357–360.
- Richard E. Ladner and Michael J. Fischer. 1980. Parallel prefix computation. J. ACM, 27(4):831–838.
- Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Comput. Linguist., 18(4):467–479.
- Jan Buys, Yonatan Bisk, and Yejin Choi. 2018. Bridging HMMs and RNNs through architectural transformations.
- Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: end-to-end optimization stack for deep learning. CoRR, abs/1802.04799.
- Percy Liang. 2005. Semi-supervised learning for natural language. Master's thesis, MIT.
- Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101.
- Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
- Bernard Merialdo. 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–171.
- Antoine Dedieu, Nishad Gothoskar, Scott Swingle, Wolfgang Lehrach, Miguel Lazaro-Gredilla, and Dileep George. 2019. Learning higher-order sequential structure with cloned HMMs.
- Wenjuan Han, Yong Jiang, and Kewei Tu. 2017. Dependency grammar induction with neural lexicalization and big training data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1683–1688, Copenhagen, Denmark. Association for Computational Linguistics.
- Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 690–696, Sofia, Bulgaria. Association for Computational Linguistics.
- Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. CoRR, abs/1708.02182.
- Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. CoRR, abs/1609.07843.
- T. Mikolov and G. Zweig. 2012. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), pages 234–239.
- Tomas Mikolov, Anoop Deoras, Stefan Kombrink, Lukas Burget, and Jan Cernocky. 2011. Empirical evaluation and combination of advanced language modeling techniques. pages 605–608.
- Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. pages 433–440.
- Zhongqiang Huang. 2011. Modeling Dependencies in Natural Languages with Latent Variables. Ph.D. thesis, University of Maryland.
- Ke M. Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. 2016. Unsupervised neu...
- Tim Vieira. 2014. Gumbel-max trick and weighted reservoir sampling.
- Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, COLING ’96, pages 836–841, USA. Association for Computational Linguistics.
- Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2018. Learning neural templates for text generation. CoRR, abs/1808.10122.
- Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. CoRR, abs/1409.2329.
- Brown clustering is an agglomerative clustering approach (Brown et al., 1992; Liang, 2005) that assigns every token type a single cluster. The Brown clustering model aims to find an HMM that maximizes the likelihood of an observed corpus under the constraint that every token type can only be emitted by a single latent class. The cluster for a word is given by the latent class that emits that token type.
- Clusters are initialized by assigning every token type a unique latent state in an HMM. States are then merged iteratively until a desired number M is reached. Liang (2005) proposes an algorithm that chooses a pair of states to merge at every iteration based on state bigram statistics within a window.
- model needed a larger batch size to achieve decent performance. For the LSTM, we use a batch size of 16 and a BPTT length of 32. For both baseline models we use AdamW (Loshchilov and Hutter, 2017) with a learning rate of 1e-3 and a dropout rate of 0.3 on the activations in the model. Both models use a hidden dimension of h = 256 throughout. These same hyperparameters were applied on both PENN TREEBANK and WIKITEXT-2.
- 2. Dropout λ ∈ {0, 0.25, 0.5, 0.75}
- 3. Hidden dimension h ∈ {128, 256, 512}
- 4. Batch size ∈ {16, 32, 64, 128}
