# A Scalable Hierarchical Distributed Language Model

NIPS, pp.1081-1088, (2008)


Abstract

Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which wa...


Introduction

- Statistical language modelling is concerned with building probabilistic models of word sequences.
- The vast majority of statistical language models are based on the Markov assumption, which states that the distribution of a word depends only on some fixed number of words that immediately precede it.
- While this assumption is clearly false, it is very convenient because it reduces the problem of modelling the probability distribution of word sequences of arbitrary length to the problem of modelling the distribution on the word given some fixed number of preceding words, called the context.
- Clustering words into classes, as in class-based n-gram models, can improve n-gram performance, but this approach introduces a very rigid kind of similarity, since each word typically belongs to exactly one class.
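To make the Markov assumption concrete, here is a minimal sketch of an unsmoothed count-based n-gram model. The function name `train_ngram` and the toy corpus are illustrative assumptions, not taken from the paper; real n-gram baselines use smoothing such as Kneser-Ney.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n=3):
    """Estimate P(word | previous n-1 words) by relative frequency (no smoothing)."""
    counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])  # the n-1 words preceding position i
        counts[context][tokens[i]] += 1
    # Normalize each context's counts into a conditional distribution.
    return {
        ctx: {w: c / sum(ctr.values()) for w, c in ctr.items()}
        for ctx, ctr in counts.items()
    }

# Toy corpus (hypothetical).
tokens = "the cat sat on the mat the cat ran".split()
table = train_ngram(tokens, n=3)
print(table[("the", "cat")])
```

Each context owns a separate row of the table, so nothing learned for `("the", "cat")` transfers to an unseen context such as `("the", "dog")`. That independence between entries is the limitation that distributed (feature-based) models aim to remove.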

Highlights

- Classical n-gram models do not share information between similar contexts, mainly because they are essentially conditional probability tables whose entries are estimated independently of each other.
- After training the hierarchical log-bilinear model, we summarize each context w1:n−1 with the predicted feature vector produced from it using Eq. 1.
- We have presented a simple and fast feature-based algorithm for automatic construction of such hierarchies.
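Eq. 1 is not reproduced in this summary. The sketch below assumes it has the usual log-bilinear form, in which the predicted feature vector is a sum of context word vectors, each transformed by a position-specific matrix: r_hat = sum_i C_i r_{w_i}. The names `R`, `C`, `predicted_feature_vector`, and `mean_vector`, and all numbers, are hypothetical.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def predicted_feature_vector(context_ids, R, C):
    """Assumed form of Eq. 1: r_hat = sum_i C[i] @ R[context_ids[i]]."""
    r_hat = [0.0] * len(R[0])
    for C_i, w in zip(C, context_ids):
        contrib = matvec(C_i, R[w])
        r_hat = [a + b for a, b in zip(r_hat, contrib)]
    return r_hat

def mean_vector(vectors):
    """Average a list of equal-length vectors, e.g. all r_hat seen before one word."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Toy setup (hypothetical): 3 words with 2-D feature vectors, context size 2.
R = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = [
    [[1.0, 0.0], [0.0, 1.0]],  # position 1: identity
    [[0.0, 1.0], [1.0, 0.0]],  # position 2: swaps the two coordinates
]
r_hat = predicted_feature_vector([0, 2], R, C)
print(r_hat)
print(mean_vector([r_hat, [0.0, 3.0]]))
```

Averaging the predicted vectors over all contexts in which a word occurs gives one summary vector per word, which is the kind of representation a feature-based clustering of words into a tree could operate on.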

Results

- The authors compared the performance of the models on the APNews dataset containing the Associated Press news stories from 1995 and 1996.
- The vocabulary size for this dataset is 17964.
- The authors chose this dataset because it had already been used to compare the performance of neural models to that of n-gram models in [1] and [9], which allowed them to compare the results to the results in those papers.
- All models were compared based on their perplexity score on the test set.
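Perplexity is the exponential of the average negative log-probability a model assigns to each test word. The sketch below uses the standard definition; the function name and the uniform-model example are illustrative, not results from the paper.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities on a test set."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that spreads probability uniformly over the 17964-word vocabulary
# has a perplexity equal (up to rounding) to the vocabulary size.
uniform = [math.log(1 / 17964)] * 10
print(perplexity(uniform))
```

Lower is better: a model only beats this uniform baseline by concentrating probability on the words that actually occur.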

Conclusion

**Discussion and future work**

The authors have demonstrated that a hierarchical neural language model can outperform its nonhierarchical counterparts and achieve state-of-the-art performance.

- The key to making a hierarchical model perform well is using a carefully constructed hierarchy over words.
- Creating hierarchies in which every word occurred more than once was essential to getting the models to perform better.
- The failure to use multiple codes for words with several very different senses is probably a consequence of summarizing the distribution over contexts with a single mean feature vector when clustering words.
- The “sense multimodality” of context distributions would be better captured by using a small set of feature vectors found by clustering the contexts.

- Table 1: Trees of words generated by the feature-based algorithm. The mean code length is the sum of the lengths of the codes associated with a word, averaged over the distribution of the words in the training data. The run-time complexity of the hierarchical model is linear in the mean code length of the tree used. The mean number of codes per word refers to the number of codes per word averaged over the training data distribution. Since each non-leaf node in a tree has its own feature vector, the number of free parameters associated with the tree is linear in this quantity.
- Table 2: The effect of the feature dimensionality and the word tree used on the test set perplexity of the model.
- Table 3: Test set perplexity results for the hierarchical LBL models. All the distributed models in the comparison used 100-dimensional feature vectors and a context size of 5. LBL is the nonhierarchical log-bilinear model. KNn is a Kneser-Ney n-gram model. The scores for LBL, KN3, and KN5 are from [9]. The timing for LBL is based on our implementation of the model.
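The two tree statistics defined in the Table 1 caption can be computed as in the following sketch. The words, binary codes, and frequencies are invented for illustration, and the function names are hypothetical.

```python
def mean_code_length(codes, word_freq):
    """Sum of the code lengths of a word, averaged over the word distribution."""
    total = sum(word_freq.values())
    return sum(
        word_freq[w] / total * sum(len(c) for c in codes[w])
        for w in word_freq
    )

def mean_codes_per_word(codes, word_freq):
    """Number of codes of a word, averaged over the word distribution."""
    total = sum(word_freq.values())
    return sum(word_freq[w] / total * len(codes[w]) for w in word_freq)

# Hypothetical tree in which "bank" appears at two leaves (two codes).
codes = {"the": ["00"], "cat": ["011"], "bank": ["010", "1101"]}
word_freq = {"the": 6, "cat": 3, "bank": 1}
print(mean_code_length(codes, word_freq))
print(mean_codes_per_word(codes, word_freq))
```

Giving a frequent word a short code lowers the mean code length, and with it the per-word computation, which the caption states is linear in this quantity.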

Funding

- This research was supported by NSERC and CFI.

Study subjects and analysis


Each word corresponds to a leaf in the tree and can be uniquely specified by the path from the root to that leaf. If N is the number of words in the vocabulary and the tree is balanced, any word can be specified by a sequence of O(log N) binary decisions indicating which of the two children of the current node is to be visited next. This setup replaces one N-way choice by a sequence of O(log N) binary choices.
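The replacement of one N-way choice by a path of binary choices can be sketched as follows. The summary does not give the per-node decision formula, so this example assumes each non-leaf node applies a logistic function to the dot product of its own feature vector with the context's predicted feature vector. The tree, `Q`, `b`, `r_hat`, and the convention that bit 1 means "first child" are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def word_probability(code, path_nodes, r_hat, Q, b):
    """P(word | context) as a product of binary decisions along the word's path."""
    prob = 1.0
    for d, node in zip(code, path_nodes):
        score = sum(q * r for q, r in zip(Q[node], r_hat)) + b[node]
        p_one = sigmoid(score)  # assumed form of P(d = 1 | node, context)
        prob *= p_one if d == 1 else 1.0 - p_one
    return prob

# Toy balanced tree over 4 words: node 0 is the root, nodes 1 and 2 its children.
# Each word maps to (binary code, non-leaf nodes visited on the way to its leaf).
tree = {
    "a": ((1, 1), (0, 1)),
    "b": ((1, 0), (0, 1)),
    "c": ((0, 1), (0, 2)),
    "d": ((0, 0), (0, 2)),
}
Q = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8]]  # one feature vector per non-leaf node
b = [0.0, 0.1, -0.2]                        # one bias per non-leaf node
r_hat = [2.0, 1.0]                          # predicted feature vector for a context

probs = {w: word_probability(code, nodes, r_hat, Q, b)
         for w, (code, nodes) in tree.items()}
print(probs)
print(sum(probs.values()))
```

Each word costs two binary decisions here instead of a normalization over all four words, and the leaf probabilities still sum to one because every node splits its probability mass between its two children.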

References

- [1] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
- [2] Yoshua Bengio and Jean-Sebastien Senecal. Quick training of probabilistic neural nets by importance sampling. In AISTATS’03, 2003.
- [3] P.F. Brown, R.L. Mercer, V.J. Della Pietra, and J.C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
- [4] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics, pages 310–318, San Francisco, 1996.
- [5] Ahmad Emami, Peng Xu, and Frederick Jelinek. Using a connectionist model in a syntactical based language model. In Proceedings of ICASSP, volume 1, pages 372–375, 2003.
- [6] C. Fellbaum et al. WordNet: an electronic lexical database. Cambridge, Mass: MIT Press, 1998.
- [7] J. Goodman. A bit of progress in language modeling. Technical report, Microsoft Research, 2000.
- [8] John G. McMahon and Francis J. Smith. Improving statistical language model performance with automatically generated word hierarchies. Computational Linguistics, 22(2):217–247, 1996.
- [9] A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641–648, 2007.
- [10] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani, editors, AISTATS’05, pages 246–252, 2005.
- [11] F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In Proceedings of the 31st Conference on Association for Computational Linguistics, pages 183–190, 1993.
- [12] Holger Schwenk and Jean-Luc Gauvain. Connectionist language modeling for large vocabulary continuous speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 765–768, 2002.
