Character-Aware Neural Language Models

AAAI, pp. 2741-2749, 2016.

We propose a language model that leverages subword information through a character-level convolutional neural network, whose output is used as an input to a recurrent neural network language model

Abstract:

We describe a simple neural language model that relies only on character-level inputs. Predictions are still made at the word-level. Our model employs a convolutional neural network (CNN) and a highway network over characters, whose output is given to a long short-term memory (LSTM) recurrent neural network language model (RNN-LM). On the English Penn Treebank the model is on par with the existing state-of-the-art despite having 60% fewer parameters. On languages with rich morphology (Arabic, Czech, French, German, Spanish, Russian), the model outperforms word-level/morpheme-level LSTM baselines, again with fewer parameters. The results suggest that on many languages, character inputs are sufficient for language modeling. Analysis of word representations obtained from the character composition part of the model reveals that the model is able to encode, from characters only, both semantic and orthographic information.
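
To make the architecture described in the abstract concrete, here is a minimal, illustrative sketch in PyTorch. This is not the authors' implementation; all class names are invented, and the hyperparameters (15-dimensional character embeddings, 25·w filters per width w, a 2-layer 300-unit LSTM) are placeholders loosely following the small configuration rather than exact values.

    import torch
    import torch.nn as nn

    class Highway(nn.Module):
        """Highway layer: y = t * relu(W_H x + b_H) + (1 - t) * x, with gate t = sigmoid(W_T x + b_T)."""
        def __init__(self, dim):
            super().__init__()
            self.transform = nn.Linear(dim, dim)
            self.gate = nn.Linear(dim, dim)

        def forward(self, x):
            t = torch.sigmoid(self.gate(x))
            return t * torch.relu(self.transform(x)) + (1.0 - t) * x

    class CharAwareLM(nn.Module):
        """Character-level CNN + highway network feeding a word-level LSTM language model."""
        def __init__(self, char_vocab_size, word_vocab_size, char_dim=15,
                     filter_widths=(1, 2, 3, 4, 5, 6), hidden_dim=300, lstm_layers=2):
            super().__init__()
            self.char_emb = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
            # One set of filters per width; max-over-time pooling turns each filter
            # into a single scalar, so every word is summarized by sum(h_w) features.
            self.convs = nn.ModuleList(
                [nn.Conv1d(char_dim, 25 * w, kernel_size=w) for w in filter_widths])
            feature_dim = sum(25 * w for w in filter_widths)
            self.highway = Highway(feature_dim)
            self.lstm = nn.LSTM(feature_dim, hidden_dim, num_layers=lstm_layers,
                                batch_first=True)
            self.proj = nn.Linear(hidden_dim, word_vocab_size)  # word-level softmax

        def forward(self, chars):
            # chars: (batch, seq_len, max_word_len) character indices;
            # assumes max_word_len >= the largest filter width.
            batch, seq_len, word_len = chars.size()
            x = self.char_emb(chars.reshape(-1, word_len))     # (B*T, L, char_dim)
            x = x.transpose(1, 2)                              # (B*T, char_dim, L)
            feats = [torch.tanh(conv(x)).max(dim=2).values for conv in self.convs]
            y = self.highway(torch.cat(feats, dim=1))          # (B*T, feature_dim)
            out, _ = self.lstm(y.reshape(batch, seq_len, -1))  # (B, T, hidden_dim)
            return self.proj(out)                              # (B, T, |V|) next-word logits

At training time the logits at each position are scored against the next word with a cross-entropy loss, exactly as in a plain word-level LSTM-LM; only the input layer differs.
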

Introduction
  • Language modeling is a fundamental task in artificial intelligence and natural language processing (NLP), with applications in speech recognition, text generation, and machine translation.
  • Traditional count-based n-gram language models are simple to train, but probabilities of rare n-grams can be poorly estimated due to data sparsity.
  • Neural Language Models (NLM) address the n-gram data sparsity issue through parameterization of words as vectors and using them as inputs to a neural network (Bengio, Ducharme, and Vincent 2003; Mikolov et al 2010).
  • Word embeddings obtained through NLMs exhibit the property whereby semantically close words are likewise close in the induced vector space (as is the case with nonneural techniques such as Latent Semantic Analysis (Deerwester, Dumais, and Harshman 1990))
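
For contrast with the character-aware approach, the word-level neural language model alluded to above (in the spirit of Bengio, Ducharme, and Vincent 2003) can be pictured with a minimal PyTorch sketch; the sizes and names below are illustrative assumptions, not values from any of the cited papers.

    import torch
    import torch.nn as nn

    class WordNPLM(nn.Module):
        """Feed-forward neural LM: embed a window of previous words, predict the next word."""
        def __init__(self, vocab_size, emb_dim=100, context=4, hidden=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)    # the learned word vectors
            self.hidden = nn.Linear(context * emb_dim, hidden)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, context_words):
            # context_words: (batch, context) indices of the preceding words
            e = self.emb(context_words)                     # (batch, context, emb_dim)
            h = torch.tanh(self.hidden(e.flatten(1)))       # concatenate the window
            return self.out(h)                              # logits over the next word

Because every surface form owns a separate row of the embedding table, rare and unseen words receive poorly estimated (or no) vectors, which is the gap the character-level input in this paper is designed to close.
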
Highlights
  • We propose a language model that leverages subword information through a character-level convolutional neural network (CNN), whose output is used as an input to a recurrent neural network language model (RNN-LM)
  • The input at time t is the output of a character-level convolutional neural network (CharCNN); a minimal sketch of how a word is turned into this character-level input follows this list
  • While our model requires additional convolution operations over characters and is therefore slower than a comparable word-level model, which only needs a single lookup at the input layer, we found the difference manageable with optimized GPU implementations: on the Penn Treebank, for example, the large character-level model trained at 1500 tokens/sec compared to 3000 tokens/sec for the word-level model
  • We have introduced a neural language model that utilizes only character-level inputs
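
Concretely, the per-timestep input mentioned above is just the current word spelled out as a fixed-length vector of character indices. A minimal sketch, assuming a made-up character inventory, zero-padding, and "{" / "}" as start- and end-of-word markers (all of these choices are illustrative):

    import torch

    # Hypothetical character inventory; index 0 is reserved for padding.
    chars = ["<pad>", "{", "}", "a", "b", "c", "e", "n", "s"]
    char2idx = {c: i for i, c in enumerate(chars)}

    def word_to_char_indices(word, max_word_len=10):
        """Wrap a word in start/end-of-word markers, map to indices, and zero-pad to a fixed length."""
        symbols = ["{"] + list(word) + ["}"]           # '{' and '}' mark the word boundaries
        idx = [char2idx[c] for c in symbols][:max_word_len]
        idx += [0] * (max_word_len - len(idx))         # pad so every word has the same shape
        return torch.tensor(idx)

    # At each position t the LSTM-LM receives one such vector (run through the CharCNN)
    # instead of a word index.
    print(word_to_char_indices("absence"))
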
Methods
  • As is standard in language modeling, the authors use perplexity (PPL) to evaluate the performance of the models.
  • Perplexity of a model over a sequence [w_1, ..., w_T] is given by PPL = exp(NLL/T), where NLL = −Σ_{t=1}^{T} log Pr(w_t | w_{1:t−1}) is the total negative log-likelihood of the sequence (a small worked example follows this list).
  • Corpus statistics (word vocabulary size |V|, character vocabulary size |C|, and number of training tokens T) for the DATA-S and DATA-L datasets are summarized in Table 1.
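
To make the metric concrete, the worked example below (with invented per-token probabilities) shows how perplexity follows from the average negative log-likelihood:

    import math

    # Hypothetical probabilities a model assigns to each token of a 4-token sequence.
    token_probs = [0.2, 0.05, 0.1, 0.25]

    nll = -sum(math.log(p) for p in token_probs)   # total negative log-likelihood
    ppl = math.exp(nll / len(token_probs))         # PPL = exp(NLL / T)
    print(f"NLL = {nll:.2f}, PPL = {ppl:.2f}")     # lower perplexity is better
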
Results
  • English Penn Treebank

    The authors train two versions of the model to assess the trade-off between performance and size.
  • Architecture of the small (LSTM-Char-Small) and large (LSTM-Char-Large) models is summarized in Table 2.
  • As another baseline, the authors train two comparable LSTM models that use word embeddings only (LSTM-Word-Small, LSTM-Word-Large).
  • Word embedding sizes are 200 and 650 for the small and large word-level models, respectively.
  • These sizes were chosen to keep the number of parameters similar to that of the corresponding character-level models.
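
A rough back-of-the-envelope comparison illustrates why the input layers differ so much in size: a word-level model spends |V| × 650 parameters on its embedding table alone, while the character-level input needs only a small character embedding table plus the convolution filters. The sketch below assumes the standard 10k Penn Treebank vocabulary; the character inventory size (≈51) and the 15-dimensional character embeddings are assumptions based on our reading of the paper, and the filter widths and counts follow the large-model description in the Table 2 caption.

    # Rough input-layer parameter counts for the large PTB models (biases ignored).
    word_vocab = 10_000            # standard Penn Treebank word vocabulary
    word_emb_dim = 650             # large word-level model (see above)
    word_input_params = word_vocab * word_emb_dim                  # 6,500,000

    char_vocab = 51                # approximate PTB character inventory (assumption)
    char_emb_dim = 15              # assumed character embedding size
    widths  = [1, 2, 3, 4, 5, 6, 7]                  # large model, Table 2
    filters = [50, 100, 150, 200, 200, 200, 200]     # 1100 filters in total
    conv_params = sum(h * w * char_emb_dim for h, w in zip(filters, widths))
    char_input_params = char_vocab * char_emb_dim + conv_params    # roughly 77,000

    print(word_input_params, char_input_params)

Everything above the input layer (highway and LSTM parameters) is ignored here; the point is only that the character-level input layer is far smaller than a word-embedding lookup table, which is what makes the parameter-matched comparison in Table 3 possible.
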
Conclusion
  • The authors explore the word representations learned by the models on the PTB.
  • After the highway layers, the nearest neighbor of "you" is "we", which is orthographically distinct from "you".
  • Another example is "while" and "though": these words are far apart in terms of edit distance, yet the composition model is able to place them near each other.
  • Analysis of word representations obtained from the character composition part of the model further indicates that the model is able to encode, from characters only, rich semantic and orthographic features.
  • Using the CharCNN and highway layers for representation learning (e.g. as input into word2vec (Mikolov et al 2013)) remains an avenue for future work
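
The nearest-neighbor analysis mentioned above reduces to ranking words by cosine similarity between their representation vectors. A minimal sketch, with random placeholder vectors standing in for the CharCNN + highway outputs so that the snippet runs on its own (the word list and feature size are illustrative):

    import torch
    import torch.nn.functional as F

    # Placeholder vectors; in the paper these would be CharCNN + highway outputs.
    words = ["you", "we", "they", "your", "while", "though"]
    vectors = torch.randn(len(words), 525)        # 525 is an illustrative feature size

    def nearest_neighbors(query, k=3):
        q = vectors[words.index(query)].unsqueeze(0)
        sims = F.cosine_similarity(q, vectors)            # similarity to every word
        ranked = sims.argsort(descending=True).tolist()
        return [words[i] for i in ranked if words[i] != query][:k]

    # With trained character-composed vectors, neighbors such as "we" for "you" emerge.
    print(nearest_neighbors("you"))
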
Tables
  • Table1: Corpus statistics. |V| = word vocabulary size; |C| = character vocabulary size; T = number of tokens in training set. The small English data is from the Penn Treebank and the Arabic data is from the News-Commentary corpus. The rest are from the 2013 ACL Workshop on Machine Translation. |C| is large because of (rarely occurring) special characters
  • Table2: Architecture of the small and large models. d = dimensionality of character embeddings; w = filter widths; h = number of filter matrices, as a function of filter width (so the large model has filters of width [1, 2, 3, 4, 5, 6, 7] of size [50, 100, 150, 200, 200, 200, 200] for a total of 1100 filters); f, g = nonlinearity functions; l = number of layers; m = number of hidden units
  • Table3: Performance of our model versus other neural language models on the English Penn Treebank test set. PPL refers to perplexity (lower is better) and size refers to the approximate number of parameters in the model. KN-5 is a Kneser-Ney 5-gram language model which serves as a non-neural baseline. †For these models the authors did not explicitly state the number of parameters, and hence sizes shown here are estimates based on our understanding of their papers or private correspondence with the respective authors
  • Table4: Test set perplexities for DATA-S. First two rows are from Botha (2014) (except on Arabic where we trained our own KN-4 model) while the last six are from this paper. KN-4 is a Kneser-Ney 4-gram language model, and MLBL is the best performing morphological log-bilinear model from Botha (2014). Small/Large refer to model size (see Table 2), and Word/Morph/Char are models with words/morphemes/characters as inputs respectively
  • Table5: Test set perplexities on DATA-L. First two rows are from Botha (2014), while the last three rows are from the small LSTM models described in the paper. KN-4 is a Kneser-Ney 4-gram language model, and MLBL is the best performing morphological log-bilinear model from Botha (2014). Word/Morph/Char are models with words/morphemes/characters as inputs respectively
  • Table6: Nearest neighbor words (based on cosine similarity) of word representations from the large word-level and character-level (before and after highway layers) models trained on the PTB. Last three words are OOV words, and therefore they do not have representations in the word-level model
  • Table7: Perplexity on the Penn Treebank for small/large models trained with/without highway layers
  • Table8: Perplexity reductions by going from small word-level to character-level models based on different corpus/vocabulary sizes on German (DE). |V| is the vocabulary size and T is the number of tokens in the training set. The full vocabulary of the 1m dataset was less than 100k and hence that scenario is unavailable
Contributions
  • Describes a simple neural language model that relies only on character-level inputs
  • Proposes a language model that leverages subword information through a character-level convolutional neural network , whose output is used as an input to a recurrent neural network language model
  • Describes the process by which one feature is obtained from one filter matrix (a minimal sketch of this step follows below)
  • Notes that one could simply replace the word embedding x^k with the CharCNN output y^k at each time step t in the RNN-LM, and, as shown later, this simple model performs well on its own
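
The "one feature from one filter matrix" step is a narrow convolution of the filter over the word's character-embedding matrix, followed by a tanh nonlinearity and max-over-time pooling; concatenating the pooled outputs of all filters yields the vector y^k that can stand in for the word embedding x^k. A minimal sketch with invented sizes (the bias term is omitted):

    import torch

    torch.manual_seed(0)
    d, L, w = 15, 9, 3                 # char embedding size, word length, filter width
    C = torch.randn(d, L)              # character-embedding matrix of one word
    H = torch.randn(d, w)              # one filter matrix

    # Narrow convolution: one tanh response per character window, then max-over-time pooling.
    responses = torch.stack(
        [torch.tanh((H * C[:, i:i + w]).sum()) for i in range(L - w + 1)])
    feature = responses.max()          # a single scalar feature for this filter

    # Doing this for h filters (of varying widths) and concatenating the scalars gives
    # y^k, which stands in for the word embedding x^k at time step t in the RNN-LM.
    print(feature)
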
Reference
  • Alexandrescu, A., and Kirchhoff, K. 2006. Factored Neural Language Models. In Proceedings of NAACL.
  • Ballesteros, M.; Dyer, C.; and Smith, N. A. 2015. Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs. In Proceedings of EMNLP.
  • Bengio, Y.; Ducharme, R.; and Vincent, P. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research 3:1137–1155.
  • Bengio, Y.; Simard, P.; and Frasconi, P. 1994. Learning Long-term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks 5:157–166.
  • Bilmes, J., and Kirchhoff, K. 2003. Factored Language Models and Generalized Parallel Backoff. In Proceedings of NAACL.
  • Botha, J., and Blunsom, P. 2014. Compositional Morphology for Word Representations and Language Modelling. In Proceedings of ICML.
  • Botha, J. 2014. Probabilistic Modelling of Morphologically Rich Languages. DPhil Dissertation, Oxford University.
  • Chen, S., and Goodman, J. 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report, Harvard University.
  • Cheng, W. C.; Kok, S.; Pham, H. V.; Chieu, H. L.; and Chai, K. M. 2014. Language Modeling with Sum-Product Networks. In Proceedings of INTERSPEECH.
  • Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of EMNLP.
  • Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research 12:2493–2537.
  • Creutz, M., and Lagus, K. 2007. Unsupervised Models for Morpheme Segmentation and Morphology Learning. ACM Transactions on Speech and Language Processing 4(1).
  • Deerwester, S.; Dumais, S.; and Harshman, R. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41:391–407.
  • dos Santos, C. N., and Guimaraes, V. 2015. Boosting Named Entity Recognition with Neural Character Embeddings. In Proceedings of the ACL Named Entities Workshop.
  • dos Santos, C. N., and Zadrozny, B. 2014. Learning Character-level Representations for Part-of-Speech Tagging. In Proceedings of ICML.
  • Graves, A. 2013. Generating Sequences with Recurrent Neural Networks. arXiv:1308.0850.
  • Hinton, G.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2012. Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors. arXiv:1207.0580.
  • Hochreiter, S., and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9:1735–1780.
  • Kalchbrenner, N.; Grefenstette, E.; and Blunsom, P. 2014. A Convolutional Neural Network for Modelling Sentences. In Proceedings of ACL.
  • Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of EMNLP.
  • Krizhevsky, A.; Sutskever, I.; and Hinton, G. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of NIPS.
  • LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1989. Handwritten Digit Recognition with a Back-propagation Network. In Proceedings of NIPS.
  • Lei, T.; Barzilay, R.; and Jaakkola, T. 2015. Molding CNNs for Text: Non-linear, Non-consecutive Convolutions. In Proceedings of EMNLP.
  • Ling, W.; Luís, T.; Marujo, L.; Astudillo, R. F.; Amir, S.; Dyer, C.; Black, A. W.; and Trancoso, I. 2015. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. In Proceedings of EMNLP.
  • Luong, M.-T.; Socher, R.; and Manning, C. 2013. Better Word Representations with Recursive Neural Networks for Morphology. In Proceedings of CoNLL.
  • Marcus, M.; Santorini, B.; and Marcinkiewicz, M. 1993. Building a Large Annotated Corpus of English: the Penn Treebank. Computational Linguistics 19:313–330.
  • Mikolov, T., and Zweig, G. 2012. Context Dependent Recurrent Neural Network Language Model. In Proceedings of SLT.
  • Mikolov, T.; Karafiat, M.; Burget, L.; Cernocky, J.; and Khudanpur, S. 2010. Recurrent Neural Network Based Language Model. In Proceedings of INTERSPEECH.
  • Mikolov, T.; Deoras, A.; Kombrink, S.; Burget, L.; and Cernocky, J. 2011. Empirical Evaluation and Combination of Advanced Language Modeling Techniques. In Proceedings of INTERSPEECH.
  • Mikolov, T.; Sutskever, I.; Deoras, A.; Le, H.-S.; Kombrink, S.; and Cernocky, J. 2012. Subword Language Modeling with Neural Networks. Preprint: www.fit.vutbr.cz/imikolov/rnnlm/char.pdf.
  • Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
  • Mnih, A., and Hinton, G. 2007. Three New Graphical Models for Statistical Language Modelling. In Proceedings of ICML.
  • Morin, F., and Bengio, Y. 2005. Hierarchical Probabilistic Neural Network Language Model. In Proceedings of AISTATS.
  • Pascanu, R.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2013. How to Construct Deep Recurrent Neural Networks. arXiv:1312.6026.
  • Qiu, S.; Cui, Q.; Bian, J.; and Gao, B. 2014. Co-learning of Word Representations and Morpheme Representations. In Proceedings of COLING.
  • Shen, Y.; He, X.; Gao, J.; Deng, L.; and Mesnil, G. 2014. A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval. In Proceedings of CIKM.
  • Srivastava, R. K.; Greff, K.; and Schmidhuber, J. 2015. Training Very Deep Networks. arXiv:1507.06228.
  • Sundermeyer, M.; Schluter, R.; and Ney, H. 2012. LSTM Neural Networks for Language Modeling. In Proceedings of INTERSPEECH.
  • Sutskever, I.; Martens, J.; and Hinton, G. 2011. Generating Text with Recurrent Neural Networks. In Proceedings of ICML.
  • Sutskever, I.; Vinyals, O.; and Le, Q. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of NIPS.
  • Wang, M.; Lu, Z.; Li, H.; Jiang, W.; and Liu, Q. 2015. genCNN: A Convolutional Architecture for Word Sequence Prediction. In Proceedings of ACL.
  • Werbos, P. 1990. Back-propagation Through Time: What It Does and How to Do It. Proceedings of the IEEE 78:1550–1560.
  • Zaremba, W.; Sutskever, I.; and Vinyals, O. 2014. Recurrent Neural Network Regularization. arXiv:1409.2329.
  • Zhang, S.; Jiang, H.; Xu, M.; Hou, J.; and Dai, L. 2015. The Fixed-Size Ordinally-Forgetting Encoding Method for Neural Network Language Models. In Proceedings of ACL.
  • Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level Convolutional Networks for Text Classification. In Proceedings of NIPS.