# Data Noising as Smoothing in Neural Network Language Models

ICLR, Volume abs/1703.02573, 2017.

Abstract:

Data noising is an effective technique for regularizing neural network models. While noising is widely adopted in application domains such as vision and speech, commonly used noising primitives have not been developed for discrete sequence-level settings such as language modeling. In this paper, we derive a connection between input noisin...

Introduction

- Language models are a crucial component in many domains, such as autocompletion, machine translation, and speech recognition.
- A key challenge when performing estimation in language modeling is the data sparsity problem: due to large vocabulary sizes and the exponential number of possible contexts, the majority of possible sequences are rarely or never observed, even for very short subsequences.
- Neural network models have no notion of discrete counts, and instead use distributed representations to combat the curse of dimensionality (Bengio et al, 2003).
- Existing regularization methods are typically applied to weights or hidden units within the network (Srivastava et al, 2014; Le et al, 2015) instead of directly considering the input data

Highlights

- Language models are a crucial component in many domains, such as autocompletion, machine translation, and speech recognition
- Data augmentation has been key to improving the performance of neural network models in the face of insufficient data
- Widely-adopted noising primitives have not yet been developed for neural network language models
- We demonstrate the effectiveness of these schemes for regularization through experiments on language modeling and machine translation
- An L-layer recurrent neural network is modeled as $h^{(l)}_t = f_\theta(h^{(l)}_{t-1}, h^{(l-1)}_t)$, where $l$ denotes the layer index, $h^{(0)}$ contains the one-hot encoding of X, and in its simplest form $f_\theta$ applies an affine transformation followed by a nonlinearity
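
The layered recurrence in the last highlight, where each layer's state at step t depends on its own state at step t-1 and on the state of the layer below at step t, can be sketched in a few lines. This is a minimal illustration in plain Python, assuming a tanh nonlinearity and hand-supplied weights; the function names `rnn_cell`, `one_hot`, and `run_stacked_rnn` are hypothetical, and the paper's experiments use LSTM cells rather than this simplest form.

```python
import math


def rnn_cell(W_h, W_x, b, h_prev, x):
    """Simplest-form f_theta: tanh(W_h @ h_prev + W_x @ x + b), using plain lists."""
    out = []
    for i in range(len(b)):
        s = b[i]
        s += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        s += sum(W_x[i][j] * x[j] for j in range(len(x)))
        out.append(math.tanh(s))
    return out


def one_hot(index, vocab_size):
    """One-hot encoding of a token index; plays the role of h^(0)_t."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


def run_stacked_rnn(token_ids, vocab_size, layers):
    """Run an L-layer RNN over a sequence and return the top-layer states.

    layers is a list of (W_h, W_x, b) tuples, one per layer (assumed layout).
    """
    # One hidden state per layer, initialized to zeros.
    states = [[0.0] * len(b) for (_, _, b) in layers]
    top_states = []
    for tok in token_ids:
        below = one_hot(tok, vocab_size)
        for l, (W_h, W_x, b) in enumerate(layers):
            # New state from the same layer at t-1 and the layer below at t.
            states[l] = rnn_cell(W_h, W_x, b, states[l], below)
            below = states[l]
        top_states.append(below)
    return top_states
```

Input noising, as studied in the paper, would act on the token indices before `one_hot` is applied, leaving the recurrence itself unchanged.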

Methods

- The authors consider language models where, given a sequence of indices X = (x1, x2, · · · , xT ) over the vocabulary V, the authors model $p(X) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$.
- Recurrent neural network (RNN) language models can model longer dependencies, since they operate over distributed hidden states instead of modeling an exponential number of discrete counts (Bengio et al, 2003; Mikolov, 2012).
- The authors consider encoder-decoder or sequence-to-sequence (Cho et al, 2014; Sutskever et al, 2014) models where, given an input sequence X and output sequence Y of length TY, the authors model $p(Y \mid X) = \prod_{t=1}^{T_Y} p(y_t \mid X, y_{<t})$.
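
A language model of this kind scores a sequence through the chain rule, so the sequence log-probability is a sum of per-token conditional log-probabilities, and the perplexity reported in the tables is the exponential of the negative mean of those terms. The sketch below is a minimal illustration of that arithmetic; the function names are hypothetical and the input is assumed to be a list of per-token conditional probabilities already produced by some model.

```python
import math


def sequence_log_prob(cond_probs):
    """log p(X) = sum over t of log p(x_t | x_<t)."""
    return sum(math.log(p) for p in cond_probs)


def perplexity(cond_probs):
    """exp of the negative average per-token log-probability."""
    return math.exp(-sequence_log_prob(cond_probs) / len(cond_probs))
```

For example, a model that assigns probability 0.25 to every token of a sequence has a perplexity of about 4 on it, as if it were choosing uniformly among four options at each step.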

Results

- The proposed bigram Kneser-Ney noising scheme gives an additional performance boost of +0.5-0.7 on top of the blank noising and unigram noising models, yielding a total gain of +1.4 BLEU

Conclusion

5.1 Scaling γ via discounting

- The authors examine whether discounting has the desired effect of noising subsequences according to their uncertainty.
- When discounting is used to rescale the noising probability, common tokens are often noised infrequently, while rare tokens are noised much more frequently; in the extreme case where a token appears exactly once, γAD = γ0.
- The authors compare the performance of models trained with a fixed γ0 versus a γ0 rescaled using discounting.
- The discounting ratio seems to effectively capture the “right” tokens to noise.
- In this work, the authors show that data noising is effective for regularizing neural network-based sequence models.
- Possible applications include exploring noising for improving performance in low resource settings, or examining how these techniques generalize to sequence modeling in other domains
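
The rescaling discussed above can be illustrated with bigram counts. The sketch below assumes the discounted noising probability takes the form γAD(x1) = γ0 · N1+(x1, •) / c(x1), where N1+(x1, •) is the number of distinct tokens that follow x1 and c(x1) is the number of times x1 occurs as a context. That form is consistent with the limiting case stated above (a context seen exactly once gets γAD = γ0) but is not spelled out in this summary, and the function name `discounted_gamma` is hypothetical.

```python
from collections import Counter, defaultdict


def discounted_gamma(tokens, gamma0):
    """Per-context noising probability, rescaled by an assumed discount ratio."""
    context_count = Counter()      # c(x1): how often x1 starts a bigram
    followers = defaultdict(set)   # distinct x2 seen after x1
    for x1, x2 in zip(tokens, tokens[1:]):
        context_count[x1] += 1
        followers[x1].add(x2)
    return {
        x1: gamma0 * len(followers[x1]) / context_count[x1]
        for x1 in context_count
    }
```

Under this assumption, a context that is usually followed by the same few tokens is noised less than γ0, while a context whose every occurrence has a different continuation is noised at the full γ0.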

- Table 1: Noising schemes. Example noising schemes and their bigram smoothing analogues. Here we consider the bigram probability p(x1, x2) = p(x2|x1)p(x1). Notation: γ(x1:t) denotes the noising probability for a given input sequence x1:t, q(x) denotes the proposal distribution, and N1+(x, •) denotes the number of distinct bigrams in the training set where x is the first unigram. In all but the last case we only noise the context x1 and not the target prediction x2.
- Table 2: Single-model perplexity on Penn Treebank with different noising schemes. We also compare to the variational method of Gal (2015), who also train LSTM models with the same hidden dimension. Note that performing Monte Carlo dropout at test time is significantly more expensive than our approach, where test time is unchanged.
- Table 3: Perplexity on Text8 with different noising schemes.
- Table 4: Perplexities and BLEU scores for machine translation task. Results for bigram KN noising on only the source sequence and only the target sequence are given as well.
- Table 5: Perplexity of last unigram for unseen bigrams and trigrams in Penn Treebank validation set. We compare noised and unnoised models with noising probabilities chosen such that models have near-identical perplexity on full validation set.
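
The two simplest schemes named in the results, blank noising and unigram noising, can be sketched using the notation of the Table 1 caption: each context token is replaced independently with probability γ, either by a placeholder or by a draw from a proposal distribution q(x). This is a minimal sketch under stated assumptions: the placeholder symbol `"_"`, the function names, and the use of a fixed γ for every token are illustrative choices, and q(x) is taken to be the unigram frequency distribution, obtained here by sampling uniformly from a list of training tokens.

```python
import random


def blank_noise(tokens, gamma, rng, blank="_"):
    """Replace each token with a blank placeholder with probability gamma."""
    return [blank if rng.random() < gamma else tok for tok in tokens]


def unigram_noise(tokens, gamma, rng, unigram_pool):
    """Replace each token, with probability gamma, by a token drawn from q(x).

    Sampling uniformly from unigram_pool (a list of training tokens, with
    repeats) is equivalent to sampling from the unigram frequency distribution.
    """
    return [
        rng.choice(unigram_pool) if rng.random() < gamma else tok
        for tok in tokens
    ]
```

A typical use would apply one of these functions to the input context at each training step while leaving the prediction targets untouched, for example `blank_noise(context, 0.2, random.Random(0))`.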

Related work

- Our work can be viewed as a form of data augmentation, for which to the best of our knowledge there exist no widely adopted schemes in language modeling with neural networks. Classical regularization methods such as L2-regularization are typically applied to the model parameters, while dropout is applied to activations, which can be along the forward as well as the recurrent directions (Zaremba et al, 2014; Semeniuta et al, 2016; Gal, 2015). Others have introduced methods for recurrent neural networks encouraging the hidden activations to remain stable in norm, or constraining the recurrent weight matrix to have eigenvalues close to one (Krueger & Memisevic, 2015; Arjovsky et al, 2015; Le et al, 2015). These methods, however, all consider weights and hidden units instead of the input data, and are motivated by the vanishing and exploding gradient problem.

Feature noising has been demonstrated to be effective for structured prediction tasks, and has been interpreted as an explicit regularizer (Wang et al, 2013). Additionally, Wager et al (2014) show that noising can inject appropriate generative assumptions into discriminative models to reduce their generalization error, but do not consider sequence models (Wager et al, 2016).

Funding

- ZX, SW, and JL were supported by an NDSEG Fellowship, NSERC PGS-D Fellowship, and Facebook Fellowship, respectively
- This project was funded in part by DARPA MUSE award FA8750-15-C-0242 AFRL/RIKF

Reference

- Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
- Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. arXiv preprint arXiv:1511.06464, 2015.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Neural Information Processing Systems (NIPS), 2015.
- Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 2003.
- Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
- Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. Class-based n-gram models of natural language. Computational Linguistics, 1992.
- Stanley F Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. In Association for Computational Linguistics (ACL), 1996.
- Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pp. 3061–3069, 2015.
- Li Deng, Alex Acero, Mike Plumpe, and Xuedong Huang. Large-vocabulary speech recognition under adverse acoustic environments. In ICSLP, 2000.
- Yarin Gal. A theoretically grounded application of dropout in recurrent neural networks. arXiv:1512.05287, 2015.
- Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
- Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
- Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. Deep unordered composition rivals syntactic methods for text classification. In Association for Computational Linguistics (ACL), 2015.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
- David Krueger and Roland Memisevic. Regularizing rnns by stabilizing activations. arXiv preprint arXiv:1511.08400, 2015.
- Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. arXiv preprint arXiv:1506.07285, 2015.
- Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Minh-Thang Luong and Christopher D Manning. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation, 2015.
- Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), 2015.
- Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Tomas Mikolov. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
- Vu Pham, Theodore Bluche, Christopher Kermorvant, and Jerome Louradour. Dropout improves recurrent neural networks for handwriting recognition. In Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, 2014.
- Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent dropout without memory loss. arXiv preprint arXiv:1603.05118, 2016.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014.
- Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
- Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.
- S. Wager, W. Fithian, S. I. Wang, and P. Liang. Altitude training: Strong bounds for single-layer dropout. In Advances in Neural Information Processing Systems (NIPS), 2014.
- Stefan Wager, William Fithian, and Percy Liang. Data augmentation via Lévy processes. arXiv preprint arXiv:1603.06340, 2016.
- Sida I Wang, Mengqiu Wang, Stefan Wager, Percy Liang, and Christopher D Manning. Feature noising for log-linear structured prediction. In Empirical Methods in Natural Language Processing (EMNLP), 2013.
- Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
- Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.
