Pointing the Unknown Words

ACL, pp. 140-149, 2016.

Keywords:
Gigaword dataset, rare word problem, unknown word, machine translation, unseen word

Abstract:

The problem of rare and unknown words is an important issue that can potentially affect the performance of many NLP systems, including traditional count-based and deep learning models. We propose a novel way to deal with the rare and unseen words for neural network models using attention. Our model uses two softmax layers in order to predict the next word in conditional language models: one predicts the location of a word in the source sentence, and the other predicts a word in the shortlist vocabulary.

Introduction
  • Words are the basic input/output units in most NLP systems, and the ability to cover a large number of words is key to building a robust NLP system.
  • Even if the authors have a very large shortlist that includes all unique words in the training set, it does not necessarily improve test performance, because there is still a chance of seeing an unknown word at test time.
  • This is known as the unknown word problem.
  • Increasing the shortlist size mostly adds rare words, due to Zipf's law
Highlights
  • Words are the basic input/output units in most NLP systems, and the ability to cover a large number of words is key to building a robust NLP system
  • A common approach followed by recent neural network based NLP systems is to use a softmax output layer in which each output dimension corresponds to a word in a predefined word shortlist (a minimal sketch of such a shortlist softmax appears after this list)
  • Even if we have a very large shortlist that includes all unique words in the training set, it does not necessarily improve test performance, because there is still a chance of seeing an unknown word at test time
  • Some words still need to be labeled as unknown, namely those that appear neither in the shortlist nor in the context; in experiments we show that learning when and where to point improves performance in machine translation and text summarization
  • In Section 3, we review neural machine translation with an attention mechanism, which is the baseline in our experiments
  • By doing a very simple modification over the neural machine translation model, our model is able to generalize to unseen words and can deal with rare words more efficiently
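To make the shortlist-softmax bottleneck described above concrete, here is a minimal numpy sketch, not the authors' Theano implementation: it maps a decoder state to a distribution over a small fixed shortlist, so any out-of-shortlist word can only be predicted as the unknown token. The shortlist, `W_out`, and the random decoder state are made-up placeholders.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy shortlist: every word outside it collapses to <unk>.
shortlist = ["<unk>", "the", "translation", "of", "words", "is"]
word2id = {w: i for i, w in enumerate(shortlist)}

rng = np.random.default_rng(0)
hidden_size = 8
W_out = rng.normal(size=(hidden_size, len(shortlist)))  # output projection
b_out = np.zeros(len(shortlist))

h_t = rng.normal(size=hidden_size)        # decoder state at step t
p_vocab = softmax(h_t @ W_out + b_out)    # distribution over the shortlist only

# A rare target word that is not in the shortlist can only be scored as <unk>.
target = "ostracism"
target_id = word2id.get(target, word2id["<unk>"])
print(f"predicting '{target}' as '{shortlist[target_id]}', p = {p_vocab[target_id]:.3f}")
```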
Results
  • Some words still need to be labeled as UNK, namely those that appear neither in the shortlist nor in the context; in experiments the authors show that learning when and where to point improves performance in machine translation and text summarization.
Conclusion
  • The authors propose a simple extension to the traditional soft attention-based shortlist softmax by using pointers over the input sequence (a toy sketch of this pointer softmax follows this list).
  • The authors observe noticeable improvements over the baselines on machine translation and summarization tasks by using pointer softmax.
  • By doing a very simple modification over the neural machine translation model, the pointer softmax is able to generalize to unseen words and deal with rare words more efficiently.
  • In the case of neural machine translation, the authors observed that training with the pointer softmax also improved the convergence speed of the model.
  • For French-to-English machine translation on the Europarl corpus, the authors observe that using the pointer softmax improves the training convergence of the model
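The pointer softmax described in the conclusion can be pictured as two softmax layers plus a switch. The numpy sketch below is a simplified reading of that idea, not the paper's exact parameterization: a location softmax over source positions (computed here with a toy bilinear attention), a shortlist softmax over the target vocabulary, and a sigmoid switch `z_t` that splits probability mass between pointing and generating. All weight matrices (`W_att`, `W_vocab`, `w_switch`) and sizes are illustrative placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d_h, d_src, src_len, vocab = 8, 8, 5, 6        # toy sizes

h_t = rng.normal(size=d_h)                     # decoder state at step t
src_annot = rng.normal(size=(src_len, d_src))  # encoder annotations of the source

# Location softmax: attention-style scores over source positions.
W_att = rng.normal(size=(d_h, d_src))
p_loc = softmax(src_annot @ (h_t @ W_att))     # distribution over source positions

# Shortlist softmax: distribution over the fixed target vocabulary.
W_vocab = rng.normal(size=(d_h, vocab))
p_vocab = softmax(h_t @ W_vocab)

# Switching network: probability of copying (pointing) vs. generating.
context = p_loc @ src_annot                    # attention-weighted source context
w_switch = rng.normal(size=d_h + d_src)
z_t = sigmoid(w_switch @ np.concatenate([h_t, context]))

# Final mixture: point into the source with prob. z_t, otherwise generate.
p_point = z_t * p_loc                          # mass assigned to source positions
p_gen = (1.0 - z_t) * p_vocab                  # mass assigned to shortlist words
assert np.isclose(p_point.sum() + p_gen.sum(), 1.0)
```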
Tables
  • Table1: Results on Gigaword Corpus when pointers are used for UNKs in the training data, using Rouge-F1 as the evaluation metric
  • Table2: Results on anonymized Gigaword Corpus when pointers are used for entities, using Rouge-F1 as the evaluation metric
  • Table3: Results on Gigaword Corpus for modeling UNKs with pointers, in terms of recall
  • Table4: Generated summaries from NMT with PS. Boldface words are the words copied from the source
  • Table5: Europarl Dataset (EN-FR)
Related work
  • The attention-based pointing mechanism was first introduced in pointer networks (Vinyals et al., 2015). In pointer networks, the output space of the target sequence is constrained to the observations in the input sequence (not the input vocabulary). Instead of a fixed-dimension softmax output layer, a softmax output of varying dimension is dynamically computed for each input sequence, in such a way as to maximize the attention probability of the target input. However, its applicability is rather limited because, unlike our model, there is no option to choose whether to point or not; it always points. In this sense, we can see pointer networks as a special case of our model in which we always choose to point to a context word.
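    As a rough illustration of this "softmax of varying dimension", the sketch below scores each encoder state against the decoder state and normalizes over input positions, so the size of the output distribution follows the input length. The bilinear scoring is a simplification of the additive attention used in pointer networks, and all tensors are toy placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_distribution(decoder_state, encoder_states, W):
    """Distribution over input positions: one score per encoder state,
    so the output dimension changes with the input sequence length."""
    scores = encoder_states @ (W @ decoder_state)
    return softmax(scores)

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4))
for length in (3, 7):                     # same model, two input lengths
    enc = rng.normal(size=(length, 4))    # toy encoder states
    dec = rng.normal(size=4)              # toy decoder state
    print(length, pointer_distribution(dec, enc, W).round(2))
```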

    Several approaches have been proposed to solve the rare and unknown word problem; they can be broadly divided into three categories. The first category focuses on improving the computation speed of the softmax output so that it can maintain a very large vocabulary. Because this only increases the shortlist size, it helps to mitigate the unknown word problem but still suffers from the rare word problem. The hierarchical softmax (Morin and Bengio, 2005), importance sampling (Bengio and Senecal, 2008; Jean et al., 2014), and noise-contrastive estimation (Gutmann and Hyvarinen, 2012; Mnih and Kavukcuoglu, 2013) methods fall into this class.
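    As one concrete example of this first category, the toy sketch below approximates the softmax normalizer with importance sampling under a uniform proposal, in the spirit of the importance-sampling approaches cited above; real systems use carefully chosen proposals and target-vocabulary subsets, and plugging the estimate of Z into the log is biased, so this is only an illustration.

```python
import numpy as np

def logsumexp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def exact_log_prob(scores, target):
    """Exact log p(target) under a full softmax over the whole vocabulary."""
    return scores[target] - logsumexp(scores)

def sampled_log_prob(scores, target, num_samples, rng):
    """Importance-sampling estimate with a uniform proposal q(w) = 1/V:
    Z is approximated by (V / k) * sum_i exp(score of the i-th sampled word)."""
    vocab_size = scores.shape[0]
    samples = rng.integers(0, vocab_size, size=num_samples)
    z_hat = (vocab_size / num_samples) * np.exp(scores[samples]).sum()
    return scores[target] - np.log(z_hat)

rng = np.random.default_rng(2)
scores = rng.normal(size=50_000)   # toy output scores for a 50k-word vocabulary
target = 123
print(exact_log_prob(scores, target), sampled_log_prob(scores, target, 512, rng))
```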
Funding
  • We acknowledge the support of the following organizations for research funding and computing support: NSERC, Samsung, Calcul Quebec, Compute Canada, the Canada Research Chairs and CIFAR
  • One of the authors also thanks IBM Watson Research for funding this research during his internship between October 2015 and January 2016
Reference
  • [Bahdanau et al. 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
  • [Hermann et al. 2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692.
  • [Bengio and Senecal 2008] Yoshua Bengio and Jean-Sebastien Senecal. 2008. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722.
  • [Bordes et al. 2015] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.
  • [Cheng and Lapata 2016] Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of ACL.
  • [Jean et al. 2014] Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.
  • [Kingma and Ba 2015] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • [Luong et al. 2015] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of ACL.
  • [Matthews et al. 2012] Danielle Matthews, Tanya Behne, Elena Lieven, and Michael Tomasello. 2012. Origins of the human pointing gesture: a training study. Developmental Science, 15(6):817–829.
  • [Mnih and Kavukcuoglu 2013] Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273.
  • [Morin and Bengio 2005] Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pages 246–252.
  • [Pascanu et al. 2012] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063.
  • [Pascanu et al. 2013] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.
  • [Rush et al. 2015] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. CoRR, abs/1509.00685.
  • [Schuster and Paliwal 1997] Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.
  • [Sennrich et al. 2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  • [Theano Development Team 2016] Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688.
  • [Tomasello et al. 2007] Michael Tomasello, Malinda Carpenter, and Ulf Liszkowski. 2007. A new look at infant pointing. Child Development, 78(3):705–722.
  • [Vinyals et al. 2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2674–2682.
  • [Zeiler 2012] Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.