Few-Shot Representation Learning for Out-Of-Vocabulary Words

ACL (1), pp. 4102-4112, 2019.

DOI: https://doi.org/10.18653/v1/p19-1402

Abstract:

Existing approaches for learning word embeddings often assume there are sufficient occurrences for each word in the corpus, such that the representation of words can be accurately estimated from their contexts. However, in real-world scenarios, out-of-vocabulary (a.k.a. OOV) words that do not appear in the training corpus emerge frequently. ...

Introduction
  • Distributional word embedding models aim to assign each word a low-dimensional vector representing its semantic meaning
  • These embedding models have been used as key components in natural language processing systems.
  • To learn such embeddings, existing approaches such as skip-gram models (Mikolov et al, 2013) resort to an auxiliary task of predicting the context words.
  • Such an auxiliary task requires sufficient occurrences of each word, which OOV words lack. This leads the authors to the following research problem: how can accurate embedding vectors for OOV words be learned at inference time by observing their usage only a few times? (A minimal sketch of this few-shot formulation follows this list.)
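The few-shot formulation referenced above can be made concrete with a small episode-construction routine: a frequent word with a known (oracle) embedding is masked in K sampled sentences, and an encoder is trained to regress from those masked contexts back to the oracle vector. The sketch below is illustrative only; `sentences_by_word`, `oracle_embeddings`, and the `<oov>` mask token are assumed names, not identifiers from the released code.

```python
import random

def build_episode(word, sentences_by_word, oracle_embeddings, k=6, mask="<oov>"):
    """Sample a K-shot episode: K masked context sentences plus the oracle
    embedding of the target word, which serves as the regression label."""
    contexts = random.sample(sentences_by_word[word], k)
    masked = [[mask if tok == word else tok for tok in sent] for sent in contexts]
    target = oracle_embeddings[word]  # vector the encoder should learn to predict
    return masked, target

# Training minimizes a distance (e.g., cosine distance) between the encoder's
# prediction on `masked` and `target`, over many episodes drawn from frequent words.
```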
Highlights
  • Distributional word embedding models aim to assign each word a low-dimensional vector representing its semantic meaning
  • OOV words may occur in a new corpus whose domain or linguistic usage differs from the main training corpus. To deal with this issue, we propose to adopt Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) to assist the fast and robust adaptation of a pre-trained hierarchical context encoder, which allows the hierarchical context encoder (HiCE) to better infer the embeddings of OOV words in a new domain by starting from a promising initialization (a minimal adaptation sketch follows this list)
  • Part-of-Speech Tagging: besides named entity recognition, we evaluate the syntactic information encoded by the hierarchical context encoder through the lens of part-of-speech (POS) tagging, a standard task whose goal is to identify the grammatical category of each word
  • All the systems perform worse on Rare-NER than on Bio-NER, while the hierarchical context encoder achieves the largest improvement over all the other baselines
  • The experiment demonstrates that the hierarchical context encoder trained on DT already leverages general language knowledge that can be transferred across domains, and that adaptation with Model-Agnostic Meta-Learning (MAML) can further reduce the domain gap and enhance performance
  • We further adopt MAML for fast and robust adaptation to mitigate the semantic gap between corpora. Experiments on both a benchmark corpus and downstream tasks demonstrate the superiority of HiCE over existing approaches
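As noted in the highlights, adaptation to a new domain uses MAML (Finn et al., 2017). A minimal first-order sketch of such an adaptation loop is given below, assuming `encoder` is a pre-trained HiCE-like model and each episode provides a support and a query set of (contexts, oracle embedding) pairs; all names and hyperparameters are illustrative assumptions, not the authors' released code.

```python
import copy
import torch
import torch.nn.functional as F

def maml_adapt(encoder, episodes, inner_lr=1e-2, outer_lr=1e-4, inner_steps=1):
    """First-order MAML sketch: adapt a copy of the encoder on each episode's
    support set, then update the shared initialization with gradients of the
    query-set loss measured on the adapted copy."""
    meta_opt = torch.optim.Adam(encoder.parameters(), lr=outer_lr)
    for support, query in episodes:
        fast = copy.deepcopy(encoder)                      # task-specific copy
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                       # inner-loop adaptation
            loss = 1 - F.cosine_similarity(
                fast(support.contexts), support.oracle, dim=-1).mean()
            inner_opt.zero_grad(); loss.backward(); inner_opt.step()
        meta_loss = 1 - F.cosine_similarity(               # outer-loop objective
            fast(query.contexts), query.oracle, dim=-1).mean()
        fast.zero_grad()
        meta_loss.backward()
        # first-order approximation: apply the adapted copy's gradients
        # to the shared initialization
        for shared, adapted in zip(encoder.parameters(), fast.parameters()):
            shared.grad = None if adapted.grad is None else adapted.grad.clone()
        meta_opt.step()
```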
Methods
  • The authors compare HiCE with the following baseline models for learning OOV word embeddings.

    Word2Vec: The local updating algorithm of Word2Vec.
  • The full comparison covers the baselines Word2Vec, FastText, Additive, Additive (no stop words), nonce2vec, and à la carte; the proposed variants HiCE w/o Morph, HiCE + Morph, HiCE + Morph + Fine-tune, and HiCE + Morph + MAML; and the upper bound Oracle Embedding, each evaluated in the 2-shot, 4-shot, and 6-shot settings (i.e., with 2, 4, or 6 context sentences per OOV word).
  • To evaluate the performance of a learned embedding, Spearman correlation is used, as in Lazaridou et al. (2017), to measure the agreement between the human annotations and the machine-generated results
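For the Chimera benchmark evaluation mentioned above, each OOV (nonce) word comes with a set of probe words and human similarity ratings; the score is the Spearman correlation between those ratings and the cosine similarities computed with the inferred embedding. A minimal sketch with hypothetical variable names:

```python
import numpy as np
from scipy.stats import spearmanr

def chimera_score(inferred_vec, probe_vecs, human_ratings):
    """Spearman correlation between human similarity ratings and the cosine
    similarities of the inferred OOV embedding to the probe-word embeddings."""
    cos = probe_vecs @ inferred_vec / (
        np.linalg.norm(probe_vecs, axis=1) * np.linalg.norm(inferred_vec) + 1e-8)
    rho, _ = spearmanr(cos, human_ratings)
    return rho
```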
Results
  • Experimental results: Table 1 lists the performance of HiCE and the baselines with different numbers of context sentences.
  • As is shown, when the number of context sentences (K) is relatively large (i.e., K = 6), the performance of HiCE is on a par with the upper bound (Oracle Embedding) and the relative performance difference is merely 2.7%.
  • This indicates the significance of using an advanced aggregation model.
  • The experiment demonstrates that HiCE trained on DT is already able to leverage general language knowledge that can be transferred across domains, and that adaptation with MAML can further reduce the domain gap and enhance the performance
Conclusion
  • The authors studied the problem of learning accurate embeddings for out-of-vocabulary (OOV) words and augmenting them into a pre-trained embedding space from only a few observations.
  • The authors formulated the problem as a K-shot regression problem and proposed a hierarchical context encoder (HiCE) architecture that learns to predict the oracle OOV embedding by aggregating only K contexts and morphological features (a simplified architecture sketch follows this section).
  • The authors further adopt MAML for fast and robust adaptation to mitigate the semantic gap between corpora.
  • Experiments on both a benchmark corpus and downstream tasks demonstrate the superiority of HiCE over existing approaches
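The hierarchical design summarized in the conclusion can be approximated with off-the-shelf self-attention blocks: one encoder reads each context sentence, a second aggregates the K resulting sentence vectors, and a character-level encoder contributes morphological features. The sketch below is a simplified stand-in (layer sizes, mean pooling, and the use of `nn.TransformerEncoder` are assumptions), not the authors' implementation.

```python
import torch
import torch.nn as nn

class HiCESketch(nn.Module):
    """Simplified hierarchical context encoder: per-sentence self-attention,
    aggregation across the K contexts, and a character-level morphology branch."""
    def __init__(self, vocab_size, char_vocab_size, dim=300, heads=6, layers=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.char_emb = nn.Embedding(char_vocab_size, dim)
        def block():
            layer = nn.TransformerEncoderLayer(
                dim, heads, dim_feedforward=2 * dim, batch_first=True)
            return nn.TransformerEncoder(layer, layers)
        self.context_enc = block()   # encodes each context sentence
        self.aggregator = block()    # attends over the K context vectors
        self.morph_enc = block()     # encodes the character sequence
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, contexts, chars):
        # contexts: (K, seq_len) token ids; chars: (char_len,) character ids
        ctx = self.context_enc(self.word_emb(contexts)).mean(dim=1)         # (K, dim)
        agg = self.aggregator(ctx.unsqueeze(0)).mean(dim=1).squeeze(0)      # (dim,)
        morph = self.morph_enc(
            self.char_emb(chars).unsqueeze(0)).mean(dim=1).squeeze(0)       # (dim,)
        return self.out(torch.cat([agg, morph], dim=-1))  # predicted OOV embedding
```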
Tables
  • Table1: Performance on the Chimera benchmark dataset with different numbers of context sentences, which is measured by Spearman correlation. Baseline results are from the corresponding papers
  • Table2: Performance on Named Entity Recognition and Part-of-Speech Tagging tasks. All methods are evaluated on test data containing OOV words. Results demonstrate that the proposed approach, HiCE + Morph + MAML, improves the downstream model by learning better representations for OOV words
  • Table3: For each OOV word in the Chimera benchmark, the embedding is inferred using each method, and the top-5 words whose embeddings are most similar to the inferred embedding are shown. HiCE finds the words with the most similar semantics
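The qualitative comparison in Table 3 amounts to a nearest-neighbor lookup in the pre-trained embedding space: infer the OOV embedding, then rank vocabulary words by cosine similarity. A minimal sketch (variable names are hypothetical):

```python
import numpy as np

def top_k_neighbors(inferred_vec, emb_matrix, vocab, k=5):
    """Return the k vocabulary words whose pre-trained embeddings are most
    similar (by cosine) to the inferred OOV embedding."""
    sims = emb_matrix @ inferred_vec / (
        np.linalg.norm(emb_matrix, axis=1) * np.linalg.norm(inferred_vec) + 1e-8)
    return [vocab[i] for i in np.argsort(-sims)[:k]]
```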
Related work
  • OOV Word Embedding Previous studies of handling OOV words were mainly based on two types of information: 1) context information and 2) morphology features.

    The first family of approaches follows the distributional hypothesis (Firth, 1957) to infer the meaning of a target word from its context. If sufficient observations are given, simply applying existing word embedding techniques (e.g., word2vec) can already embed OOV words. However, in real scenarios, an OOV word usually occurs only a very limited number of times in the new corpus, which hinders the quality of the updated embedding (Lazaridou et al., 2017; Herbelot and Baroni, 2017). Several alternatives have been proposed in the literature. Lazaridou et al. (2017) proposed an additive method that uses the average embedding of the context words as the embedding of the target word. Herbelot and Baroni (2017) extended the skip-gram model into nonce2vec, initializing with the additive embedding and using a higher learning rate and a larger window size. Khodak et al. (2018) introduced à la carte, which augments the additive method with a linear transformation of the context embeddings.
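The additive baseline described above can be written in a few lines, and the à la carte variant of Khodak et al. (2018) applies a corpus-level linear transformation to the same average. The sketch below assumes a dict of pre-trained vectors, an optional stop-word set, and a pre-computed transformation matrix `A`; these names are illustrative.

```python
import numpy as np

def additive_embedding(context_sentences, embeddings, stop_words=frozenset()):
    """Average the pre-trained embeddings of all (non-stop-word) tokens observed
    in the OOV word's context sentences (Lazaridou et al., 2017)."""
    vecs = [embeddings[tok] for sent in context_sentences for tok in sent
            if tok in embeddings and tok not in stop_words]
    return np.mean(vecs, axis=0)

def a_la_carte_embedding(context_sentences, embeddings, A):
    """A la carte (Khodak et al., 2018): apply a linear transformation learned
    on the original corpus to the additive estimate."""
    return A @ additive_embedding(context_sentences, embeddings)
```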
Funding
  • This work is partially supported by NSF RI-1760523, NSF III-1705169, NSF CAREER Award 1741634, and Amazon Research Award
Reference
  • Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR, abs/1607.06450.
  • Yoshua Bengio. 2011. Deep learning of representations for unsupervised and transfer learning. In Unsupervised and Transfer Learning - Workshop held at ICML 2011, Bellevue, Washington, USA, July 2, 2011, pages 17–36.
  • Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL, 5:135–146.
  • Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. 2018. Meta-learning for lowresource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 3622–3631. Association for Computational Linguistics.
  • William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • Trevor Cohn, Steven Bird, Graham Neubig, Oliver Adams, and Adam J. Makarucha. 2017. Crosslingual word embeddings for low-resource language modeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pages 937– 947. Association for Computational Linguistics.
  • Nigel Collier and Jin-Dong Kim. 2004. Introduction to the bio-entity recognition task at JNLPBA. In NLPBA/BioNLP.
  • Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 shared task on novel and emerging entity recognition. In NUT@EMNLP, pages 140–147. Association for Computational Linguistics.
  • Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. Decaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 647–655.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1126–1135.
  • John R. Firth. 1957. A synopsis of linguistic theory, 1930-1955. Studies in Linguistic Analysis.
  • Tianyu Gao, Xu Han, Zhiyuan Liu, and Maosong Sun. 2019. Hybrid attention-based prototypical networks for noisy few-shot relation classification. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-19), New York, USA, April 15-18, 2019.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778. IEEE Computer Society.
  • Aurelie Herbelot and Marco Baroni. 2017. High-risk learning: acquiring new word vectors from tiny data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 304–309. Association for Computational Linguistics.
  • Mikhail Khodak, Nikunj Saunshi, Yingyu Liang, Tengyu Ma, Brandon Stewart, and Sanjeev Arora. 2018. A la carte embedding: Cheap but effective induction of semantic feature vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 12–22. Association for Computational Linguistics.
  • Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 2741–2749.
  • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In HLT-NAACL, pages 260–270. The Association for Computational Linguistics.
  • Angeliki Lazaridou, Marco Marelli, and Marco Baroni. 2017. Multimodal word meaning induction from minimal exposure to natural text. Cognitive Science.
  • Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, CoNLL 2013, Sofia, Bulgaria, August 8-9, 2013, pages 104–113. ACL.
  • Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In ICLR’17.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT, pages 2227–2237. Association for Computational Linguistics.
  • Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. 2017. Mimicking word embeddings using subword rnns. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 102–112. Association for Computational Linguistics.
  • Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018. Diverse few-shot text classification with multiple metrics. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1206–1215.
  • Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1568–1575.
  • Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In ICLR.
  • Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524–1534. Association for Computational Linguistics.
  • Prathusha K. Sarma, Yingyu Liang, and Bill Sethares. 2018. Domain adapted word embeddings for improved sentiment classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 37–42.
  • Jake Snell, Kevin Swersky, and Richard S. Zemel. 2017. Prototypical networks for few-shot learning. In NIPS, pages 4080–4090.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 6000–6010.
  • Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3630–3638.
  • Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2018. One-shot relational learning for knowledge graphs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1980–1990.