Moving Down the Long Tail of Word Sense Disambiguation with Gloss-Informed Biencoders

arXiv, 2020.


Abstract:

A major obstacle in Word Sense Disambiguation (WSD) is that word senses are not uniformly distributed, causing existing models to generally perform poorly on senses that are either rare or unseen during training. We propose a bi-encoder model that independently embeds (1) the target word with its surrounding context and (2) the dictionary definition, or gloss, of each word sense.

Introduction
  • One of the major challenges of Word Sense Disambiguation (WSD) is overcoming the data sparsity that stems from the Zipfian distribution of senses in natural language (Kilgarriff, 2004).
  • In SemCor, 90% of the mentions of the word plant correspond to its top two senses, and only half of the ten senses of plant occur in the dataset at all (Miller et al., 1993)
  • Due to this data imbalance, many WSD systems show a strong bias towards predicting the most frequent sense (MFS) of a word regardless of the surrounding context (Postma et al., 2016). A sketch of such an MFS baseline follows this list.
  • Other neural approaches used semi-supervised learning to augment the learned representations with additional data (Melamud et al., 2016; Yuan et al., 2016)
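
To make the MFS bias concrete, the sketch below builds a most-frequent-sense baseline by counting sense annotations in training data such as SemCor; the function name and the (lemma, sense_key) data layout are illustrative assumptions, not taken from the paper.

```python
from collections import Counter, defaultdict

def build_mfs_baseline(training_examples):
    """Build a most-frequent-sense (MFS) lookup from sense-annotated data
    such as SemCor. `training_examples` is assumed to be an iterable of
    (lemma, sense_key) pairs; both names are illustrative."""
    counts = defaultdict(Counter)
    for lemma, sense_key in training_examples:
        counts[lemma][sense_key] += 1
    # The baseline ignores context entirely and always predicts the top sense.
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

# e.g., an MFS baseline built from SemCor would label every mention of "plant"
# with whichever of its senses is annotated most often there.
```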
Highlights
  • One of the major challenges of Word Sense Disambiguation (WSD) is overcoming the data sparsity that stems from the Zipfian distribution of senses in natural language (Kilgarriff, 2004)
  • We address the issue of Word Sense Disambiguation systems underperforming on uncommon senses of words
  • We present a bi-encoder model (BEM) that maps senses and ambiguous words into the same embedding space by jointly optimizing the context and gloss encoders
  • The bi-encoder model disambiguates the sense of each word by assigning it the label of the nearest sense embedding (see the scoring sketch after this list). This approach leads to a 31.1% error reduction over prior work on the less frequent sense examples
  • We still see a large gap in performance between most frequent sense and less frequent sense examples, with our model performing over 40 points better on the most frequent sense subset
  • Most recent Word Sense Disambiguation systems show a similar trend: even the representations of frozen BERT-base that are not fine-tuned on Word Sense Disambiguation can achieve over 94 F1 on examples labeled with the most frequent sense
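
To make the nearest-sense assignment above concrete, here is a minimal inference sketch that scores a target word's context embedding against the gloss embeddings of its candidate senses and returns the best-scoring label; the dot-product scoring and tensor shapes are assumptions of this sketch rather than a description of the released system.

```python
import torch

def nearest_sense(target_emb, sense_embs, sense_labels):
    """target_emb:   (d,) context-encoder vector for the ambiguous word.
    sense_embs:   (k, d) gloss-encoder vectors for its k candidate senses.
    sense_labels: list of k sense identifiers (e.g., WordNet sense keys)."""
    scores = sense_embs @ target_emb          # (k,) dot-product similarities
    return sense_labels[int(torch.argmax(scores))]
```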
Methods
  • The authors present an approach for WSD that is designed to more accurately model less frequent senses by better leveraging the glosses that define them.
  • The authors' bi-encoder model (BEM) consists of two independent encoders: (1) a context encoder, which represents the target word within its surrounding context, and (2) a gloss encoder, which embeds the definition text for each word sense
  • These encoders are trained to embed each token near the representation of its correct word sense (a minimal training sketch follows this list).
  • The authors formally define the task of WSD (Section 3.1), and present the BEM system in detail (Section 3.2)
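
As a rough illustration of this setup, the sketch below pairs two bert-base-uncased encoders and trains them with a cross-entropy loss over the target word's candidate senses, scoring each candidate by the dot product between the target word's contextual representation and the first-token ([CLS]) representation of its gloss. This is a minimal sketch, not the authors' released implementation: subword handling, batching, and the exact pooling choices are simplified assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

# Hypothetical sketch: both encoders start from bert-base-uncased and are trained
# jointly so that each target word lands near the gloss of its correct sense.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
context_encoder = BertModel.from_pretrained("bert-base-uncased")
gloss_encoder = BertModel.from_pretrained("bert-base-uncased")

def encode_glosses(glosses):
    # One vector per candidate sense, taken from the first ([CLS]) position.
    batch = tokenizer(glosses, padding=True, return_tensors="pt")
    return gloss_encoder(**batch).last_hidden_state[:, 0]               # (k, d)

def encode_target(sentence, target_index):
    # Contextual representation of the ambiguous word (assumes target_index is
    # the position of its subword token after tokenization).
    batch = tokenizer(sentence, return_tensors="pt")
    return context_encoder(**batch).last_hidden_state[0, target_index]  # (d,)

def sense_loss(sentence, target_index, candidate_glosses, gold_index):
    # Dot-product score against each candidate sense, then cross-entropy
    # restricted to the target word's candidate senses.
    scores = encode_glosses(candidate_glosses) @ encode_target(sentence, target_index)
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([gold_index]))
```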
Results
  • The authors find that the BEM achieves the best F1 score on the aggregated ALL evaluation set, outperforming all baselines and prior work by at least 2 F1 points.
  • This improvement holds across all of the evaluation sets in the WSD evaluation framework as well as for each part-of-speech on which the authors perform WSD.
  • Although many of the prior approaches considered also build on pretrained models, the authors empirically observe that the bi-encoder model is a strong method for leveraging BERT
Conclusion
  • The authors address the issue of WSD systems underperforming on uncommon senses of words.
  • The BEM disambiguates the sense of each word by assigning it the label of the nearest sense embedding.
  • This approach leads to a 31.1% error reduction over prior work on the less frequent sense examples.
  • Most recent WSD systems show a similar trend: even the representations of frozen BERT-base that are not fine-tuned on WSD can achieve over 94 F1 on examples labeled with the most frequent sense
Tables
  • Table 1: F1-score (%) on the English all-words WSD task. ALL is the concatenation of all datasets, including the development set SE07. We compare our bi-encoder model (BEM) against the WordNet S1 and most frequent sense (MFS) baselines, as well as a frozen BERT-base classifier and recent prior work on this task
  • Table 2: F1-score (%) on the MFS, LFS, and zero-shot subsets of the ALL evaluation set. Zero-shot examples are the words and senses (respectively) that do not occur in the training data (a splitting sketch follows these captions). The balanced BEM system, BEM-bal, is considered in Section 6.2
  • Table 3: Ablations on the bi-encoder model (BEM). We consider the effect of freezing each of the two encoders and of tying the parameters of the encoders on development set performance
  • Table 4: Performance of various pretrained encoders on the WSD development set
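
For intuition, the following sketch shows roughly how evaluation examples could be partitioned into the MFS, LFS, and zero-shot subsets described for Table 2 using sense counts from the training data; the example fields (`lemma`, `gold_sense`) and the function name are hypothetical.

```python
def split_eval_examples(eval_examples, train_sense_counts):
    """train_sense_counts: {lemma: {sense_key: count}} gathered from the training data.
    Each eval example is assumed to expose `.lemma` and `.gold_sense` (illustrative)."""
    mfs, lfs, zero_shot_words, zero_shot_senses = [], [], [], []
    for ex in eval_examples:
        seen = train_sense_counts.get(ex.lemma, {})
        if not seen:
            zero_shot_words.append(ex)      # word never annotated in training
        if ex.gold_sense not in seen:
            zero_shot_senses.append(ex)     # gold sense never seen in training
        if seen and ex.gold_sense == max(seen, key=seen.get):
            mfs.append(ex)                  # labeled with the most frequent sense
        else:
            lfs.append(ex)                  # labeled with a less frequent sense
    return mfs, lfs, zero_shot_words, zero_shot_senses
```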
Funding
  • This material is based on work conducted at the University of Washington, which was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1762114
Reference
  • Dzmitry Bahdanau, Tom Bosc, Stanisław Jastrzebski, Edward Grefenstette, Pascal Vincent, and Yoshua Bengio. 2017. Learning to compute word embeddings on the fly. arXiv preprint arXiv:1706.00286.
  • Satanjeev Banerjee and Ted Pedersen. 2003. Extended gloss overlaps as a measure of semantic relatedness. In IJCAI.
  • Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. 2014. An enhanced lesk word sense disambiguation algorithm through a distributional semantic model. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1591–1600.
  • Tom Bosc and Pascal Vincent. 2018. Auto-encoding dictionary definitions into consistent word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1522–1532.
  • Claudio Delli Bovi, Luis Espinosa Anke, and Roberto Navigli. 2015. Knowledge base unification via sense embeddings and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 726–736.
  • Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems, pages 737–744.
  • Massimiliano Ciaramita and Yasemin Altun. 2006. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 594–602, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, and Martin Wattenberg. 2019. Visualizing and measuring the geometry of BERT. arXiv preprint arXiv:1906.02715.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung Gan. 2019. Improved word sense disambiguation using pre-trained contextualized word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5300–5309.
  • Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. GlossBERT: BERT for word sense disambiguation with gloss knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3500–3505.
  • Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969.
  • Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 897–907.
  • Mikael Kågebäck and Hans Salomonsson. 2016. Word sense disambiguation using a bidirectional lstm. arXiv preprint arXiv:1606.03568.
  • Adam Kilgarriff. 2004. How dominant is the commonest sense of a word? In International conference on text, speech and dialogue, pages 103–111. Springer.
  • Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.
  • Sawan Kumar, Sharmistha Jat, Karan Saxena, and Partha Talukdar. 2019. Zero-shot word sense disambiguation using sense definition embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5670–5681, Florence, Italy. Association for Computational Linguistics.
  • Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation, pages 24–26. ACM.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Daniel Loureiro and Alípio Jorge. 2019. Language modelling makes sense: Propagating representations through WordNet for full-coverage word sense disambiguation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5682–5691, Florence, Italy. Association for Computational Linguistics.
  • Fuli Luo, Tianyu Liu, Zexue He, Qiaolin Xia, Zhifang Sui, and Baobao Chang. 2018a. Leveraging gloss knowledge in neural word sense disambiguation by hierarchical co-attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1402–1411, Brussels, Belgium. Association for Computational Linguistics.
  • Fuli Luo, Tianyu Liu, Qiaolin Xia, Baobao Chang, and Zhifang Sui. 2018b. Incorporating glosses into neural word sense disambiguation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2473–2482, Melbourne, Australia. Association for Computational Linguistics.
  • Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional lstm. In Proceedings of the 20th SIGNLL conference on computational natural language learning, pages 51–61.
  • George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.
  • George A Miller, Claudia Leacock, Randee Tengi, and Ross T Bunker. 1993. A semantic concordance. In Proceedings of the workshop on Human Language Technology, pages 303–308. Association for Computational Linguistics.
  • Andrea Moro and Roberto Navigli. 2015. Semeval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pages 288–297.
  • Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM computing surveys (CSUR), 41(2):10.
  • Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. Semeval-2013 task 12: Multilingual word sense disambiguation. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 222–231.
  • Steven Neale, Luís Gomes, Eneko Agirre, Oier Lopez de Lacalle, and António Branco. 2016. Word sense-aware machine translation: Including senses as contextual features for improved translation models. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2777–2783, Portorož, Slovenia. European Language Resources Association (ELRA).
  • Martha Palmer, Christiane Fellbaum, Scott Cotton, Lauren Delfs, and Hoa Trang Dang. 2001. English tasks: All-words and verb lexical sample. In Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 21–24.
  • Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the North American Association for Computational Linguistics (NAACL).
  • Marten Postma, Ruben Izquierdo Bevia, and Piek Vossen. 2016. More is not always better: balancing sense distributions for all-words word sense disambiguation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3496–3506.
  • Sameer Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. Semeval-2007 task-17: English lexical sample, srl and all words. In Proceedings of the fourth international workshop on semantic evaluations (SemEval-2007), pages 87–92.
  • Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017a. Neural sequence learning models for word sense disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1156–1167.
  • Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017b. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 99–110.
  • Annette Rios Gonzales, Laura Mascarell, and Rico Sennrich. 2017. Improving word sense disambiguation in neural machine translation with sense embeddings. In Proceedings of the Second Conference on Machine Translation, pages 11–19, Copenhagen, Denmark. Association for Computational Linguistics.
  • Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1793–1803, Beijing, China. Association for Computational Linguistics.
  • Hui Shen, Razvan Bunescu, and Rada Mihalcea. 2013. Coarse to fine grained sense disambiguation in wikipedia. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 22–31.
  • Benjamin Snyder and Martha Palmer. 2004. The english all-words task. In Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text.
  • Gabriel Stanovsky and Mark Hopkins. 2018. Spot the odd man out: Exploring the associative power of lexical resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1533–1542.
  • Loïc Vial, Benjamin Lecouteux, and Didier Schwab. 2019. Sense vocabulary compression through the semantic knowledge of wordnet for neural word sense disambiguation. In Proceedings of the 10th Global WordNet Conference (GWC).
  • David Vickrey, Luke Biewald, Marc Teyssier, and Daphne Koller. 2005. Word-sense disambiguation for machine translation. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, pages 771–778, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • We present additional sense embedding space visualizations (Figures 4, 5, and 6). These visualizations are generated identically to the one discussed in Section 6.3. In each figure, the left visualization shows the representations output by a frozen BERT-base model, and the right one shows the output of our BEM's context encoder. All figures are visualized with t-SNE. We choose words from SemCor that occur more than 50 times; for clarity, we limit the visualization to the six most common senses of each word. All senses and glosses are gathered from WordNet (Miller, 1995).
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
  • Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. Semi-supervised word sense disambiguation with neural models. arXiv preprint arXiv:1603.07012.
  • Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 system demonstrations, pages 78–83.
  • Both our frozen BERT baseline and the BEM are implemented in PyTorch and optimized with Adam (Kingma and Ba, 2015). The pretrained models used to initialize each model are obtained through Wolf et al. (2019); we initialize every model with the bert-base-uncased encoder (a minimal setup sketch follows below).
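
As a minimal sketch of that setup (the hyperparameter value below is a placeholder, not the paper's), the two encoders can be loaded from the bert-base-uncased checkpoint via the transformers library of Wolf et al. (2019) and their parameters optimized jointly with Adam:

```python
import torch
from transformers import BertModel

# Initialize both BEM encoders from the same pretrained checkpoint.
context_encoder = BertModel.from_pretrained("bert-base-uncased")
gloss_encoder = BertModel.from_pretrained("bert-base-uncased")

# Jointly optimize the parameters of both encoders with Adam.
optimizer = torch.optim.Adam(
    list(context_encoder.parameters()) + list(gloss_encoder.parameters()),
    lr=1e-5,  # placeholder learning rate, not taken from the paper
)
```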