Finding predominant word senses in untagged text

ACL, pp. 279–286, 2004

Abstract

In word sense disambiguation (WSD), the heuristic of choosing the most common sense is extremely powerful because the distribution of the senses of a word is often skewed. The problem with using the predominant, or first sense heuristic, aside from the fact that it does not take surrounding context into account, is that it assumes some quantity of hand-tagged data.

Introduction
  • The first sense heuristic, which is often used as a baseline for supervised WSD systems, outperforms many of those systems even though they take surrounding context into account (a minimal sketch of this baseline appears after this list)
  • This is shown by the results of the English all-words task in SENSEVAL-2 (Cotton et al., 1998) in figure 1 of the paper, where the first sense is that listed in WordNet for the PoS given by the Penn TreeBank (Palmer et al., 2001).
  • Whilst a first sense heuristic based on a sense-tagged corpus such as SemCor is clearly useful, there is a strong case for obtaining a first, or predominant, sense from untagged corpus data so that a WSD system can be tuned to the genre or domain at hand
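The first sense heuristic itself is straightforward: tag every noun token with the sense listed first in WordNet. A minimal sketch, assuming NLTK and its WordNet data are installed (the paper does not prescribe any particular toolkit):

```python
# Minimal sketch of the first-sense baseline: pick the first WordNet noun sense.
# Assumes NLTK and the 'wordnet' corpus are installed.
from nltk.corpus import wordnet as wn

def first_sense(noun):
    """Return the first WordNet noun synset for a word, or None if it has none."""
    synsets = wn.synsets(noun, pos=wn.NOUN)
    return synsets[0] if synsets else None

if __name__ == "__main__":
    for word in ["bank", "plant", "star"]:
        sense = first_sense(word)
        if sense is None:
            print(word, "-> no noun sense")
        else:
            print(word, "->", sense.name(), "-", sense.definition())
```

The question the paper addresses is where this "first" sense should come from when no hand-tagged corpus such as SemCor is available for the genre or domain at hand.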
Highlights
  • The first sense heuristic, which is often used as a baseline for supervised word sense disambiguation (WSD) systems, outperforms many of those systems even though they take surrounding context into account
  • Whilst a first sense heuristic based on a sense-tagged corpus such as SemCor is clearly useful, there is a strong case for obtaining a first, or predominant, sense from untagged corpus data so that a WSD system can be tuned to the genre or domain at hand
  • We have devised a method that uses raw corpus data to automatically find a predominant sense for nouns in WordNet
  • The automatically acquired predominant senses were evaluated against the hand-tagged resources SemCor and the SENSEVAL-2 English all-words task giving us a WSD precision of 64% on an all-nouns task
  • The merit of our technique is the very possibility of obtaining predominant senses from the data at hand
  • We have demonstrated the possibility of finding predominant senses in domain specific corpora on a sample of nouns
Methods
  • In order to find the predominant sense of a target word, the authors use a thesaurus acquired from automatically parsed text based on the method of Lin (1998). This provides the nearest neighbours to each target word, along with the distributional similarity score between the target word and its neighbour.
  • The neighbours ordered from the thesaurus, with their associated distributional similarity scores, are then used to rank the WordNet senses of the target word $w$: each sense $ws_i \in senses(w)$ receives the prevalence score

    $\text{Prevalence}(ws_i) = \sum_{n_j \in N_w} dss(w, n_j) \times \dfrac{wnss(ws_i, n_j)}{\sum_{ws_{i'} \in senses(w)} wnss(ws_{i'}, n_j)}$

    where $N_w$ is the set of the $k$ nearest neighbours of $w$, $dss(w, n_j)$ is the distributional similarity between $w$ and its neighbour $n_j$, and $wnss(ws_i, n_j)$ is the maximum WordNet similarity between $ws_i$ and any sense of $n_j$. A code sketch of this computation follows this list.
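A minimal sketch of this ranking, assuming NLTK with the WordNet and precompiled information-content (wordnet_ic) data packages; jcn stands in for the WordNet similarity measure, and the neighbour words with their dss scores are placeholder values, since building the Lin-style thesaurus from parsed text is not shown:

```python
# Sketch: rank the WordNet noun senses of a target word by the prevalence score
# above, using jcn as the WordNet similarity (wnss).
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Precompiled information-content counts; 'ic-bnc.dat' is assumed to be among
# the files bundled with NLTK's wordnet_ic corpus (swap in another IC file if not).
ic = wordnet_ic.ic("ic-bnc.dat")

def wnss(sense, neighbour):
    """Max jcn similarity between a candidate sense and any noun sense of a neighbour word."""
    sims = [sense.jcn_similarity(ns, ic) for ns in wn.synsets(neighbour, pos=wn.NOUN)]
    return max(sims, default=0.0)

def prevalence_scores(word, neighbours):
    """neighbours: [(neighbour_word, dss), ...] taken from a distributional thesaurus."""
    senses = wn.synsets(word, pos=wn.NOUN)
    scores = {sense: 0.0 for sense in senses}
    for neighbour, dss in neighbours:
        sims = {sense: wnss(sense, neighbour) for sense in senses}
        denom = sum(sims.values())
        if denom == 0:
            continue  # this neighbour says nothing about the word's senses
        for sense in senses:
            # Each neighbour contributes its dss weight, split across the target's
            # senses in proportion to their WordNet similarity to that neighbour.
            scores[sense] += dss * sims[sense] / denom
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy neighbours and dss values, illustrative only (not from the paper's thesaurus).
for sense, score in prevalence_scores("plant", [("factory", 0.28), ("flower", 0.22)]):
    print(f"{score:.3f}  {sense.name():18s}  {sense.definition()}")
```

The sense with the highest score is taken as the predominant sense; because the thesaurus is built from the corpus at hand, the same procedure can be re-run on domain-specific text to obtain domain-tuned rankings.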
Results
  • The results in table 1 show the accuracy of the ranking with respect to SemCor over the entire set of 2595 polysemous nouns in SemCor, for both the jcn and lesk WordNet similarity measures. (The authors repeated the experiment with the BNC data for jcn using different numbers of neighbours; this gave only minimal changes to the results, so those runs are not reported here.)

    The random baseline for choosing the predominant sense over all these words is 32%, and both WordNet similarity measures beat this baseline; the random baseline for the WSD task is 24%, and again the automatic ranking outperforms it by a large margin.
  • The first sense in SemCor provides an upper bound for this task of 67% (a toy sketch of the precision calculation appears after this list).
  • Since both measures gave comparable results, the authors restricted the remaining experiments to jcn because this gave good results for finding the predominant sense, and is much more efficient than lesk, given the precompilation of the IC files
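For concreteness, the evaluation against a hand-tagged resource reduces to token-level precision: for each noun token whose lemma has an automatically acquired predominant sense, check whether that sense matches the gold-standard tag. A toy sketch with made-up sense identifiers (the real experiments use SemCor and the SENSEVAL-2 data):

```python
# Toy sketch of the WSD precision calculation used to evaluate predominant senses.
def wsd_precision(predicted_first_sense, gold_tokens):
    """predicted_first_sense: {lemma: sense_id}; gold_tokens: [(lemma, gold_sense_id), ...]."""
    attempted = correct = 0
    for lemma, gold_sense in gold_tokens:
        if lemma in predicted_first_sense:            # only score lemmas we could rank
            attempted += 1
            correct += predicted_first_sense[lemma] == gold_sense
    return correct / attempted if attempted else 0.0

# Hypothetical sense identifiers, for illustration only.
predicted = {"plant": "plant.n.01", "bank": "bank.n.01"}
gold = [("plant", "plant.n.01"), ("plant", "plant.n.02"), ("bank", "bank.n.01")]
print(f"precision = {wsd_precision(predicted, gold):.2f}")  # 2 of 3 tokens correct -> 0.67
```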
Conclusion
  • There are cases where the acquired first sense disagrees with SemCor, yet is intuitively plausible.
  • The automatically acquired predominant senses were evaluated against the hand-tagged resources SemCor and the SENSEVAL-2 English all-words task, giving a WSD precision of 64% on an all-nouns task
  • This is just 5% lower than results using the first sense in the manually labelled SemCor, and the authors obtain 67% precision on polysemous nouns that are not in SemCor. In many cases the sense ranking provided in SemCor differs from that obtained automatically because the authors used the BNC to produce the thesaurus.
  • The authors will use balanced and domain-specific corpora to isolate words having very different neighbours, and rankings, in the different corpora, and to detect and target words for which there is a highly skewed sense distribution in these corpora
Tables
  • Table 1: SemCor results for the jcn and lesk WordNet similarity measures
  • Table 2: Evaluating predominant sense information on SENSEVAL-2 all-words data
  • Table 3: Domain-specific results
Related Work
  • Most research in WSD concentrates on using contextual features, typically neighbouring words, to help determine the correct sense of a target word. In contrast, our work is aimed at discovering the predominant senses from raw text because the first sense heuristic is such a useful one, and because hand-tagged data is not always available.

    A major benefit of our work, rather than reliance on hand-tagged training data such as SemCor, is that this method permits us to produce predominant senses for the domain and text type required. Buitelaar and Sacaleanu (2001) have previously explored ranking and selection of synsets in GermaNet for specific domains using the words in a given synset, and those related by hyponymy, and a term relevance measure taken from information retrieval. Buitelaar and Sacaleanu have evaluated their method on identifying domain specific concepts using human judgements on 100 items. We have evaluated our method using publicly available resources, both for balanced and domain specific text. Magnini and Cavaglia (2000) have identified WordNet word senses with particular domains, and this has proven useful for high precision WSD (Magnini et al., 2001); indeed in section 5 we used these domain labels for evaluation. Identification of these domain labels for word senses was semiautomatic and required a considerable amount of hand-labelling. Our approach is complementary to this. It only requires raw text from the given domain and because of this it can easily be applied to a new domain, or sense inventory, given sufficient text.
Funding
  • This work was funded by EU-2001-34460 project MEANING: Developing Multilingual Web-scale Language Technologies, UK EPSRC project Robust Accurate Statistical Parsing (RASP) and a UK EPSRC studentship
References
  • Satanjeev Banerjee and Ted Pedersen. 2002. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-02), Mexico City.
  • Edward Briscoe and John Carroll. 2002. Robust accurate statistical annotation of general text. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), pages 1499–1504, Las Palmas, Canary Islands, Spain.
  • Paul Buitelaar and Bogdan Sacaleanu. 2001. Ranking and selecting synsets by domain relevance. In Proceedings of WordNet and Other Lexical Resources: Applications, Extensions and Customizations, NAACL 2001 Workshop, Pittsburgh, PA.
  • Massimiliano Ciaramita and Mark Johnson. 2003. Supersense tagging of unknown nouns in WordNet. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2003).
  • Scott Cotton, Phil Edmonds, Adam Kilgarriff, and Martha Palmer. 1998. SENSEVAL-2. http://www.sle.sharp.co.uk/senseval2/.
  • Jordi Daude, Lluis Padro, and German Rigau. 2000. Mapping wordnets using structural information. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong.
  • Veronique Hoste, Anne Kool, and Walter Daelemans. 2001. Classifier optimization and combination in the English all words task. In Proceedings of the SENSEVAL-2 workshop, pages 84–86.
  • Jay Jiang and David Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In International Conference on Research in Computational Linguistics, Taiwan.
  • Anna Korhonen. 2002. Semantically motivated subcategorization acquisition. In Proceedings of the ACL Workshop on Unsupervised Lexical Acquisition, Philadelphia, USA.
  • Mirella Lapata and Chris Brew. 2004. Verb class disambiguation using informative priors. Computational Linguistics, 30(1):45–75.
  • Beth Levin. 1993. English Verb Classes and Alternations: a Preliminary Investigation. University of Chicago Press, Chicago and London.
  • Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL 98, Montreal, Canada.
  • Bernardo Magnini and Gabriela Cavaglia. 2000. Integrating subject field codes into WordNet. In Proceedings of LREC-2000, Athens, Greece.
  • Bernardo Magnini, Carlo Strapparava, Giovanni Pezzuli, and Alfio Gliozzo. 2001. Using domain information for word sense disambiguation. In Proceedings of the SENSEVAL-2 workshop, pages 111–114.
  • Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. 2004. Using automatically acquired predominant senses for word sense disambiguation. In Proceedings of the ACL SENSEVAL-3 workshop.
  • Diana McCarthy. 1997. Word sense disambiguation for acquisition of selectional preferences. In Proceedings of the ACL/EACL 97 Workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 52–61.
  • Paola Merlo and Matthias Leybold. 2001. Automatic distinction of arguments and modifiers: the case of prepositional phrases. In Proceedings of the Workshop on Computational Language Learning (CoNLL 2001), Toulouse, France.
  • George A. Miller, Claudia Leacock, Randee Tengi, and Ross T Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology, pages 303–308. Morgan Kaufman.
Best Paper
Winner of the ACL 2004 Best Paper Award