We present SynSetExpan, a novel framework jointly conducting two tasks, and SE2 dataset, the first large-scale synonym-enhanced set expansion dataset

SynSetExpan: An Iterative Framework for Joint Entity Set Expansion and Synonym Discovery

EMNLP 2020, pp. 8292-8307 (2020)


Abstract

Entity set expansion and synonym discovery are two critical NLP tasks. Previous studies accomplish them separately, without exploring their interdependencies. In this work, we hypothesize that these two tasks are tightly coupled because two synonymous entities tend to have a similar likelihood of belonging to various semantic classes. Thi...

Introduction
  • Entity set expansion (ESE) aims to expand a small set of seed entities (e.g., {“United States”, “Canada”}) into a larger set of entities that belong to the same semantic class (i.e., Country).
  • Entity synonym discovery (ESD) intends to group all terms in a vocabulary that refer to the same real-world entity (e.g., “America” and “USA” refer to the same country) into a synonym set.
  • Those discovered entities and synsets include rich knowledge and can benefit many downstream applications such as semantic search (Xiong et al, 2017).
Highlights
  • Given (1) a text corpus D, (2) a vocabulary V derived from D, and (3) a seed set of user-provided entity synonym sets S_0 that belong to the same semantic class C, we aim to (1) select a subset of entities V_C from V that all belong to C, and (2) cluster all terms in V_C into entity synsets S_{V_C}, where the union of all clusters is equal to V_C
  • SynSetExpan-NoSYN achieves performance comparable to the current state-of-the-art methods on the Wiki and APR datasets and outperforms previous methods on the SE2 dataset, which demonstrates the effectiveness of our set expansion model alone
  • This paper shows entity set expansion and synonym discovery are two tightly coupled tasks and can mutually enhance each other
  • We present SynSetExpan, a novel framework jointly conducting two tasks, and SE2 dataset, the first large-scale synonym-enhanced set expansion dataset
  • Extensive experiments on SE2 and several other benchmark datasets demonstrate the effectiveness of SynSetExpan on both tasks
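
The problem formulation in the highlights above can be made concrete with a small check on the shape of the output. This is a minimal sketch under stated assumptions, not code from the paper: the function name `satisfies_formulation` and the toy vocabulary are hypothetical, and the requirement that synsets do not overlap is an added assumption (the summary only states that the union of the clusters equals V_C).

```python
from __future__ import annotations

from typing import Iterable


def satisfies_formulation(vocabulary: set[str],
                          class_entities: set[str],
                          synsets: Iterable[frozenset[str]]) -> bool:
    """Check the two output constraints of the joint task on toy data."""
    synsets = list(synsets)
    # (1) The selected entities V_C must be a subset of the vocabulary V.
    if not class_entities <= vocabulary:
        return False
    # (2) The union of all synsets must be exactly V_C.
    covered: set[str] = set().union(*synsets)
    if covered != class_entities:
        return False
    # Assumption (not stated in the summary): synsets are pairwise disjoint,
    # so the sizes of the clusters add up to the size of their union.
    return sum(len(s) for s in synsets) == len(covered)


# Hypothetical toy vocabulary for the semantic class "US State".
V = {"Illinois", "IL", "Land of Lincoln", "Texas", "TX", "Chicago"}
V_C = {"Illinois", "IL", "Land of Lincoln", "Texas", "TX"}
S_VC = [frozenset({"Illinois", "IL", "Land of Lincoln"}),
        frozenset({"Texas", "TX"})]
print(satisfies_formulation(V, V_C, S_VC))
```

The check passes for the toy output because every term of V_C appears in exactly one synset and "Chicago", which is not a state, is left out of V_C.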
Methods
  • The authors compare the following corpus-based set expansion methods: (1) EgoSet (Rong et al, 2016): A method initially proposed for multifaceted set expansion using skip-grams and word2vec embeddings.
  • (3) SetExpander (Mamou et al, 2018b): A one-time entity ranking method based on multi-context term similarity defined on multiple embeddings.
  • (9) SynSetExpan-NoSYN: A variant of the proposed SynSetExpan framework without the synonym discovery model.
  • The authors compare the following synonym discovery methods: (1) SVM: A classification method trained on given term pair features.
  • (5) SynSetExpan: The authors' proposed framework that fine-tunes synonym discovery model using set expansion results.
  • More implementation details and hyper-parameter choices are discussed in supplementary materials Section G
Results
  • The authors analyze the set expansion performance from several aspects, beginning with overall performance.
  • The authors find that SynSetExpan can further improve SynSetExpan-NoFT via model fine-tuning, which demonstrates that set expansion can help synonym discovery.
  • SynSetExpan is able to detect different types of entity synsets across various semantic classes.
  • The authors highlight those entities discovered only after model fine-tuning, which shows that with fine-tuning the SynSetExpan framework detects more accurate synsets.
Conclusion
  • This paper shows entity set expansion and synonym discovery are two tightly coupled tasks and can mutually enhance each other.
  • The authors present SynSetExpan, a novel framework jointly conducting two tasks, and SE2 dataset, the first large-scale synonym-enhanced set expansion dataset.
  • Extensive experiments on SE2 and several other benchmark datasets demonstrate the effectiveness of SynSetExpan on both tasks.
  • The authors plan to study how to apply SynSetExpan at the entity mention level for contextualized synonym discovery and set expansion.
Tables
  • Table 1: SE2 dataset statistics. A suitable benchmark requires a corpus, a vocabulary with labeled synsets, a set of complete semantic classes, and a list of seed queries. To the best of the authors' knowledge, no such public benchmark existed, so they build the first Synonym Enhanced Set Expansion (SE2) benchmark dataset in this study.
  • Table 2: Difficulty of each semantic class for entity set expansion (ESE) and entity synonym discovery (ESD)
  • Table 3: Set expansion results on three datasets. MCTS and SetCoExpan do not scale to the SE2 dataset. SynSetExpan-Full is inapplicable for Wiki and APR datasets because they contain no synonym information. The superscript ∗ indicates the improvement is statistically significant compared to SynSetExpan-NoSYN
  • Table 4: Ratio of semantic classes on which SynSetExpan outperforms SynSetExpan-NoSYN
  • Table 5: Ratio of seed queries from the SE2 dataset on which the first method outperforms the second one
  • Table 6: Synonym discovery results on both SE2 dataset and PubMed dataset
  • Table 7: All entity pair features used in our synonym discovery model
  • Table 8: Comparison of ESE datasets
  • Table 9: Example Semantic Classes in SE2 Dataset
Related Work
  • Entity Set Expansion. Entity set expansion can benefit many downstream applications such as question answering (Wang and Cohen, 2008), literature search (Shen et al, 2018b), and online education (Yu et al, 2019a). Traditional entity set expansion systems such as GoogleSet (Tong and Dean, 2008) and SEAL (Wang and Cohen, 2007) require seed-oriented online data extraction, which can be time-consuming and costly. Thus, more recent studies (Shen et al, 2017; Mamou et al, 2018b; Yu et al, 2019c; Huang et al, 2020; Zhang et al, 2020) expand the seed set by processing a given corpus offline. These corpus-based methods follow two general approaches: (1) one-time entity ranking (Pantel et al, 2009; He and Xin, 2011; Mamou et al, 2018b; Kushilevitz et al, 2020), which calculates all candidate entities’ distributional similarities with seed entities and makes a one-time ranking without back-and-forth refinement, and (2) iterative bootstrapping (Rong et al, 2016; Shen et al, 2017; Huang et al, 2020; Zhang et al, 2020), which starts from seed entities to extract quality textual patterns, applies the extracted patterns to obtain more quality entities, and iterates this process until sufficient entities are discovered. In this work, in addition to adding entities into the set, we go one step further and organize those expanded entities into synonym sets. Furthermore, we show that those detected synonym sets can in turn help to improve set expansion results.
  • Synonym Discovery. Early efforts on synonym discovery focus on finding entity synonyms from structured or semi-structured data such as query logs (Ren and Cheng, 2015), web tables (He et al, 2016), and synonymy dictionaries (Ustalov et al, 2017b,a). In comparison, this work aims to develop a method to extract synonym sets directly from a raw text corpus. Given a corpus and a term list, one can leverage surface strings (Wang et al, 2019), co-occurrence statistics (Baroni and Bisi, 2004), textual patterns (Yahya et al, 2014), distributional similarity (Wang et al, 2015), or their combinations (Qu et al, 2017; Fei et al, 2019) to extract synonyms. These methods mostly find synonymous term pairs or a ranked list of a query entity’s synonyms, instead of entity synonym sets. Some studies propose to further cut off the ranked list into a set output (Ren and Cheng, 2015) or to build a synonym graph and then apply graph clustering techniques to derive synonym sets (Oliveira and Gomes, 2014; Ustalov et al, 2017b). However, they all operate directly on the entire input vocabulary, which can be too extensive and noisy. Compared to them, our approach can leverage the semantic class information detected from set expansion to enhance the synonym set discovery process.
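
The one-time entity ranking approach mentioned under Entity Set Expansion can be illustrated with a short sketch. It is only a toy version of the general idea (score each candidate by its average cosine similarity to the seeds, then rank once); it is not the implementation of any cited system, and the two-dimensional vectors and the names `cosine` and `rank_candidates` are hypothetical.

```python
from __future__ import annotations

import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(embeddings: dict[str, list[float]],
                    seeds: set[str]) -> list[str]:
    """Rank every non-seed term once by mean similarity to the seed terms."""
    scores = {
        term: sum(cosine(vec, embeddings[s]) for s in seeds) / len(seeds)
        for term, vec in embeddings.items()
        if term not in seeds
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embeddings; real systems learn these from the corpus.
toy_embeddings = {
    "United States": [0.9, 0.1],
    "Canada": [0.8, 0.2],
    "Mexico": [0.85, 0.15],
    "Chicago": [0.3, 0.7],
    "banana": [0.0, 1.0],
}
print(rank_candidates(toy_embeddings, {"United States", "Canada"}))
```

Iterative bootstrapping differs in that it would feed the top-ranked entities back in as new seeds and repeat, whereas this sketch ranks the candidates a single time.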
Funding
  • Research was sponsored in part by US DARPA KAIROS Program No. FA8750-19-2-1004 and SocialSim Program No. W911NF-17-C0099, NSF IIS 16-18481, IIS 17-04532, and IIS 17-41317, and DTRA HDTRA11810026
Study Subjects and Analysis
workers: 3
On average, workers spend 40 seconds on each task and are paid $0.1. All ⟨class, entity⟩ pairs are labeled by three workers independently, and the inter-annotator agreement is 0.8204, measured by Fleiss' kappa. Finally, we enrich each semantic class C_j by adding every entity e_i whose corresponding pair ⟨C_j, e_i⟩ is labeled “True” by at least two workers.
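
The annotation procedure above combines two standard steps, a majority vote over three labels and Fleiss' kappa as the agreement measure. The sketch below shows both on made-up votes; the data, the function names `accepted_by_majority` and `fleiss_kappa`, and the resulting number are illustrative assumptions and do not reproduce the paper's 0.8204.

```python
from __future__ import annotations

from collections import Counter


def accepted_by_majority(votes: tuple[bool, ...], min_true: int = 2) -> bool:
    """A pair is kept if at least `min_true` workers labeled it True."""
    return sum(votes) >= min_true


def fleiss_kappa(ratings: list[tuple[bool, ...]]) -> float:
    """Fleiss' kappa for items that each received the same number of labels.

    Undefined (division by zero) when every label falls in one category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    categories = (True, False)
    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items
    # Agreement expected by chance from the overall category proportions.
    p_e = sum(
        (sum(c[cat] for c in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical votes of three workers on four (class, entity) pairs.
toy_votes = {
    ("Country", "Canada"): (True, True, True),
    ("Country", "Mexico"): (True, True, False),
    ("Country", "Chicago"): (False, False, False),
    ("Country", "Texas"): (False, False, True),
}
enriched = {pair for pair, v in toy_votes.items() if accepted_by_majority(v)}
print(sorted(enriched))
print(round(fleiss_kappa(list(toy_votes.values())), 4))
```

With three workers per pair, the "at least two" rule is a strict majority, so every pair receives a decision without ties.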

pairs: 7625
Terms mapped to the same entity in WikiData are treated as positive pairs, and two human annotators label the remaining 7,625 pairs. The inter-annotator agreement is 0.8431, measured by Fleiss' kappa.

public datasets: 3
Datasets. We evaluate SynSetExpan on three public datasets. The first two are benchmark datasets widely used in previous studies (Shen et al, 2017; Yan et al, 2019; Zhang et al, 2020): (1) Wiki, which contains 8 semantic classes, 40 seed queries, and a subset of English Wikipedia articles, and (2) APR, which includes 3 semantic classes, 15 seed queries, and all news articles published by Associated Press and Reuters in 2015

synonym pairs: 60186
Datasets. We evaluate SynSetExpan for synonym discovery task on two datasets: (1) SE2, which contains 60,186 synonym pairs (3,067 positive pairs and 57,119 negative pairs), and (2) PubMed, a public benchmark used in (Qu et al, 2017; Shen et al, 2019), which contains 203,648 synonym pairs (10,486 positive pairs and 193,162 negative pairs). More details can be found in supplementary materials Section G.1
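
The pair counts quoted above can be sanity-checked with a few lines of arithmetic. This is only an illustrative sketch: the helper name `positive_ratio` is hypothetical, and the numbers are the ones stated in the paragraph.

```python
def positive_ratio(positive: int, negative: int, total: int) -> float:
    """Share of positive pairs, after checking that the counts add up."""
    if positive + negative != total:
        raise ValueError("positive and negative pairs do not sum to the total")
    return positive / total


# (positive, negative, total) pair counts as reported for each dataset.
datasets = {
    "SE2": (3_067, 57_119, 60_186),
    "PubMed": (10_486, 193_162, 203_648),
}
for name, (pos, neg, total) in datasets.items():
    print(f"{name}: {positive_ratio(pos, neg, total):.1%} positive")
```

Both datasets turn out to be heavily imbalanced, with roughly one positive pair in twenty.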

References
  • Marco Baroni and Sabrina Bisi. 2004. Using cooccurrence statistics and the web to discover synonyms in a technical language. In LREC.
  • Chandra Bhagavatula, Thanapon Noraset, and Doug Downey. 2015. Tabel: Entity linking in web tables. In ISWC.
  • Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks.
  • Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. TACL.
  • Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In KDD.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
  • Hongliang Fei, Shulong Tan, and Ping Li. 2019. Hierarchical multi-task word embedding learning for synonym prediction. In KDD.
  • Faegheh Hasibi, Fedor Nikolaev, Chenyan Xiong, Krisztian Balog, Svein Erik Bratsberg, Alexander Kotov, and James P. Callan. 2017. Dbpedia-entity v2: A test collection for entity search. In SIGIR.
  • Yeye He, Kaushik Chakrabarti, Tao Cheng, and Tomasz Tylenda. 2016. Automatic discovery of attribute synonyms using query logs and table corpora. In WWW.
  • Yeye He and Dong Xin. 2011. Seisa: set expansion by iterative similarity aggregation. In WWW.
  • Jiaxin Huang, Yiqing Xie, Yu Meng, Jiaming Shen, Yunyi Zhang, and Jiawei Han. 2020. Guiding corpus-based set expansion by auxiliary sets generation and co-expansion.
  • Guy Kushilevitz, Shaul Markovitch, and Yoav Goldberg. 2020. A two-stage masked lm method for term set expansion. In ACL.
  • Jonathan Mamou, Oren Pereg, Moshe Wasserblat, Ido Dagan, Yoav Goldberg, Alon Eirew, Yael Green, Shira Guskin, Peter Izsak, and Daniel Korat. 2018a. Term set expansion based on multi-context term embeddings: an end-to-end workflow. In COLING.
  • Jonathan Mamou, Oren Pereg, Moshe Wasserblat, Alon Eirew, Yael Green, Shira Guskin, Peter Izsak, and Daniel Korat. 2018b. Term set expansion based nlp architect by intel ai lab. In EMNLP.
  • Oren Melamud, David McClosky, Siddharth Patwardhan, and Mohit Bansal. 2016. The role of context types and dimensionality in learning word embeddings. In HLT-NAACL.
  • Yu Meng, Jiaxin Huang, Guangyuan Wang, Chao Zhang, Honglei Zhuang, Lance Kaplan, and Jiawei Han. 2019. Spherical text embedding. In NeurIPS.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.
  • Hugo Gonçalo Oliveira and Paulo Gomes. 2014. Eco and onto.pt: a flexible approach for creating a portuguese wordnet automatically. Language Resources and Evaluation, 48:373–393.
  • Patrick Pantel, Eric Crestan, Arkady Borkovsky, Ana-Maria Popescu, and Vishnu Vyas. 2009. Web-scale distributional similarity and entity set expansion. In EMNLP.
  • Meng Qu, Xiang Ren, and Jiawei Han. 2017. Automatic synonym discovery with knowledge bases. In KDD.
  • Xiang Ren and Tao Cheng. 2015. Synonym discovery for structured entities on heterogeneous graphs. In WWW.
  • Xin Rong, Zhe Chen, Qiaozhu Mei, and Eytan Adar. 2016. Egoset: Exploiting word ego-networks and user-generated ontology for multifaceted set expansion. In WSDM.
  • Jiaming Shen, Ruiliang Lv, Xiang Ren, Michelle Vanni, Brian Sadler, and Jiawei Han. 2019. Mining entity synonyms with efficient neural set generation. In AAAI.
  • Jiaming Shen, Zeqiu Wu, Dongming Lei, Jingbo Shang, Xiang Ren, and Jiawei Han. 2017. Setexpan: Corpus-based set expansion via context feature selection and rank ensemble. In ECML/PKDD.
  • Jiaming Shen, Zeqiu Wu, Dongming Lei, Chao Zhang, Xiang Ren, Michelle T. Vanni, Brian M. Sadler, and Jiawei Han. 2018a. Hiexpan: Task-guided taxonomy construction by hierarchical tree expansion. In KDD.
  • Jiaming Shen, Jinfeng Xiao, Xinwei He, Jingbo Shang, Saurabh Sinha, and Jiawei Han. 2018b. Entity set search of scientific literature: An unsupervised ranking approach. In SIGIR.
  • Simon Tong and Jeff Dean. 2008. System and methods for automatically creating lists. US Patent 7,350,187.
  • Dmitry Ustalov, Mikhail Chernoskutov, Christian Biemann, and Alexander Panchenko. 2017a. Fighting with the sparsity of synonymy dictionaries for automatic synset induction. In AIST.
  • Dmitry Ustalov, Alexander Panchenko, and Christian Biemann. 2017b. Watset: Automatic induction of synsets from a graph of synonyms. In ACL.
  • Huazheng Wang, Bin Gao, Jiang Bian, Fei Tian, and Tie-Yan Liu. 2015. Solving verbal comprehension questions in iq test by knowledge-powered word embedding. CoRR, abs/1505.07909.
  • Richard C. Wang and William W. Cohen. 2007. Language-independent set expansion of named entities using the web. In ICDM.
  • Richard C. Wang and William W. Cohen. 2008. Iterative set expansion of named entities using the web. In ICDM.
  • Zhen Wang, Xiang An Yue, Soheil Moosavinasab, Yungui Huang, Simon Lin, and Huan Sun. 2019. Surfcon: Synonym discovery on privacy-aware clinical data. In KDD.
  • Chenyan Xiong, Russell Power, and James P. Callan. 2017. Explicit semantic ranking for academic search via knowledge graph embedding. In WWW.
  • Mohamed Yahya, Steven Euijong Whang, Rahul Gupta, and Alon Y. Halevy. 2014. Renoun: Fact extraction for nominal attributes. In EMNLP.
  • Lingyong Yan, Xianpei Han, Le Sun, and Ben He. 2019. Learning to bootstrap for entity set expansion. In EMNLP.
  • Jifan Yu, Chenyu Wang, Gan Luo, Lei Hou, Juan-Zi Li, Zhiyuan Liu, and Jie Tang. 2019a. Course concept expansion in moocs with external knowledge and interactive game. In ACL.
  • Puxuan Yu, Zhiqi Huang, Razieh Rahimi, and James Allan. 2019b. Efficient corpus-based set expansion with lexico-syntactic features and distributed representations. In SIGIR.
  • Puxuan Yu, Zhiqi Huang, Razieh Rahimi, and James D Allan. 2019c. Corpus-based set expansion with lexical features and distributed representations. In SIGIR.
  • Shuo Zhang and Krisztian Balog. 2018. On-the-fly table generation. In SIGIR.
  • Yunyi Zhang, Jiaming Shen, Jingbo Shang, and Jiawei Han. 2020. Empower entity set expansion via language model probing. In ACL.
  • Wanzheng Zhu, Hongyu Gong, Jiaming Shen, Chao Zhang, Jingbo Shang, S. Bhat, and J. Han. 2020. Fuse: Multi-faceted set expansion by coherent clustering of skip-grams. In ECML/PKDD.