OAG: Toward Linking Large-scale Heterogeneous Entity Graphs

pp.2585-2595 (2019)


Abstract

Linking entities from different sources is a fundamental task in building open knowledge graphs. Despite much research conducted in related fields, the challenges of linking large-scale heterogeneous entity graphs are far from resolved. Employing two billion-scale academic entity graphs (Microsoft Academic Graph and AMiner) as sources for ...

Introduction
  • Entity linking, also called ontology alignment and disambiguation [12], is the task of determining the identity of entities across different sources.
  • Despite a large body of research, the challenges of linking Web-scale heterogeneous entity graphs from different sources are far from resolved.
  • Web-based entity graphs are usually heterogeneous, in the sense that they consist of various types of entities, such as author, paper, and venue entities in academic graphs.
  • Entity names can be highly ambiguous: there are more than 10,000 authors with the name “James Smith” in Microsoft Academic Graph (MAG) [26].
  • The scale of entity graphs on the Web is usually large, with billions of entities.
Highlights
  • Entity linking, also called ontology alignment and disambiguation [12], is the task of determining the identity of entities across different sources.
  • To link venues, which are coarse-grained and word-sequence dependent entities, we customize long short-term memory networks (LSTM) [10] to capture the sequential dependency in venue names. To link paper entities, which are relatively less ambiguous but at a very large scale, we present locality-sensitive hashing [6] and convolutional neural network (CNN) [14] based techniques for fast and accurate matching. To link large-scale author entities, which are highly ambiguous, we propose heterogeneous graph attention networks (HGAT).
  • We focus on the problem of entity linking across two heterogeneous academic entity graphs: Microsoft Academic Graph and AMiner, each of which consists of author, paper, and venue entities.
  • We study an important problem of linking large-scale heterogeneous entity graphs.
  • We focus on building a large linked academic entity graph.
  • The linked results have been published as Open Academic Graph (OAG).
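
The hashing step mentioned above for paper linking can be illustrated with a small sketch. The text here does not describe the authors' actual hashing design, so this example swaps in a generic technique, sign random projection (random-hyperplane LSH) over bag-of-words title vectors. The names `NUM_BITS`, `word_projections`, `title_code`, and `build_buckets` are hypothetical and are not taken from the paper or its code.

```python
import hashlib
import random
from collections import defaultdict

NUM_BITS = 16  # length of the binary code; a hypothetical choice


def word_projections(word):
    """Deterministic pseudo-random Gaussian values, one per hyperplane.

    Seeding from an MD5 digest gives each word the same values on every run,
    which stands in for a fixed random projection matrix over the vocabulary.
    """
    seed = int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(NUM_BITS)]


def title_code(title):
    """Project the bag-of-words title vector and keep the sign of each coordinate."""
    totals = [0.0] * NUM_BITS
    for word in title.lower().split():
        for i, value in enumerate(word_projections(word)):
            totals[i] += value
    return tuple(1 if t >= 0.0 else 0 for t in totals)


def build_buckets(titles):
    """Group title indices by binary code so lookups only touch one bucket."""
    buckets = defaultdict(list)
    for idx, title in enumerate(titles):
        buckets[title_code(title)].append(idx)
    return buckets
```

In this sketch, titles that share most of their words tend to receive codes that agree on most bits, so candidate pairs can be found by comparing codes rather than comparing every pair of titles. A more accurate model (the text names a CNN) would then score the remaining candidates.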
Methods
  • The authors compare the proposed methods with the following baselines. Methods that share a name are designed differently for each entity type, according to its characteristics.

    Keyword:
    – For venues: The authors use the TF-IDF weighted Jaccard index to measure the similarity of two venue names.
    – For papers: Given a paper p ∈ N_P, the authors use its title keywords to find candidate papers in C_P through an inverted index table, then re-rank these candidates by the edit distance between the two titles and the character-level similarity of their author names.
    – For authors: The authors match two authors if and only if they have exactly the same full name.

    SVM:
    – For venues: The authors use similarity scores of venue integral sequences and keyword sequences as input features.
    – For papers: The authors use similarity scores of paper titles and authors as input features.
    – For authors: The authors use similarity scores of author names, affiliations, venues, papers, and coauthors as input features.
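
The venue Keyword baseline above can be sketched as follows. The text only names the measure (a TF-IDF weighted Jaccard index), so the details here are assumptions: whitespace tokenization, a smoothed IDF weight per token, and a default weight for unseen tokens. The function names `idf_weights` and `weighted_jaccard` are hypothetical.

```python
import math
from collections import Counter


def idf_weights(names):
    """Smoothed inverse document frequency per token (an assumed weighting)."""
    n = len(names)
    doc_freq = Counter(tok for name in names for tok in set(name.lower().split()))
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def weighted_jaccard(name_a, name_b, idf, default_weight=1.0):
    """Sum of weights of shared tokens divided by sum of weights of all tokens."""
    tokens_a = set(name_a.lower().split())
    tokens_b = set(name_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    shared = tokens_a & tokens_b

    def weight(tok):
        # Tokens not seen when building the IDF table get the default weight.
        return idf.get(tok, default_weight)

    return sum(weight(t) for t in shared) / sum(weight(t) for t in union)
```

With this weighting, frequent tokens such as "conference" or "on" contribute less to the score than rare, distinguishing tokens, which is the usual motivation for using IDF weights instead of a plain Jaccard index.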
Results
  • Table 1 shows the overall linking performance of different methods.
  • The overall F1-score is weighted by the number of test samples in each linking problem (i.e., 361, 9,234, and 5,000 test pairs for venues, papers, and authors, respectively).
  • The authors compare and analyze the results on the linking of venues, papers, and authors one by one.
  • LinKG outperforms the other methods.
  • LinKG_C achieves good performance for venue linking, because a CNN is capable of capturing word-order matching patterns.
  • Compared with CNN, LSTM can process variable-sized sequences, whereas CNN cannot.
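
The sample-weighted overall F1-score described above can be computed as in the following sketch. The test-pair counts come from the text; the function name `weighted_f1` and the dictionary keys are hypothetical, and no per-task scores from the paper are reproduced here.

```python
# Test-pair counts stated in the text for each linking problem.
TEST_PAIRS = {"venue": 361, "paper": 9234, "author": 5000}


def weighted_f1(f1_by_task, counts=TEST_PAIRS):
    """Average per-task F1-scores, weighting each task by its number of test pairs."""
    total = sum(counts[task] for task in f1_by_task)
    return sum(f1_by_task[task] * counts[task] for task in f1_by_task) / total
```

Because papers and authors account for most of the 14,595 test pairs, the overall score is dominated by those two tasks; the venue score has little influence on it.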
Conclusion
  • The authors study an important problem of linking large-scale heterogeneous entity graphs.
  • The authors focus on building a large linked academic entity graph.
  • The authors propose a unified framework, LinKG, to deal with the linking problem.
  • The authors evaluated the proposed framework and compared it with several state-of-the-art approaches.
  • Experimental results show that the proposed framework LinKG can achieve a very high linking accuracy with an F1-score of 0.9510, significantly outperforming the state of the art.
  • The linked results have been published as Open Academic Graph (OAG).
Tables
  • Table 1: Results of linking heterogeneous entity graphs. “–” indicates that the method does not support the entity linking task.
  • Table 2: Paper linking performance
  • Table 3: Running time of different methods for paper linking (in seconds)
  • Table 4: Author linking results of our model variants
Related Work
  • In this section, we review relevant literature about entity linking. Entity linking, also known as data integration, record linkage, etc., is a classical problem that was put forward more than six decades ago [7]. Literature reviews can be found in [7, 25].

    There are various approaches to tackle this problem. Li et al. [15] argue for rule-based methods and develop a rule discovery algorithm. Another important thread of research is based on machine learning algorithms. Tang et al. [31] regard entity matching as minimizing the Bayesian risk of decision making. Some works [22, 23] attempt to use less labeled data and employ semi-supervised or unsupervised matching algorithms. For example, Rong et al. [22] transform the entity matching problem into a binary classification problem and use pairwise similarity vectors as training data. Wang et al. [35] present a factor graph model to learn the alignment across knowledge bases. For data integration across social networks or other networks, some works [29, 39] incorporate network structure to develop effective algorithms. Zhang et al. [39] propose COSNET, an energy-based model that considers the global and local consistency of multiple networks.
Funding
  • The work is supported by the NSFC for Distinguished Young Scholar (61825602) and NSFC (61836013), and a research fund supported by MSRA.
  • Xiao Liu is supported by Tsinghua University Initiative Scientific Research Program and DCST Student Academic Training Program.
References
  • Ron Bekkerman and Andrew McCallum. 2005. Disambiguating Web Appearances of People in a Social Network. In WWW’05. 463–470.
  • Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. 2009. Swoosh: a generic approach to entity resolution. The VLDB Journal 18, 1 (2009), 255–276.
  • Indrajit Bhattacharya and Lise Getoor. 2007. Collective entity resolution in relational data. TKDE 1, 1 (2007), 5.
  • Mikhail Bilenko. 2004. Learnable Similarity Functions and their Applications to Clustering and Record Linkage. In AAAI’04. 981–982.
  • George E Dahl, Tara N Sainath, and Geoffrey E Hinton. 2013. Improving deep neural networks for LVCSR using rectified linear units and dropout. In ICASSP’13. 8609–8613.
  • Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In SCG’04. 253–262.
  • Ahmed K Elmagarmid, Panagiotis G Ipeirotis, and Vassilios S Verykios. 2007. Duplicate record detection: A survey. TKDE 19, 1 (2007), 1–16.
  • Hui Han, Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. 2004. Two Supervised Learning Approaches for Name Disambiguation in Author Citations. In JCDL’04. 296–305.
  • Mark Heimann, Haoming Shen, Tara Safavi, and Danai Koutra. 2018. Regal: Representation learning-based graph alignment. In CIKM’18. 117–126.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
  • Lili Jiang, Jianyong Wang, Ning An, Shengyuan Wang, Jian Zhan, and Lian Li. 2009. Grape: A graph-based framework for disambiguating people appearances in web search. In ICDM’09. 199–208.
  • Mahboob Alam Khalid, Valentin Jijkoun, and Maarten De Rijke. 2008. The impact of named entity normalization on information retrieval for question answering. In ECIR’08. 705–710.
  • Quoc V Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In ICML, Vol. 14. 1188–1196.
  • Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
  • Lingli Li, Jianzhong Li, and Hong Gao. 2015. Rule-Based method for entity resolution. TKDE 27, 1 (2015), 250–263.
  • Xin Li, Paul Morie, and Dan Roth. 2004. Identification and tracing of ambiguous names: Discriminative and generative approaches. In AAAI’04. 419–424.
  • Li Liu, William K Cheung, Xin Li, and Lejian Liao. 2016. Aligning Users across Social Networks Using Network Embedding. In IJCAI. 1774–1780.
  • Tong Man, Huawei Shen, Shenghua Liu, Xiaolong Jin, and Xueqi Cheng. 2016. Predict Anchor Links across Social Networks via an Embedding Approach. In IJCAI’16. 1823–1829.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS’13. 3111–3119.
  • Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics 2 (2014), 231–244.
  • Jiezhong Qiu, Jian Tang, Hao Ma, Yuxiao Dong, Kuansan Wang, and Jie Tang. 2018. Deepinf: Social influence prediction with deep learning. In KDD’18. ACM, 2110–2119.
  • Shu Rong, Xing Niu, Evan Xiang, Haofen Wang, Qiang Yang, and Yong Yu. 2012. A machine learning approach for instance matching based on similarity metrics. In ISWC’12. 460–475.
  • Sunita Sarawagi and Anuradha Bhamidipaty. 2002. Interactive deduplication using active learning. In KDD’02. 269–278.
  • Wei Shen, Jiawei Han, and Jianyong Wang. 2014. A probabilistic model for linking named entities in web text with heterogeneous information networks. In SIGMOD’14. ACM, 1199–1210.
  • Wei Shen, Jianyong Wang, and Jiawei Han. 2015. Entity linking with a knowledge base: Issues, techniques, and solutions. TKDE 27, 2 (2015), 443–460.
  • Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-june Paul Hsu, and Kuansan Wang. 2015. An overview of microsoft academic service (mas) and applications. In WWW’15. 243–246.
  • Rebecca C. Steorts, Samuel L. Ventura, Mauricio Sadinle, and Stephen E. Fienberg. 2014. A Comparison of Blocking Methods for Record Linkage. In PSD’14. 253–268.
  • Yizhou Sun and Jiawei Han. 2013. Mining heterogeneous information networks: a structural analysis approach. ACM SIGKDD Explorations Newsletter 14, 2 (2013), 20–28.
  • Shulong Tan, Ziyu Guan, Deng Cai, Xuzhen Qin, Jiajun Bu, and Chun Chen. 2014. Mapping Users across Networks by Manifold Alignment on Hypergraph. In AAAI’14. 159–165.
  • Jie Tang, A.C.M. Fong, Bo Wang, and Jing Zhang. 2012. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE TKDE 24, 6 (2012), 975–987.
  • Jie Tang, Juanzi Li, Bangyong Liang, Xiaotong Huang, Yi Li, and Kehong Wang. 2006. Using Bayesian decision for ontology mapping. Journal of Web Semantics 4, 4 (2006), 243–262.
  • Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In WWW’15. 1067–1077.
  • Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. ArnetMiner: Extraction and Mining of Academic Social Networks. In KDD’08. 990–998.
  • Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. ICLR (2018).
  • Zhichun Wang, Juanzi Li, Zhigang Wang, and Jie Tang. 2012. Cross-lingual knowledge linking across wiki knowledge bases. In WWW’12. 459–468.
  • Yang Yang, Yizhou Sun, Jie Tang, Bo Ma, and Juanzi Li. 2015. Entity Matching Across Heterogeneous Sources. In KDD’15. 1395–1404.
  • Xiaoxin Yin, Jiawei Han, and Philip S Yu. 2007. Object distinction: Distinguishing objects with identical names. In ICDE’07. 1242–1246.
  • Jing Zhang, Bo Chen, Xianming Wang, Hong Chen, Cuiping Li, Fengmei Jin, Guojie Song, and Yutao Zhang. 2018. MEgo2Vec: Embedding Matched Ego Networks for User Alignment Across Social Networks. In CIKM’18. 327–336.
  • Yutao Zhang, Jie Tang, Zhilin Yang, Jian Pei, and Philip Yu. 2015. COSNET: Connecting Heterogeneous Social Networks with Local and Global Consistency. In KDD’15. 1485–1494.
  • Yutao Zhang, Fanjin Zhang, Peiran Yao, and Jie Tang. 2018. Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop. In KDD’18. 1002–1011.
  • Yan Zhuang, Guoliang Li, Zhuojian Zhong, and Jianhua Feng. 2017. Hike: A hybrid human-machine method for entity alignment in large-scale knowledge bases. In CIKM’17. 1917–1926.
  • (2) There is no test part in the original code, so we copy the validation part of the original code and complete the test part.
  • (2) Add noisy data: we randomly replace some attributes of the generated positive pairs, including affiliations, venues, and papers.