OAGknow: Self-supervised Learning for Linking Knowledge Graphs

Abstract

We propose a self-supervised embedding learning framework—SelfLinKG—to link concepts in heterogeneous knowledge graphs. Without any labeled data, SelfLinKG can achieve competitive performance against its supervised counterpart, and significantly outperforms state-of-the-art unsupervised methods by 26%-50% under the linear classification protocol.
Introduction
  • Concept linking, the task of linking concepts with the same meaning, is critical for document-based data systems such as academic search engines (AMiner, Microsoft Academic Graph) and question-answering platforms (Reddit, StackOverflow), where knowledge bases containing concepts and their relations are developed independently within each system to support complex search and reasoning.
  • These knowledge bases are incomplete, and complementing one with another via concept linking is important for advanced applications.
  • Ambiguity, compounded by noise, makes the problem more severe.
Highlights
  • Concept linking, the task of linking concepts with the same meaning, is critical for document-based data systems such as academic search engines (AMiner, Microsoft Academic Graph) and question-answering platforms (Reddit, StackOverflow), where knowledge bases containing concepts and their relations are developed independently within each system to support complex search and reasoning.
  • We focus on the problem of concept linking between two public knowledge bases—the Open Academic Graph (OAG) [44] and the English Wikipedia
  • We propose SelfLinKG, a concept-learning framework that addresses the large-scale heterogeneous concept linking problem without the arduous and expensive process of producing massive amounts of labeled data.
  • We propose to leverage self-supervised learning to capture the intrinsic relations between concepts across the two knowledge bases, which helps mitigate the scalability issue of handling large-scale data.
  • We conduct extensive experiments on OAG and Wikipedia, which suggest that SelfLinKG achieves very high accuracy (97.33%) in a real application, significantly outperforming baseline models.
  • We propose a self-supervised model, SelfLinKG, for linking large-scale heterogeneous knowledge bases.
Methods
  • Although many embedding-based entity-alignment algorithms have emerged in recent years, most of them rely on supervised learning and therefore cannot serve as fair baselines for the self-supervised SelfLinKG.
  • General Linking results are reported as Hit@1, Hit@2, Hit@3, and Hit@5, with RESCAL among the compared baselines (a sketch of the Hit@K computation follows below).
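
Hit@K measures the fraction of queries whose true match appears among the top-K ranked candidates. Below is a minimal sketch of how such a metric can be computed; the score matrix and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def hit_at_k(scores: np.ndarray, true_idx: np.ndarray, k: int) -> float:
    """Fraction of queries whose true candidate ranks in the top k.

    scores:   (n_queries, n_candidates) similarity matrix (higher = better)
    true_idx: (n_queries,) index of the correct candidate for each query
    """
    # Indices of the k highest-scoring candidates per query.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == true_idx[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 queries, 4 candidates each.
scores = np.array([[0.9, 0.1, 0.3, 0.2],
                   [0.2, 0.8, 0.7, 0.1],
                   [0.4, 0.3, 0.2, 0.9]])
true_idx = np.array([0, 2, 3])
print(hit_at_k(scores, true_idx, 1))  # 2/3: query 1's true match ranks 2nd
print(hit_at_k(scores, true_idx, 2))  # 1.0
```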
Results
  • Results of Unsupervised Linking and SelfLinKG

    Table 1 shows the overall linking performance of each embedding-based method on three tasks: Synonym Linking, Disambiguation, and General Linking.
  • The results show that SelfLinKG consistently outperforms the alternatives (by 26%-33%) on every task.
  • For Synonym Linking, the names of a positive concept pair are not identical and are sometimes literally very different from each other.
  • SelfLinKG performs best among all methods, with high recall, F1, and AUROC, which means it omits only a small proportion of positive pairs, even those with entirely different names.
  • Its comparatively low precision is probably because it misclassifies some negative concept pairs with similar names as correct matches.
  • TransE performs comparatively well, while the other methods generally have low F1 scores of around 30%-40%.
Conclusion
  • The authors propose a self-supervised model, SelfLinKG, for linking large-scale heterogeneous knowledge bases.
  • SelfLinKG uses global momentum contrastive learning to learn a representation shared among multiple knowledge bases (a rough sketch of this idea follows after this list).
  • The authors' experiments on two large-scale graphs show that the proposed unsupervised SelfLinKG achieves performance comparable to its supervised counterpart.
  • The authors apply the model to automatically generate links among 14 different knowledge bases and make the linked graphs publicly available.
  • It would be interesting to design a knowledge-linking system that automatically harvests knowledge from the open Web, and exciting to explore novel methods that make the model more robust, as open data is always noisy.
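
As a rough illustration of the momentum contrast named above (a MoCo-style setup in the spirit of He et al. [12], not the authors' released code), the sketch below keeps a slowly updated key encoder and scores each query against one positive key and a queue of negatives; the encoders, dimensions, and hyperparameters are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def momentum_update(encoder_q, encoder_k, m=0.999):
    # The key encoder tracks the query encoder via an exponential moving
    # average instead of receiving gradients directly.
    with torch.no_grad():
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

def info_nce_loss(q, k_pos, queue, temperature=0.07):
    """InfoNCE loss: pull each query toward its positive key and push it
    away from the queued negatives.

    q:      (B, D) query embeddings, e.g. encodings of concept subgraphs
    k_pos:  (B, D) positive keys from the momentum (key) encoder
    queue:  (N, D) negative keys accumulated from earlier batches
    """
    q = F.normalize(q, dim=1)
    k_pos = F.normalize(k_pos, dim=1)
    queue = F.normalize(queue, dim=1)
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)       # (B, 1) positive logits
    l_neg = q @ queue.t()                              # (B, N) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive is class 0
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for encoder outputs.
q, k = torch.randn(8, 64), torch.randn(8, 64)
neg_queue = torch.randn(1024, 64)
print(info_nce_loss(q, k, neg_queue).item())
```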
Tables
  • Table 1: Results of linking performance under unsupervised settings.
  • Table 2: Statistics for the Disambiguation Task dataset. The compared embedding methods include: ComplEx [32], which uses vectors with complex values and retains the mathematical definition of the dot product; HolE [19], holographic embeddings that learn compositional vector-space representations of entire knowledge graphs and are related to holographic models of associative memory in that they employ circular correlation to create compositional representations (a numerical sketch of this operation appears after this list); DistMult [43], which focuses on neural-embedding models whose representations are learned with neural networks under energy-based objectives; GAKE [8], which formulates the knowledge base as a directed graph and learns representations for vertices and edges from the graph's structural information, introducing three types of graph context (neighbor context, path context, and edge context), each reflecting properties of knowledge from a different perspective; and self-supervised SelfLinKG, in which the subgraph of a concept is input and a shared encoder is trained across MAG and EnWiki with only the instance-discrimination pre-training task, which is unsupervised. In the unique-embedding settings there is no sharing, i.e., every single concept has its own embedding.
  • Table3: Ablation study on momentum value m
  • Table4: Ablation study on multi-head attention
  • Table 5: Overall statistics of OAGknow (T: Taxonomy; E: Encyclopedia; K: Knowledge Graph)
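
To make the circular-correlation operation in the HolE entry concrete: it can be computed with FFTs, and HolE scores a triple by matching the relation vector against the correlation of the head and tail embeddings. This is a generic sketch of the published formulation, using random toy vectors, not code from this paper.

```python
import numpy as np

def circular_correlation(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # [a * b]_k = sum_i a_i * b_{(k+i) mod d}, computed in O(d log d) via FFT.
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

def hole_score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    # HolE triple score: sigmoid(r . (h star t)).
    x = float(r @ circular_correlation(h, t))
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
h, r, t = (rng.standard_normal(64) for _ in range(3))
print(hole_score(h, r, t))  # probability-like score for the toy triple
```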
Related work
  • Concept linking, closely related to entity linking, ontology alignment, schema matching, and data integration, has been studied for decades [7]. Many approaches have been proposed to address the problem. For example, Li et al. [15] argue for rule-based methods and develop a rule-discovery algorithm. Tang et al. [28] use machine learning and cast concept linking as minimizing the Bayesian risk of decision making. As the size of knowledge bases has grown, many semi-supervised and unsupervised methods have appeared. For example, Rong et al. [23] transform the entity matching problem into a binary classification problem. Wang et al. [40] present a factor-graph model to learn alignments across knowledge bases. For data integration across social networks, Zhang et al. [46] propose COSNET, an energy-based model that considers global and local consistency. Pellissier Tanon et al. [21] utilize existing hyperlinks and build an online platform for manual tagging.
Funding
  • This work is supported by the National Key R&D Program of China (2018YFB1402600), NSFC for Distinguished Young Scholars (61825602), NSFC (61836013), and the Tsinghua-Bosch Joint ML Center, Department of Computer Science and Technology, Tsinghua University.
Study subjects and analysis
published papers: 208,915,369
MAG taxonomy: The MAG taxonomy consists of 679,921 concepts collected from the Internet and 873,087 hypernym relations generated from co-occurrence across 208,915,369 published papers. Since some concepts in MAG are isolated, i.e., have no relations with any other concept, we filter them out, as sketched below.
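
A minimal sketch of that isolated-concept filtering step, assuming the taxonomy is available as concept IDs and hypernym edges (the variable names and toy data are hypothetical):

```python
# Keep only MAG concepts that participate in at least one hypernym relation.
concepts = {"c1", "c2", "c3", "c4"}            # hypothetical concept IDs
hypernym_edges = [("c1", "c2"), ("c2", "c3")]  # (child, parent) pairs

connected = set()
for child, parent in hypernym_edges:
    connected.add(child)
    connected.add(parent)

filtered = concepts & connected                # drops the isolated "c4"
print(sorted(filtered))                        # ['c1', 'c2', 'c3']
```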

sample pairs: 10,082
To systematically evaluate the proposed methodology, we design the following tasks. • Synonym Linking: We utilize the redirect links in EnWiki to build a dataset of 10,082 sample pairs, of which 5,041 are synonym concept pairs and the rest are negative pairs obtained by sampling similar terms. Following [31], we feed each pair of embeddings into a multi-layer neural network classifier that outputs a similarity score (a minimal sketch of such a classifier follows below).
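
A minimal sketch of such a pair classifier; the layer sizes and the way the two embeddings are combined are assumptions, since the text only states that the embedding pair feeds a multi-layer network that outputs a similarity score.

```python
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    """Scores whether two concept embeddings refer to the same concept."""
    def __init__(self, dim: int = 128, hidden: int = 256):
        super().__init__()
        # Concatenating the two embeddings is one common choice; the paper
        # does not specify the exact combination.
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, e1: torch.Tensor, e2: torch.Tensor) -> torch.Tensor:
        # Output a similarity score in (0, 1) per pair.
        return torch.sigmoid(self.net(torch.cat([e1, e2], dim=-1))).squeeze(-1)

clf = PairClassifier()
e1, e2 = torch.randn(4, 128), torch.randn(4, 128)
print(clf(e1, e2).shape)  # torch.Size([4]): one similarity score per pair
```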

samples: 20,100
Because the negative samples in the disambiguation task should involve only ambiguous concepts, Hit@K is a more objective metric than Prec./Rec./F1, for which negative samples are randomly selected and lead to artificially high results. • General Linking: We build a challenging dataset containing 20,100 samples in total, of which 70% are simple matching cases and 30% are synonym-concept cases, and harden it by sampling negative concepts whose semantic embeddings are similar to those of the positive pairs (the sketch after this paragraph illustrates the idea). The dataset is split into 60% for training, 20% for validation, and 20% for testing.
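
A sketch of that hard-negative idea: for each positive pair, pick negatives among the nearest neighbors of the true counterpart in embedding space. The neighbor count and the use of cosine similarity are assumptions, not details from the paper.

```python
import numpy as np

def hard_negatives(emb: np.ndarray, pos_idx: int, n_neg: int = 5) -> list:
    """Return the n_neg concepts closest to emb[pos_idx] in cosine
    similarity, excluding the positive itself, as hard negatives."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = e @ e[pos_idx]                  # cosine similarity to the positive
    order = np.argsort(-sims)              # most similar first
    return [int(i) for i in order if i != pos_idx][:n_neg]

emb = np.random.default_rng(1).standard_normal((1000, 64))
print(hard_negatives(emb, pos_idx=42))     # 5 semantically closest concepts
```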

Reference
  • A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795, 2013.
  • M. Chen, Y. Tian, M. Yang, and C. Zaniolo. Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. arXiv preprint arXiv:1611.03954, 2016.
  • K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341, 2019.
  • K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR, 2020.
  • J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In KDD, pages 601–610, 2014.
  • A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. TKDE, 19(1):1–16, 2007.
  • J. Feng, M. Huang, Y. Yang, and X. Zhu. GAKE: Graph aware knowledge embedding. In COLING, pages 641–651, 2016.
  • R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, volume 2, pages 1735–1742, 2006.
  • X. Han, S. Cao, X. Lv, Y. Lin, Z. Liu, M. Sun, and J. Li. OpenKE: An open toolkit for knowledge embedding. In EMNLP, pages 139–144, 2018.
  • K. Hassani and A. H. Khasahmadi. Contrastive multi-view representation learning on graphs. arXiv preprint arXiv:2006.05582, 2020.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
  • Q. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, pages 1188–1196, 2014.
  • J. Li, J. Tang, Y. Li, and Q. Luo. RiMOM: A dynamic multistrategy ontology alignment framework. IEEE TKDE, 21(8):1218–1232, 2008.
  • L. Li, J. Li, and H. Gao. Rule-based method for entity resolution. TKDE, 27(1):250–263, 2015.
  • L. Liu, W. K. Cheung, X. Li, and L. Liao. Aligning users across social networks using network embedding. In IJCAI, pages 1774–1780, 2016.
  • X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang. Self-supervised learning: Generative or contrastive. arXiv preprint arXiv:2006.08218, 2020.
  • T. Man, H. Shen, S. Liu, X. Jin, and X. Cheng. Predict anchor links across social networks via an embedding approach. In IJCAI, volume 16, pages 1823–1829, 2016.
  • M. Nickel, L. Rosasco, and T. Poggio. Holographic embeddings of knowledge graphs. arXiv preprint arXiv:1510.04935, 2015.
  • M. Nickel, V. Tresp, and H.-P. Kriegel. A three-way model for collective learning on multi-relational data. In ICML, volume 11, pages 809–816, 2011.
  • T. Pellissier Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The great migration. In WWW, pages 1419–1428, 2016.
  • J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and J. Tang. GCC: Graph contrastive coding for graph neural network pre-training. In KDD, pages 1150–1160, 2020.
  • S. Rong, X. Niu, E. Xiang, H. Wang, Q. Yang, and Y. Yu. A machine learning approach for instance matching based on similarity metrics. In ISWC, pages 460–475, 2012.
  • W. Shen, J. Wang, P. Luo, and M. Wang. Linking named entities in tweets with knowledge base via user interest modeling. In KDD, pages 68–76, 2013.
  • Z. Shen, H. Ma, and K. Wang. A web-scale system for scientific knowledge exploration. arXiv preprint arXiv:1805.12216, 2018.
  • A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. Hsu, and K. Wang. An overview of Microsoft Academic Service (MAS) and applications. In WWW, pages 243–246, 2015.
  • Z. Sun, W. Hu, and C. Li. Cross-lingual entity alignment via joint attribute-preserving embedding. In ISWC, pages 628–644, 2017.
  • J. Tang, J. Li, B. Liang, X. Huang, Y. Li, and K. Wang. Using Bayesian decision for ontology mapping. Journal of Web Semantics, 4(4):243–262, 2006.
  • J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: Extraction and mining of academic social networks. In KDD, pages 990–998, 2008.
  • Y. Tian, D. Krishnan, and P. Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
  • R. Trivedi, B. Sisman, X. L. Dong, C. Faloutsos, J. Ma, and H. Zha. LinkNBed: Multi-graph representation learning with entity linkage. In ACL, pages 252–262, 2018.
  • T. Trouillon, J. Welbl, S. Riedel, E. Gaussier, and G. Bouchard. Complex embeddings for simple link prediction. In ICML, 2016.
  • H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI, 2016.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In ICLR, 2018.
  • P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.
  • E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418, 2019.
  • T. Wang and P. Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. arXiv preprint arXiv:2005.10242, 2020.
  • Z. Wang, J. Li, Z. Wang, S. Li, M. Li, D. Zhang, Y. Shi, Y. Liu, P. Zhang, and J. Tang. XLore: A large-scale English-Chinese bilingual knowledge graph. In ISWC, volume 1035, pages 121–124, 2013.
  • Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In WWW, pages 459–468, 2012.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pages 3733–3742, 2018.
  • B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.
  • F. Zhang, X. Liu, J. Tang, Y. Dong, P. Yao, J. Zhang, X. Gu, Y. Wang, B. Shao, R. Li, et al. OAG: Toward linking large-scale heterogeneous entity graphs. In KDD, pages 2585–2595, 2019.
  • J. Zhang, B. Chen, X. Wang, H. Chen, C. Li, F. Jin, G. Song, and Y. Zhang. MEgo2Vec: Embedding matched ego networks for user alignment across social networks. In CIKM, pages 327–336, 2018.
  • Y. Zhang, J. Tang, Z. Yang, J. Pei, and P. Yu. COSNET: Connecting heterogeneous social networks with local and global consistency. In KDD, pages 1485–1494, 2015.
  • H. Zhu, R. Xie, Z. Liu, and M. Sun. Iterative entity alignment via joint knowledge embeddings. In IJCAI, pages 4258–4264, 2017.
About the authors
  • Li Mian received her bachelor's degree (2020) from the Department of Computer Science, Beijing Institute of Technology. She has been admitted to a graduate program at the Georgia Institute of Technology. Her research interests focus on data mining, natural language processing, and machine learning.
  • Yuxiao Dong received his Ph.D. in Computer Science from the University of Notre Dame in 2017. He is a senior researcher at Microsoft Research, Redmond. His research focuses on social networks, data mining, and graph representation learning.