We focus on the problem of concept linking between two public knowledge bases—the Open Academic Graph and the English Wikipedia.
OAG know: Self-supervised Learning for Linking Knowledge Graphs
We propose a self-supervised embedding learning framework—SelfLinKG—to link concepts in heterogeneous knowledge graphs. Without any labeled data, SelfLinKG can achieve competitive performance against its supervised counterpart, and significantly outperforms state-of-the-art unsupervised methods by 26%-50% under linear classification protocols.
- Concept linking, which aims to link concepts of the same meaning, is critical for document-based data systems such as academic search (AMiner, Microsoft Academic Graph) and question-answering platforms (Reddit, StackOverflow), where knowledge bases containing concepts and their relations are independently developed within each system to support complicated searching and reasoning.
- These knowledge bases are incomplete, and to complement each other via concept linking is important for advanced applications.
- The ambiguity issue, together with noise, makes the problem more severe.
- We focus on the problem of concept linking between two public knowledge bases—the Open Academic Graph (OAG) and the English Wikipedia.
- We propose SelfLinKG, a concept learning framework that handles the large-scale heterogeneous concept linking problem without an arduously expensive process for producing massive labeled data.
- We propose to leverage self-supervised learning to learn the intrinsic relations between concepts across the two knowledge bases, which helps mitigate the scalability issue in handling large-scale data.
- We conduct extensive experiments on OAG and Wikipedia, which suggest that SelfLinKG can achieve a very high accuracy of 97.33% in the real application, significantly outperforming baseline models.
- We propose a self-supervised model, SelfLinKG, for linking large-scale heterogeneous knowledge bases.
- Although many embedding-based entity alignment algorithms have emerged in recent years, most of them focus on supervised learning, so they are not fair baselines for the self-supervised SelfLinKG.
- Results of Unsupervised Linking and SelfLinKG
Table 1 shows the overall linking performance of the embedding-based methods on three tasks: Synonym Linking, Disambiguation, and General Linking.
- Results show that SelfLinKG consistently outperforms the alternatives (by 26%-33%) in every task.
- For Synonym Linking, the names of a positive concept pair are not identical, and are sometimes literally very different from each other.
- SelfLinKG performs best among all methods, with high recall, F1, and AUROC, which means that it omits only a small proportion of positive pairs, even those with entirely different names.
- Its comparatively low precision is probably because it misclassifies some negative concept pairs with similar names as correct.
- TransE performs comparatively well, while the other methods generally have low F1 scores of around 30%-40%.
- The authors propose a self-supervised model, SelfLinKG, for linking large-scale heterogeneous knowledge bases.
- SelfLinKG uses global momentum contrastive learning to learn a shared representation among multiple knowledge bases.
- The authors' experiments on two large-scale graphs show that the proposed unsupervised SelfLinKG can achieve performance comparable to its supervised counterpart.
- The authors apply the model to automatically generate links among 14 different knowledge bases and make the linked graphs publicly available.
- It would be interesting to design a knowledge-linking system that automatically harvests knowledge from the open Web, and exciting to explore novel methods to make the model more robust, as open data is always noisy.
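The global momentum contrastive learning mentioned above follows the MoCo recipe: a query encoder is trained with an InfoNCE loss against a queue of negative keys produced by a slowly updated momentum encoder. A minimal numpy sketch of this loop — the linear "encoders", dimensions, and hyperparameters here are illustrative placeholders, not the paper's attention-based encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, QUEUE, M, TAU = 16, 64, 0.999, 0.07  # embedding dim, queue size, momentum, temperature

# Toy "encoders": one linear layer each (the paper's encoder is attention-based).
W_q = rng.normal(size=(DIM, DIM))          # query encoder weights
W_k = W_q.copy()                           # key encoder starts as a copy
queue = rng.normal(size=(QUEUE, DIM))      # memory queue of negative keys
queue /= np.linalg.norm(queue, axis=1, keepdims=True)

def encode(W, x):
    z = x @ W
    return z / np.linalg.norm(z)           # L2-normalize the embedding

def info_nce(q, k_pos, negatives, tau=TAU):
    # InfoNCE: the positive pair competes against all queued negatives.
    logits = np.concatenate([[q @ k_pos], negatives @ q]) / tau
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[0])                   # cross-entropy with label 0 (the positive)

x = rng.normal(size=DIM)                   # a concept's subgraph features
x_aug = x + 0.05 * rng.normal(size=DIM)    # augmented view of the same concept

loss = info_nce(encode(W_q, x), encode(W_k, x_aug), queue)

# Momentum update: the key encoder slowly tracks the query encoder,
# keeping the queued keys consistent across iterations.
W_k = M * W_k + (1.0 - M) * W_q

# Dequeue the oldest key, enqueue the newest one.
queue = np.vstack([queue[1:], encode(W_k, x_aug)])
print(round(float(loss), 4))
```

In a real training loop the loss gradient would update `W_q` while `W_k` only moves through the momentum step; that asymmetry is what lets the queue hold many negatives without the keys going stale.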
- Table1: Results of linking performances under unsupervised settings
- Table2: Statistics for Disambiguation Task Dataset
- Baselines under the unsupervised setting: • ComplEx: uses vectors with complex values and retains the mathematical definition of the dot product. • HolE [19]: Holographic embeddings (HolE) learn compositional vector space representations of entire knowledge graphs; the method is related to holographic models of associative memory in that it employs circular correlation to create compositional representations. • DistMult [43]: This method focuses on neural-embedding models, where the representations are learned using neural networks with energy-based objectives. • GAKE [8]: This method formulates the knowledge base as a directed graph and learns representations for vertices and edges by leveraging the graph's structural information; three types of graph context are introduced (neighbor context, path context, and edge context), each reflecting properties of knowledge from a different perspective. • SelfLinKG: the self-supervised variant of SelfLinKG. We input the subgraph of a concept and train the shared encoder across MAG and EnWiki with only the instance-discrimination pre-training task, which is unsupervised. There is no embedding sharing in this setting, i.e., every single concept has its own unique embedding.
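As a rough illustration of how these embedding-based baselines score a triple, here is a minimal sketch of the TransE, DistMult, and HolE scoring functions; the embeddings and dimension below are toy random values, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
h, r, t = rng.normal(size=(3, d))   # head, relation, tail embeddings (toy values)

# TransE: a triple (h, r, t) holds when h + r is close to t;
# the score is the negative distance, so higher means more plausible.
def transe(h, r, t):
    return -float(np.linalg.norm(h + r - t))

# DistMult: a bilinear score with a diagonal relation matrix,
# i.e. a three-way element-wise product summed over dimensions.
def distmult(h, r, t):
    return float(np.sum(h * r * t))

# HolE: circular correlation of h and t (computed here via FFT),
# followed by a dot product with the relation embedding.
def hole(h, r, t):
    corr = np.fft.ifft(np.conj(np.fft.fft(h)) * np.fft.fft(t)).real
    return float(r @ corr)

# Sanity check: under TransE, a tail consistent with h + r should
# score higher than a random (corrupted) tail.
t_good = h + r + 0.01 * rng.normal(size=d)
t_bad = rng.normal(size=d)
print(transe(h, r, t_good) > transe(h, r, t_bad))
```

The FFT form of HolE's circular correlation is equivalent to the direct definition corr[k] = Σᵢ h[i]·t[(i+k) mod d], but runs in O(d log d).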
- Table3: Ablation study on momentum value m
- Table4: Ablation study on multi-head attention
- Table5: Overall Statistics of OAGknow. T: Taxonomy; E: Encyclopedia; K: Knowledge Graph
- Concept linking, closely related to entity linking, ontology alignment, schema matching, data integration, etc., has been studied for decades. Many approaches have been proposed to address this problem. For example, Li et al. argue for rule-based methods and develop a rule discovery algorithm. Tang et al. use machine learning and regard concept linking as minimizing the Bayesian risk of decision making. As the size of KBs increases, many semi-supervised or unsupervised methods have appeared. For example, Rong et al. transform the entity matching problem into a binary classification problem. Wang et al. present a factor graph model to learn the alignment across knowledge bases. For data integration across social networks, Zhang et al. propose COSNET, an energy-based model that considers global and local consistency. Pellissier et al. utilize existing hyperlinks and build an online platform for manual tagging.
- The work is supported by the National Key R&D Program of China (2018YFB1402600), NSFC for Distinguished Young Scholar (61825602), NSFC (61836013), and the Tsinghua-Bosch Joint ML Center, Department of Computer Science and Technology, Tsinghua University.
Study subjects and analysis
published papers: 208,915,369
MAG taxonomy: The MAG taxonomy consists of 679,921 concepts collected from the Internet and 873,087 hypernym relations generated according to co-occurrence across 208,915,369 published papers. As some concepts in MAG are isolated (they have no relations with the other concepts), we filter them out.
sample pairs: 10,082
To systematically evaluate the proposed methodology, we design the following four tasks: • Synonym Linking: We utilize the redirect links in EnWiki to build a dataset consisting of 10,082 sample pairs, among which 5,041 are synonym concept pairs and the rest are negative sample pairs obtained by sampling similar terms. Following , we put the pair of embeddings into a multi-layer neural network as a classifier to output the similarity score
Because in the disambiguation task the negative samples should involve only ambiguous ones, Hit@K is a more objective metric than Prec./Rec./F1, where negative samples are randomly selected and lead to an artificially high result. • General Linking: We build a challenging dataset containing both simple matching cases and synonym cases, 20,100 samples altogether, including 70% simple matching cases and 30% synonym concept cases, and harden it by sampling negative concepts whose semantic embeddings are similar to those of the positive pairs. The dataset is split 60% for training, 20% for validation, and 20% for testing
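The pairwise classifier and the Hit@K metric described above can be sketched as follows; the MLP weights here are untrained random values and the layer sizes are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8

# Pairwise classifier sketch: concatenate the two concept embeddings and
# pass them through a small MLP that outputs a similarity score in (0, 1).
W1, b1 = rng.normal(size=(2 * d, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def similarity(e1, e2):
    hidden = np.maximum(np.concatenate([e1, e2]) @ W1 + b1, 0.0)  # ReLU layer
    return float(1.0 / (1.0 + np.exp(-(hidden @ W2 + b2))))       # sigmoid score

# Hit@K for disambiguation: the true concept is ranked only against its
# ambiguous candidates; a query counts as a hit when the true match
# lands in the top K scores.
def hit_at_k(query, true_emb, candidate_embs, k):
    scores = [similarity(query, c) for c in [true_emb] + candidate_embs]
    rank = sorted(scores, reverse=True).index(scores[0]) + 1
    return rank <= k

q = rng.normal(size=d)
true_emb = rng.normal(size=d)
cands = [rng.normal(size=d) for _ in range(4)]
print(hit_at_k(q, true_emb, cands, k=5))  # 5 candidates total, so always True
```

Restricting the candidate set to genuinely ambiguous concepts is what makes Hit@K harder, and more informative, than Prec./Rec./F1 computed against random negatives.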
- A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multirelational data. In NIPS, pages 2787–2795, 2013.
- M. Chen, Y. Tian, M. Yang, and C. Zaniolo. Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. arXiv preprint arXiv:1611.03954, 2016.
- K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341, 2019.
- K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning. Electra: Pretraining text encoders as discriminators rather than generators. In ICLR, 2019.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 601–610, 2014.
- A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. TKDE, 19(1):1–16, 2007.
- J. Feng, M. Huang, Y. Yang, and X. Zhu. Gake: Graph aware knowledge embedding. In COLING 2016: Technical Papers, pages 641–651, 2016.
- R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
- X. Han, S. Cao, X. Lv, Y. Lin, Z. Liu, M. Sun, and J. Li. Openke: An open toolkit for knowledge embedding. In EMNLP, pages 139–144, 2018.
- K. Hassani and A. H. Khasahmadi. Contrastive multi-view representation learning on graphs. arXiv preprint arXiv:2006.05582, 2020.
- K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
- Q. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, pages 1188–1196, 2014.
- J. Li, J. Tang, Y. Li, and Q. Luo. Rimom: A dynamic multistrategy ontology alignment framework. IEEE TKDE, 21(8):1218–1232, 2008.
- L. Li, J. Li, and H. Gao. Rule-based method for entity resolution. TKDE, 27(1):250–263, 2015.
- L. Liu, W. K. Cheung, X. Li, and L. Liao. Aligning users across social networks using network embedding. In Ijcai, pages 1774– 1780, 2016.
- X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang. Self-supervised learning: Generative or contrastive. arXiv preprint, 2020.
- T. Man, H. Shen, S. Liu, X. Jin, and X. Cheng. Predict anchor links across social networks via an embedding approach. In Ijcai, volume 16, pages 1823–1829, 2016.
- N. Maximilian, R. Lorenzo, and P. Tomaso. Holographic embeddings of knowledge graphs. arXiv preprint arXiv:1510.04935, 2015.
- M. Nickel, V. Tresp, and H.-P. Kriegel. A three-way model for collective learning on multi-relational data. In Icml, volume 11, pages 809–816, 2011.
- T. Pellissier Tanon, D. Vrandecic, S. Schaffert, T. Steiner, and L. Pintscher. From freebase to wikidata: The great migration. In WWW, pages 1419–1428, 2016.
- J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and J. Tang. Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1150–1160, 2020.
- S. Rong, X. Niu, E. Xiang, H. Wang, Q. Yang, and Y. Yu. A machine learning approach for instance matching based on similarity metrics. In ISWC’12, pages 460–475, 2012.
- W. Shen, J. Wang, P. Luo, and M. Wang. Linking named entities in tweets with knowledge base via user interest modeling. In KDD’13, pages 68–76, 2013.
- Z. Shen, H. Ma, and K. Wang. A web-scale system for scientific knowledge exploration. arXiv preprint arXiv:1805.12216, 2018.
- A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. Hsu, and K. Wang. An overview of microsoft academic service (mas) and applications. In Proceedings of the 24th WWW, pages 243–246, 2015.
- Z. Sun, W. Hu, and C. Li. Cross-lingual entity alignment via joint attribute-preserving embedding. In ISWC, pages 628–644.
- J. Tang, J. Li, B. Liang, X. Huang, Y. Li, and K. Wang. Using bayesian decision for ontology mapping. Journal of Web Semantics, 4(4):243–262, 2006.
- J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. Arnetminer: extraction and mining of academic social networks. In SIGKDD, pages 990–998, 2008.
- Y. Tian, D. Krishnan, and P. Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
- R. Trivedi, B. Sisman, X. L. Dong, C. Faloutsos, J. Ma, and H. Zha. Linknbed: Multi-graph representation learning with entity linkage. In ACL, pages 252–262, 2018.
- T. Trouillon, J. Welbl, S. Riedel, E. Gaussier, and G. Bouchard. Complex embeddings for simple link prediction. ICML, 2016.
- H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double q-learning. In AAAI, 2016.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph Attention Networks. ICLR, 2018.
- P. Velickovic, W. Fedus, W. L. Hamilton, P. Lio, Y. Bengio, and R. D. Hjelm. Deep graph infomax. arXiv:1809.10341, 2018.
- E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418, 2019.
- T. Wang and P. Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. arXiv preprint arXiv:2005.10242, 2020.
- Z. Wang, J. Li, Z. Wang, S. Li, M. Li, D. Zhang, Y. Shi, Y. Liu, P. Zhang, and J. Tang. Xlore: A large-scale english-chinese bilingual knowledge graph. In ISWC, volume 1035, pages 121–124, 2013.
- Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In WWW’12, pages 459–468, 2012.
- T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew. Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771, 2019.
- Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pages 3733–3742, 2018.
- B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.
- F. Zhang, X. Liu, J. Tang, Y. Dong, P. Yao, J. Zhang, X. Gu, Y. Wang, B. Shao, R. Li, et al. Oag: Toward linking large-scale heterogeneous entity graphs. In SIGKDD, pages 2585–2595, 2019.
- J. Zhang, B. Chen, X. Wang, H. Chen, C. Li, F. Jin, G. Song, and Y. Zhang. Mego2vec: Embedding matched ego networks for user alignment across social networks. In CIKM, pages 327–336, 2018.
- Y. Zhang, J. Tang, Z. Yang, J. Pei, and P. Yu. Cosnet: Connecting heterogeneous social networks with local and global consistency. In KDD’15, pages 1485–1494, 2015.
- H. Zhu, R. Xie, Z. Liu, and M. Sun. Iterative entity alignment via joint knowledge embeddings. In IJCAI, pages 4258–4264, 2017.

Li Mian received her bachelor's degree (2020) from the Department of Computer Science, Beijing Institute of Technology. She is now admitted into a graduate program at the Georgia Institute of Technology. Her research interests focus on data mining, natural language processing, and machine learning.
Yuxiao Dong received his Ph.D. in Computer Science from the University of Notre Dame in 2017. He is a senior researcher at Microsoft Research Redmond. His research focuses on social networks, data mining, and graph representation learning.