ArnetMiner: extraction and mining of academic social networks

KDD, pp. 990-998, 2008.

Keywords:
modeling result, search service, academic network, people association search, unified modeling approach

Abstract:

This paper addresses several key issues in the ArnetMiner system, which aims at extracting and mining academic social networks. Specifically, the system focuses on: 1) Extracting researcher profiles automatically from the Web; 2) Integrating the publication data into the network from existing digital libraries; 3) Modeling the entire acad...

Introduction
  • Extraction and mining of academic social networks aims at providing comprehensive services in the scientific research field.
  • In an academic social network, people are interested not only in searching for different types of information but also in finding semantics-based information.
  • Two problems remain: 1) The social information obtained from user-entered profiles or by extraction using heuristics is sometimes incomplete or inconsistent; 2) There is no unified approach to efficiently model the academic network.
  • Because different types of information in the academic network were modeled individually, dependencies between them cannot be captured accurately.
Highlights
  • Extraction and mining of academic social networks aims at providing comprehensive services in the scientific research field
  • Compared with the previous topic modeling work, in this paper, we propose a unified topic model to simultaneously model the topical aspects of different types of information in the academic network
  • We describe the architecture and the main features of the ArnetMiner system
  • We further propose a unified topic model to simultaneously model the different types of information in the academic network
  • The modeling results have been applied to expertise search and association search
Conclusion
  • The authors describe the architecture and the main features of the ArnetMiner system.
  • The authors propose a unified tagging approach to researcher profiling.
  • About half a million researcher profiles have been extracted into the system.
  • The system has integrated more than one million papers.
  • The authors propose a probabilistic framework to deal with the name ambiguity problem in the integration.
  • The authors further propose a unified topic model to simultaneously model the different types of information in the academic network.
  • The authors conduct experiments for evaluating each of the proposed approaches.
  • Experimental results indicate that the proposed methods achieve high performance.
Tables
  • Table 1: Content features, pattern features, and term features
  • Table 2: Relationships between papers
  • Table 3: Data set for name disambiguation
  • Table 4: Results on name disambiguation (%)
  • Table 5: Five topics discovered by ACT1 on the ArnetMiner data. Each topic is shown with the top 8 words and their corresponding probabilities. Top 6 authors and top 6 conferences are shown with each topic. The titles are our interpretation of the topics
  • Table 6: Performance of six expertise search approaches (%)
  • Table 7: Top 5 representative words and top 5 authors associated with two conferences found by ACT1
  • Table 8: Top 5 representative words and top 5 conferences associated with two researchers found by ACT1
Related work
  • 2.1 Person Profile Extraction

    Several research efforts have been made to extract person profiles. For example, Yu et al. [32] propose a two-stage extraction method for identifying personal information from resumes. The first stage segments a resume into different types of blocks, and the second stage extracts detailed information such as Address and Email from the identified blocks. However, the method formalizes profile extraction as several separate steps and conducts extraction in a more or less ad hoc manner.

    A few efforts have also addressed the extraction of contact information from emails or from the Web. For example, Kristjansson et al. [19] have developed an interactive information extraction system that assists the user in populating a contact database from emails. In comparison, profile extraction comprises contact information extraction as well as several other subtasks.
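
    The two-stage idea described above (segment first, then extract fields from the identified blocks) can be illustrated with a toy sketch. This is not the method of Yu et al. [32]; the heading list, the regular expression, and the function names `segment_blocks` and `extract_contact` are assumptions made purely for illustration.

```python
import re

# Hypothetical block headings; a real system would learn the segmentation.
BLOCK_HEADINGS = {"contact", "education"}

# Deliberately simple email pattern, an assumption for this sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def segment_blocks(text):
    """Stage 1: group non-empty lines under the most recent heading line."""
    blocks = {}
    current = "other"
    for line in text.splitlines():
        stripped = line.strip()
        key = stripped.rstrip(":").lower()
        if key in BLOCK_HEADINGS:
            current = key
            continue
        if stripped:
            blocks.setdefault(current, []).append(stripped)
    return blocks


def extract_contact(blocks):
    """Stage 2: extract fine-grained fields from the contact block only."""
    fields = {}
    for line in blocks.get("contact", []):
        match = EMAIL_RE.search(line)
        if match:
            fields["email"] = match.group(0)
        elif line.lower().startswith("address:"):
            fields["address"] = line.split(":", 1)[1].strip()
    return fields
```

    Because stage 2 only sees the output of stage 1, an error in segmentation propagates to the extracted fields, which is one way to read the criticism that the steps are handled separately.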
Funding
  • The work is supported by the National Natural Science Foundation of China (90604025, 60703059), Chinese National Key Foundation Research and Development Plan (2007CB310803), and Chinese Young Faculty Research Funding (20070003093)
  • It is also supported by IBM Innovation funding
Study subjects and analysis
cases: 1325
A statistical study also reveals that (strong) dependencies exist between different profile properties. For example, there are 1,325 cases (14.54%) in our data in which the extraction needs to use the extraction results of other properties. An ideal method should process all the subtasks holistically

papers: 2
We define five types of relationships between papers (Table 2). Relationship r1 means that two papers are published at the same venue. Relationship r2 means that two papers have a secondary author with the same name, and relationship r3 means that one paper cites the other. Relationship r4 indicates a constraint-based relationship supplied via user feedback; for instance, the user can specify that two specific papers should be assigned to the same person. Relationship r5 is explained in the paper through an example
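
As a rough illustration of relationships r1 to r4, the sketch below checks which of them hold between two paper records. The `Paper` record, the `relationships` function, and the reading of "secondary author" as any author other than the ambiguous name being resolved are assumptions, not the paper's implementation. Relationship r5 is left out because it is not described in this excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    # Hypothetical record layout, chosen only for this sketch.
    pid: str
    venue: str
    authors: tuple                  # author name strings
    cites: frozenset = frozenset()  # ids of the papers this paper cites


def relationships(p, q, target_name, user_constraints=frozenset()):
    """Return the labels among r1..r4 that hold between papers p and q.

    target_name is the ambiguous author name under disambiguation.
    user_constraints holds frozenset({pid_a, pid_b}) pairs that a user
    has marked as belonging to the same person.
    """
    rels = set()
    if p.venue == q.venue:
        rels.add("r1")  # published at the same venue
    secondary_p = set(p.authors) - {target_name}
    secondary_q = set(q.authors) - {target_name}
    if secondary_p & secondary_q:
        rels.add("r2")  # share a secondary author with the same name
    if q.pid in p.cites or p.pid in q.cites:
        rels.add("r3")  # one paper cites the other
    if frozenset((p.pid, q.pid)) in user_constraints:
        rels.add("r4")  # constraint supplied via user feedback
    return rels
```

For two papers at the same venue that share a co-author and where one cites the other, the function returns `{"r1", "r2", "r3"}`; how such relationships are weighted in the probabilistic framework is not covered here.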

persons: 14134
We collected a list of the most frequent queries from the log of ArnetMiner for evaluation. We conducted experiments on a subset of the data (including 14,134 persons, 10,716 papers, and 1,434 conferences) from ArnetMiner. For evaluation, we used the method of pooled relevance judgments [10] together with human judgments

faculty members: 2
Specifically, for each query, we first pooled the top 30 results from three similar systems (Libra, Rexa, and ArnetMiner). Then, two faculty members and five graduate students from CS provided human judgments. Four-grade scores (3, 2, 1, and 0) were assigned respectively representing definite expertise, expertise, marginal expertise, and no expertise
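
The pooling step can be sketched as follows, as a minimal illustration rather than the authors' evaluation code. The function name `pool_results` and the dictionary layout are assumptions; the pool depth of 30 and the four grade labels come from the text above. How the seven judges' scores were combined is not stated here, so no aggregation is shown.

```python
# Four-grade relevance scale described in the text.
GRADE_LABELS = {
    3: "definite expertise",
    2: "expertise",
    1: "marginal expertise",
    0: "no expertise",
}


def pool_results(rankings, depth=30):
    """Merge the top `depth` results of every system into one judging pool.

    rankings maps a system name to its ranked result list for one query.
    The pool keeps first-seen order and contains no duplicates.
    """
    pooled = []
    seen = set()
    for ranked in rankings.values():
        for item in ranked[:depth]:
            if item not in seen:
                seen.add(item)
                pooled.append(item)
    return pooled
```

Each pooled item would then be shown to the judges once, regardless of how many systems returned it, and assigned one of the grades in `GRADE_LABELS`.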

papers: 1000000
About a half million researcher profiles have been extracted into the system. The system has also integrated more than one million papers. We propose a probabilistic framework to deal with the name ambiguity problem in the integration

References
  • L. A. Adamic and E. Adar. How to search a social network. Social Networks, 27:187–203, 2005.
  • C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.
  • R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.
  • K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert finding in enterprise corpora. In Proc. of SIGIR’06, pages 43–55, 2006.
  • S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In Proc. of KDD’04, pages 59–68, 2004.
  • R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In Proc. of WWW’05, pages 463–470, 2005.
  • D. M. Blei and J. D. McAuliffe. Supervised topic models. In Proc. of NIPS’07, 2007.
  • D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
  • D. Brickley and L. Miller. FOAF vocabulary specification. In Namespace Document, http://xmlns.com/foaf/0.1/, September 2004.
  • C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In Proc. of SIGIR’04, pages 25–32, 2004.
  • F. Ciravegna. An adaptive algorithm for information extraction from web-related texts. In Proc. of IJCAI’01 Workshop, August 2001.
  • C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
  • N. Craswell, A. P. de Vries, and I. Soboroff. Overview of the TREC-2005 enterprise track. In TREC’05, pages 199–205, 2005.
  • H. Han, L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In Proc. of JCDL’04, pages 296–305, 2004.
  • H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a k-way spectral clustering method. In Proc. of JCDL’05, pages 334–343, 2005.
  • T. Hofmann. Collaborative filtering via Gaussian probabilistic latent semantic analysis. In Proc. of SIGIR’03, pages 259–266, 2003.
  • T. Hofmann. Probabilistic latent semantic indexing. In Proc. of SIGIR’99, pages 50–57, 1999.
  • H. Kautz, B. Selman, and M. Shah. Referral web: Combining social networks and collaborative filtering. Communications of the ACM, 40(3):63–65, 1997.
  • T. Kristjansson, A. Culotta, P. Viola, and A. McCallum. Interactive information extraction with constrained conditional random fields. In Proc. of AAAI’04, 2004.
  • J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML’01, 2001.
  • A. McCallum. Multi-label text classification with a mixture model trained by EM. In Proc. of AAAI’99 Workshop, 1999.
  • D. Mimno and A. McCallum. Expertise modeling for matching papers with reviewers. In Proc. of KDD’07, pages 500–509, 2007.
  • T. Minka. Estimating a Dirichlet distribution. Technical Report, http://research.microsoft.com/minka/papers/dirichlet/, 2003.
  • Z. Nie, Y. Ma, S. Shi, J.-R. Wen, and W.-Y. Ma. Web object retrieval. In Proc. of WWW’07, pages 81–90, 2007.
  • M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proc. of UAI’04, 2004.
  • M. Steyvers, P. Smyth, and T. Griffiths. Probabilistic author-topic models for information discovery. In Proc. of SIGKDD’04, 2004.
  • Y. F. Tan, M.-Y. Kan, and D. Lee. Search engine driven author disambiguation. In Proc. of JCDL’06, pages 314–315, 2006.
  • J. Tang, D. Zhang, and L. Yao. Social network extraction of academic researchers. In Proc. of ICDM’07, pages 292–301, 2007.
  • X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In Proc. of SIGIR’06, pages 178–185, 2006.
  • E. Xun, C. Huang, and M. Zhou. A unified statistical model for the identification of English baseNP. In Proc. of ACL’00, 2000.
  • X. Yin, J. Han, and P. Yu. Object distinction: Distinguishing objects with identical names. In Proc. of ICDE’07, pages 1242–1246, 2007.
  • K. Yu, G. Guan, and M. Zhou. Resume information extraction with cascaded hybrid model. In Proc. of ACL’05, pages 499–506, 2005.