AMiner Dataset

Name | Node | Edge | Description
COVID-19 Open Datasets | 189 datasets | | Provide comprehensive open datasets about COVID-19 from around the world
Citation | 1572277 papers | 2084019 citation relationships | Citation network
Academic Social Network | 2,092,356 papers / 1,712,433 authors | 8,024,869 citation relationships / 4,258,615 coauthor relationships | Citation and coauthor networks
Advisor-advisee | 4794 authors | 2164 advisor-advisee, 3932 coauthor relationships | Advisor-advisee network
Topic-coauthor | 640134 authors of 8 topics | 1554643 coauthor relationships | Topic based coauthor network
Topic-paper-author | 33739 authors of 5 topics | 139278 coauthor relationships | Created for cross domain recommendation
Topic-citation | 2329760 papers | 12710347 citation relationships | Topic based citation network
Kernel community | 8000 papers of 27 conferences | | Created for community detection
Dynamic coauthor | 1629217 authors | 2623832 coauthor relationships | An evolving coauthor network with 27 time stamps
Research profiling | 898 files | | Created for researcher profile extraction
 | 155 citation pairs | | Created to study the semantics of the citation relationships
Expert Finding | 1781 experts of 13 topics | | A benchmark for expert finding
 | 8369 author pairs of 9 topics | | Created for association search
Topic model results for Arnetminer dataset | Top 1000000 papers and authors of 200 topics | | The results of ACT model on AMiner dataset
Coauthor | 1560640 authors | 4258946 coauthor relationships | Coauthor network
Disambiguation | 110 authors and their affiliations/papers | | (a) 6,730 papers for 100 author names; (b) 1,085 Web pages for 12 person names; (c) 755 ambiguous entities appearing in 20 news pages
Web User Profiling | emails of 2,000 people and gender of 2,400 people | | Created for web user profiling
Career Trajectory | 57,037 persons and 42,230 affiliations | | Created for studying career trajectories of scholars
Network Integration | two data collections: SNS and Academic | | Created for network integration
Open Academic Graph | 166,192,182 papers from MAG, 154,771,162 papers from AMiner, and 64,639,608 linking (matching) relations | | Created for studying the integration of multiple academic graphs
Name Disambiguation | 23,823 names and 83,980 persons | | Created for studying author name disambiguation
Science Knowledge Graph | 908 concepts, 206,240 experts and 512,698 publications | | A knowledge graph consisting of concepts, experts, and papers in Computer Science
Knowledge Graph for AI | 130,750 scholars, 343,746 scholarly articles, 229,937 specialties from 103 conferences | |
AMiner Knowledge Graph | 100,000 tags, 318,406 scholars, 63,068 organizations and 23,709 venues | | A structured entity network extracted from AMiner
Knowledge Graph for Data Mining | 23 second-level nodes, 309 third-level nodes | | Knowledge Graph for Data Mining
Knowledge Graph for Knowledge Graph | 11 second-level nodes, 212 third-level nodes | | Knowledge Graph for Knowledge Graph
Top 10000 Scholars' Trajectories | 9992 experts with the greatest h-index in AMiner | | Trajectories of 9992 experts with the greatest h-index in AMiner since 1978
Knowledge Graph for Machine Learning | Eight-level knowledge graph for machine learning | | Knowledge Graph for Machine Learning
External dataset

COVID-19 Open Datasets  

[Download]

In the fight against the COVID-19 pandemic, open and comprehensive data may help researchers, officials, medical staff, and the public better understand the virus and the pandemic. The team has been collecting all kinds of open datasets about COVID-19 and updates them every day. The datasets cover the pandemic itself, research, knowledge graphs, media reports, and so on.

Citation  

[Download]

The data set is designed for research purposes only. The citation data is extracted from DBLP, ACM, and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with an abstract, authors, year, venue, and title.

The data set can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.

A larger version will be released soon.

Academic Social Network  

[Download]

This data set includes paper information, paper citations, author information, and author collaborations. 2,092,356 papers and the 8,024,869 citations between them are saved in the file AMiner-Paper.rar; 1,712,433 authors are saved in the file AMiner-Author.zip; and 4,258,615 collaboration relationships are saved in the file AMiner-Coauthor.zip.

Advisor-advisee  

[Download]

This data set contains 6 different networks: Epinions, Slashdot, MobileU, MobileD, Coauthor, and Enron.

  • Epinions  is a network of product reviewers. Each user on the site can post a review of any product, and other users rate the review with trust or distrust. The data set consists of 131,828 users and 841,372 relationships, of which about 85.0% are trust relationships. 80,668 users received at least one trust or distrust relationship.
  • Slashdot  is a network of friends. Slashdot is a site for sharing technology-related news. The data set comprises 77,357 users and 516,575 relationships, of which 76.7% are "friend" relationships.
  • MobileU  is a network of mobile users. It consists of the call logs, Bluetooth scanning data, and cell tower IDs of 107 users over about ten months. In total, the data contains 5,436 relationships.
  • MobileD  is a relatively larger enterprise mobile network, where nodes are employees of a company and relationships are formed by the calls and short messages they exchanged over a few months. In total, there are 232 users (50 managers and 182 ordinary employees) and 3,567 relationships (including calls and text messages) between the users.
  • Coauthor  is a network of authors. The data set, crawled from Arnetminer.org, comprises 815,946 authors and 2,792,833 coauthor relationships.
  • Enron  is an email communication network. It consists of 136,329 emails between 151 Enron employees. Two types of relationships, manager-subordinate and colleague, were annotated between these employees. There are 3,572 relationships in total, of which 133 are manager-subordinate relationships.

Topic-coauthor  

[Download]

The coauthor set consists of authors and coauthor relationships chosen from ArnetMiner. The dataset covers 8 topics:

  • Topic 16: Data Mining / Association Rules

  • Topic 107: Web Services

  • Topic 131: Bayesian Networks / Belief function

  • Topic 144: Web Mining / Information Fusion

  • Topic 145: Semantic Web / Description Logics

  • Topic 162: Machine Learning

  • Topic 24: Database Systems / XML Data

  • Topic 75: Information Retrieval.

Topic-paper-author  

[Download]

The dataset is collected for the purpose of cross domain recommendation.

  • Data Mining:   We use papers from the following data mining conferences as ground truth: KDD, SDM, ICDM, WSDM and PKDD, which results in a network with 6,282 authors and 22,862 co-author relationships.
  • Medical Informatics:  We include the following journals: Journal of the American Medical Informatics Association, Journal of Biomedical Informatics, Artificial Intelligence in Medicine, IEEE Trans. Med. Imaging, and IEEE Transactions on Information Technology in Biomedicine, from which we obtain a network of 9,150 authors and 31,851 coauthor relationships.
  • Theory:  We include the following conferences, i.e., STOC, FOCS and SODA, from which we get 5,449 authors and 27,712 co-author relationships.
  • Visualization:  We include the following conferences and journals, CVPR, ICCV, VAST, TVCG, IEEE Visualization and Information Visualization. The obtained coauthor network is comprised of 5,268 authors and 19,261 co-author relationships.
  • Database:  We include the following conferences, i.e., SIGMOD, VLDB and ICDE. From those conferences, we extract 7,590 authors and 37,592 co-author relationships.

Topic-citation  

[Download]

The citation network consists of papers and citation relationships chosen from ArnetMiner. The raw citation data consists of 2555 papers and 6101 citation relationships. The papers are mainly from 10 research fields:

  • Topic 16: Data Mining / Association Rules

  • Topic 107: Web Services

  • Topic 131: Bayesian Networks / Belief function

  • Topic 144: Web Mining / Information Fusion

  • Topic 145: Semantic Web / Description Logics

  • Topic 162: Machine Learning

  • Topic 24: Database Systems / XML Data

  • Topic 75: Information Retrieval

  • Topic 182: Pattern recognition / Image analysis

  • Topic 199: Natural Language System / Statistical Machine Translation.

Kernel community  

[Download]

This data set includes three different real-world social networks:

  • Coauthor  (a co-authorship network with 822,415 nodes and 2,928,360 undirected edges). Each vertex represents an author and each edge represents a co-author relation.
  • Wikipedia  (a co-editorship network with 310,990 nodes and 10,780,996 undirected edges crawled from Wikipedia.org). Each vertex represents a Wikipedia editor and each edge represents a co-editing relation.
  • Twitter  (a following network with 465,023 nodes and 833,590 directed edges crawled from twitter.com). Each vertex represents a Twitter user account and each edge represents a following relation. It is well-known that the web displays a bow-tie structure [20], where 30% of the vertices are strongly connected. We conduct a bow-tie analysis on the Twitter network, and discover that only 8% (38,913) of the vertices are strongly connected.
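
The strongly connected share reported for the Twitter network can be reproduced in spirit with a sketch like the one below. It assumes that "strongly connected" refers to the largest strongly connected component, and that the edges are available as (source, target) ID pairs; the function name and input format are illustrative and not part of the released data or tooling. The sketch uses a standard two-pass (Kosaraju-style) search written iteratively so that it does not hit Python's recursion limit on large graphs.

```python
from collections import defaultdict

def largest_scc_fraction(edges):
    """Return the share of vertices in the largest strongly connected component."""
    graph, reverse = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        nodes.update((u, v))
    if not nodes:
        return 0.0

    # Pass 1: iterative DFS on the original graph, recording finish order.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: search the reversed graph in decreasing finish order;
    # every search tree found here is one strongly connected component.
    assigned, largest = set(), 0
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        size, stack = 0, [start]
        while stack:
            node = stack.pop()
            size += 1
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        largest = max(largest, size)
    return largest / len(nodes)
```

For the figures quoted above, 38,913 of 465,023 vertices is roughly 8.4%, which matches the reported 8% after rounding.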

Dynamic coauthor  

[Download]

We construct the evolving coauthor network from ArnetMiner. We collected 1,768,776 publications published between 1986 and 2012, involving 1,629,217 authors. We regard each year as a time stamp, so there are 27 time stamps in total. At each time stamp, we create an edge between two authors if they have coauthored at least one paper in the most recent 3 years (including the current year). We convert the undirected coauthor network into a directed network by treating each undirected edge as two symmetric directed edges.
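
The construction described above can be sketched as follows. This is a minimal illustration, not the code used to build the released data: the input format (a list of `(year, authors)` tuples) and the function name `build_snapshots` are assumptions made for the example.

```python
from collections import defaultdict
from itertools import permutations

def build_snapshots(papers, first_year=1986, last_year=2012, window=3):
    """Map each year to the set of directed coauthor edges at that time stamp."""
    # Index every ordered author pair by publication year.
    pairs_by_year = defaultdict(set)
    for year, authors in papers:
        # permutations yields both (a, b) and (b, a), i.e. the two
        # symmetric directed edges for each undirected coauthor edge.
        pairs_by_year[year].update(permutations(set(authors), 2))

    snapshots = {}
    for year in range(first_year, last_year + 1):
        edges = set()
        # The window covers the current year and the (window - 1) years before it.
        for y in range(year - window + 1, year + 1):
            edges |= pairs_by_year.get(y, set())
        snapshots[year] = edges
    return snapshots
```

With the default arguments the result has one snapshot per year from 1986 to 2012, which gives the 27 time stamps mentioned above. A paper published in 1986 contributes edges to the 1986, 1987, and 1988 snapshots and drops out in 1989.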

Research profiling  

[Download]

We are developing extraction tools in ArnetMiner, a researcher social network system. The tools extract researcher profiles from Web pages and output the extracted information into a researcher database.

The data set and related documents are used for researcher profile extraction.

Expert Finding  

[Download]

We have collected topics and their related people lists from as many sources as possible. We randomly chose 13 topics and created 13 people lists. The data sets were used as the “golden metric” for expert finding. They were also used to create the test sets for association search. The following table shows the 13 topics and statistics of people we have collected. In the 13 topics, OA and SW are from PC members of the related conferences or workshops. DM is from a list of data mining people organized by kmining.com. IE is from a list of information extraction researchers that were collected by Muslea. BS and SVM are from their official web sites, respectively. PL, IA, ML, and NLP are from a page organized by Russell and Norvig, which links to 849 pages around the web with information on Artificial Intelligence.

Topic model results for Arnetminer dataset  

[Download]
  • Aminer Author Name and ID

    A mapping between the names and IDs of authors in Arnetminer. The data is a two-column list: the first column is the Arnetminer ID and the second column is the author name.

  • Aminer Topic Top 5000 Publications and Authors

    The top 5000 publications of each topic in Arnetminer. The data consists of 3 XML files, holding the topics, publications, and authors respectively.

  • ACTMaps Author Topic

    The topic distribution of each author. The data is organized into 733602 rows, one per author. Each row consists of columns separated by a blank space, and each column is a topic id and a weight separated by a ":".

  • Aminer FOAF Data Set

    The FOAF data of authors in arnetminer.org. The data is organized in standard FOAF format.
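
As a rough guide to reading the ACTMaps Author Topic file described above, the sketch below parses one row of space-separated `topic_id:weight` columns. It assumes that topic ids are integers and weights are decimal numbers, and it skips any column that does not contain a ":"; the function names are hypothetical and the exact file layout should be checked against the downloaded data.

```python
def parse_topic_row(line):
    """Parse one row of 'topic_id:weight' columns into a {topic_id: weight} dict."""
    distribution = {}
    for column in line.split():
        topic_id, sep, weight = column.partition(":")
        if not sep:
            continue  # assumption: ignore columns that are not topic:weight
        distribution[int(topic_id)] = float(weight)
    return distribution

def top_topics(distribution, k=3):
    """Return the k topic ids with the largest weights, heaviest first."""
    return sorted(distribution, key=distribution.get, reverse=True)[:k]
```

For example, `parse_topic_row("16:0.5 107:0.25 162:0.125")` yields a dictionary keyed by topic ids 16, 107, and 162, and `top_topics` on that dictionary with `k=2` returns the ids 16 and 107.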

Coauthor  

[Download]

This data set contains 5 files:

  • AMiner-Author.zip

    This file saves the author information.

  • AMinerCoauthor.graph.zip

    This file saves the collaboration network among the authors in the first file.

  • AMinerCoauthor.dict.zip

    This file saves the mapping from the index in the first file to the ID in the second file.

  • AMinerCoauthor.panther.zip

    This file saves the top-50 similar authors and the corresponding similarities for each author in the above AMiner coauthor network output by the method Panther. The line number denotes the ID of the author whose similar authors are presented.

  • AMinerCoauthor.panther++.zip

    This file saves the top-50 similar authors and the corresponding Euclidean distances for each author in the above AMiner coauthor network output by the method Panther++. The line number denotes the ID of the author whose similar authors are presented.

Disambiguation  

[Download]

This data set is used for studying name disambiguation in digital libraries. It contains 110 author names and their disambiguation results (ground truth). Each author name corresponds to a raw file in the "raw-data" folder and an answer file (ground truth) in the "Answer" folder. (The simple version does not contain the "citation", "co-affiliation-occur", and "homepage" features. Refer to our ICDM 2011 paper for the definition of these features.)

Web User Profiling

Credit to the team led by Professor Jibing Gong and Haopeng Zhang from YSU (Yanshan University) for labeling some of the data.

1. Email

For email extraction, we labeled a dataset of around 2000 people for training and testing. The name list is selected randomly from AMiner. For each person in this name list, we used Google to search for and extract candidate email addresses. We used contact information in the AMiner system as most of the ground truth, and had human experts (without knowledge of our classification model) label and double-check the data.

2. Gender

For gender inference, we offer a labeled xlsx file of around 2400 people from the AMiner system, with fields including name, organization, position and homepage.

Career Trajectory  

[Download]

We release the AMiner dataset for interested researchers. The dataset includes 57037 persons and 42230 affiliations harvested from AMiner. We have made some effort to disambiguate persons with the same name and to merge different spellings of the same address (some noise may remain). We also collected 722 curricula vitae from the Internet, which can be treated as real-world ground truth.

Network Integration

We have collected data from different social networking sites. The dataset consists of two collections of social networks, where the networks within a collection overlap with each other (i.e., they have users corresponding to the same real-world person).

SNS network collection

The SNS data collection consists of five popular online social networking sites: Twitter, LiveJournal, Flickr, Last.fm, and MySpace.

The ground truth mapping of the SNS network collection was originally collected by Perito et al. through the Google Profiles service. Please contact the original owner to obtain the data. Here, we provide a subset of the data for evaluation.

Twitter - Livejournal
Twitter - Flickr
Twitter - Lastfm
Twitter - MySpace
Livejournal - Flickr
Livejournal - Lastfm
Livejournal - MySpace
Flickr - Lastfm
Flickr - MySpace
Lastfm - MySpace

Academia network collection

The Academia data collection consists of three academic or professional social networks: ArnetMiner (AM), Linkedin and Videolectures.

The ground truth for the Academia dataset is obtained through a crowdsourcing service on ArnetMiner. On each researcher's ArnetMiner profile, users can fill in URLs linking to the researcher's external accounts. This service has been running online for more than one year, and more than 10,000 interlink records have been collected. Here, we provide a subset of the data for evaluation.

AMiner-Linkedin

Open Academic Graph

This data set is generated by linking two large academic graphs: Microsoft Academic Graph (MAG) and AMiner.

The data set is for research purposes only. This version includes 166,192,182 papers from MAG and 154,771,162 papers from AMiner. We generated 64,639,608 linking (matching) relations between the two graphs. In the future, more linking results, such as authors, will be published. The data set can be used as a unified large academic graph for studying citation networks, paper content, and other topics, and can also be used to study the integration of multiple academic graphs.

Name Disambiguation

Name ambiguity has long been viewed as a challenging problem in many applications, such as scientific literature management, people search, and social network analysis. When we search for a person's name in these systems, many documents (e.g., papers, webpages) containing that name may be returned. Which documents are about the person we care about? Although much research has been conducted, the problem remains largely unsolved, especially with the rapid growth of information about people available on the Web.

Science Knowledge Graph

SciKG is a rich knowledge graph designed for scientific purposes (currently covering computer science (CS)), consisting of concepts, experts, and papers. The concepts and their relationships are extracted from the ACM computing classification system, supplemented with a definition of each concept from sources such as Wikipedia. We further use AMiner to associate top-ranked experts and the most relevant papers with each concept. Each expert has a position, affiliation, research interests, and a link to AMiner (for further information if necessary), and each paper contains meta information such as title, authors, abstract, publication venue, and year.

Knowledge Graph for AI

130,750 scholars, 343,746 scholarly articles, 229,937 specialties from 103 conferences

AMiner Knowledge Graph

AMiner Knowledge Graph is a structured entity network extracted from AMiner. It comprises over 500,000 entities and about 290,000,000 links among them. The knowledge graph can be used as a benchmark for studying knowledge graph construction and as an external resource for search and recommendation.

Knowledge Graph for Data Mining  

[Download]

23 second-level nodes and 309 third-level nodes

Knowledge Graph for Knowledge Graph  

[Download]

Knowledge Graph for Knowledge Graph

Top 10000 Scholars' Trajectories  

[Download]

Trajectories of 9992 experts with the greatest h-index in AMiner since 1978

Knowledge Graph for Machine Learning  

[Download]

Knowledge Graph for Machine Learning