|Provide comprehensive open datasets about COVID-19 all over the world|
|1572277 papers||2084019 citation relationships||Citation network|
|2,092,356 papers/1,712,433 authors||8,024,869 citation relationships/4,258,615 coauthor relationships||citation and coauthor networks|
|4794 authors||2164 advisor-advisee,3932 coauthor relationships||Advisor-advisee network|
|640134 authors of 8 topics||1554643 coauthor relationships||Topic based Coauthor network|
|33739 authors of 5 topics||139278 coauthor relationships||Created for cross domain recommendation|
|2329760 papers||12710347 citation relationships||Topic based citation network|
|8000 papers of 27 conferences||Created for community detection|
|1629217 authors||2623832 coauthor relationships||An evolving coauthor network with 27 time stamps|
|898 files||Created for researcher profile extraction|
|155 citation pairs||Created to study the semantics of the citation relationships|
|1781 experts of 13 topics||A benchmark for expert finding|
|8369 author pairs of 9 topics||Created for association search|
|Top 1000000 papers and authors of 200 topics||The results of ACT model on AMiner dataset|
|1560640 authors||4258946 coauthor relationships||Coauthor network|
|110 authors and their affiliations/papers||(a) 6,730 papers for 100 author names; (b) 1,085 Web pages for 12 person names; (c) 755 ambiguous entities appearing in 20 news pages.|
|emails of 2,000 people and gender of 2,400 people||Created for web user profiling|
|57,037 persons and 42,230 affiliations||Created for studying career trajectories of scholars|
|two data collections: SNS and Academic||Created for network integration|
|166,192,182 papers from MAG, 154,771,162 papers from AMiner, and 64,639,608 linking (matching) relations||Created for studying the integration of multiple academic graphs|
|23,823 names and 83,980 persons||Created for studying author name disambiguation|
|908 concepts, 206,240 experts and 512,698 publications||A knowledge graph consisting of concepts, experts, and papers in Computer Science|
|130,750 scholars, 343,746 scholarly articles, 229,937 specialties from 103 conferences|
|100,000 tags, 318,406 scholars, 63,068 organizations and 23,709 venues||A structured entity network extracted from AMiner|
|23 second-level nodes, 309 third-level nodes||Knowledge Graph for Data Mining|
|11 second-level nodes, 212 third-level nodes||Knowledge Graph for Knowledge Graph|
|9992 experts with the greatest h-index in AMiner||Trajectories of 9992 experts with the greatest h-index in AMiner since 1978|
|Eight-level knowledge graph for machine learning||Knowledge Graph for Machine Learning|
|5340 users and 14967 items||163084 clicks||User-Paper interactions on AMiner|
COVID-19 Open Datasets[Download]
Open, comprehensive data can help researchers, officials, medical staff, and the public better understand the virus and the COVID-19 pandemic. The team has been collecting all kinds of open datasets about COVID-19 and updates them every day. The datasets cover the pandemic itself, research, knowledge graphs, media reports, and so on.
The data set is designed for research purposes only. The citation data is extracted from DBLP, ACM, and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with an abstract, authors, year, venue, and title.
The data set can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.
A larger version will be released soon.
Academic Social Network[Download]
This data set contains paper information, paper citations, author information, and author collaborations. 2,092,356 papers and 8,024,869 citations between them are saved in the file AMiner-Paper.rar; 1,712,433 authors are saved in the file AMiner-Author.zip; and 4,258,615 collaboration relationships are saved in the file AMiner-Coauthor.zip.
This data set contains 6 different networks: Epinions, Slashdot, MobileU, MobileD, Coauthor, and Enron.
- Epinions is a network of product reviewers. Each user on the site can post a review for any product, and other users rate the review with trust or distrust. The data set consists of 131,828 users and 841,372 relationships, of which about 85.0% are trust relationships. 80,668 users received at least one trust or distrust relationship.
- Slashdot is a network of friends. Slashdot is a site for sharing technology-related news. The data set comprises 77,357 users and 516,575 relationships, of which 76.7% are "friend" relationships.
- MobileU is a network of mobile users. It consists of call logs, Bluetooth scanning data, and cell tower IDs of 107 users over about ten months. In total, the data contains 5,436 relationships.
- MobileD is a larger mobile network from an enterprise, where nodes are employees of a company and relationships are formed by calls and short messages exchanged between them over a few months. In total, there are 232 users (50 managers and 182 ordinary employees) and 3,567 relationships (covering both calls and text messages) between the users.
- Coauthor is a network of authors. The data set, crawled from Arnetminer.org, is comprised of 815,946 authors and 2,792,833 coauthor relationships.
- Enron is an email communication network. It consists of 136,329 emails between 151 Enron employees. Two types of relationships, i.e., manager-subordinate and colleague, were annotated between these employees. There are in total 3,572 relationships, of which 133 are manager-subordinate relationships.
The citation network consists of papers and citation relationships chosen from ArnetMiner. The raw citation data consists of 2,555 papers and 6,101 citation relationships. The papers are mainly from 10 research fields:
Topic 16: Data Mining / Association Rules
Topic 107: Web Services
Topic 131: Bayesian Networks / Belief function
Topic 144: Web Mining / Information Fusion
Topic 145: Semantic Web / Description Logics
Topic 162: Machine Learning
Topic 24: Database Systems / XML Data
Topic 75: Information Retrieval
Topic 182: Pattern recognition / Image analysis
Topic 199: Natural Language System / Statistical Machine Translation.
This data set includes three different real-world social networks:
- Coauthor (a co-authorship network with 822,415 nodes and 2,928,360 undirected edges). Each vertex represents an author and each edge represents a co-author relation.
- Wikipedia (a co-editorship network with 310,990 nodes and 10,780,996 undirected edges crawled from Wikipedia.org). Each vertex represents a Wikipedia editor and each edge represents a co-editing relation.
- Twitter (a following network with 465,023 nodes and 833,590 directed edges crawled from twitter.com). Each vertex represents a Twitter user account and each edge represents a following relation. It is well known that the web displays a bow-tie structure, where 30% of the vertices are strongly connected. We conducted a bow-tie analysis on the Twitter network and found that only 8% (38,913) of the vertices are strongly connected.
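The strongly connected share quoted for the Twitter network can be measured on any directed edge list. The following is a minimal sketch, not the procedure actually used for this data set: it assumes edges are given as (follower, followee) pairs and uses Kosaraju's algorithm, written iteratively so that large graphs do not hit Python's recursion limit. The function names and the toy edge list are hypothetical.

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Return the SCCs (each a set of nodes) of a directed edge list (Kosaraju)."""
    graph = defaultdict(list)
    reverse = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    # Pass 1: record nodes in order of DFS completion on the original graph.
    visited = set()
    order = []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: this node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph, in decreasing finish order.
    assigned = set()
    components = []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        components.append(component)
    return components

def largest_scc_fraction(edges):
    """Fraction of all vertices that lie in the largest SCC."""
    components = strongly_connected_components(edges)
    total = sum(len(c) for c in components)
    return max(len(c) for c in components) / total

# Toy graph (made up): a -> b -> c -> a forms the core, d and e hang off it.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "a")]
print(largest_scc_fraction(toy_edges))  # 3 of the 5 vertices are in the core
```

On the real data, the same ratio would correspond to the 38,913 strongly connected vertices out of 465,023 reported above.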
We are developing extraction tools in ArnetMiner, a researcher social network system. The tools will be used to extract researcher profiles from Web pages and output the extracted information into a researcher database.
The data set and related documents are used for researcher profile extraction.
We have collected topics and their related people lists from as many sources as possible. We randomly chose 13 topics and created 13 people lists. The data sets were used as the “golden metric” for expert finding. They were also used to create the test sets for association search. The following table shows the 13 topics and statistics of people we have collected. In the 13 topics, OA and SW are from PC members of the related conferences or workshops. DM is from a list of data mining people organized by kmining.com. IE is from a list of information extraction researchers that were collected by Muslea. BS and SVM are from their official web sites, respectively. PL, IA, ML, and NLP are from a page organized by Russell and Norvig, which links to 849 pages around the web with information on Artificial Intelligence.
To evaluate the effectiveness of our proposed association search approach, we created 8 test sets. Each person pair contains a source person (including name and id) and a target person (including name and id). The test sets were created as follows. We randomly selected 1,000 person pairs from the researcher network to create the first test set.
We used the above people lists to create the other 8 test sets. We created three test sets by randomly selecting person pairs from SW, DM, and IE, respectively. With these three test sets, we aim to test association search between persons from the same research community. We created the other five test sets by selecting persons from different research fields.
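The random selection of person pairs described above can be sketched as follows. This is only an illustration under assumptions: the source does not say how pairs were drawn, so the sketch assumes ordered (source, target) pairs of distinct persons, sampled without repetition from a list of (id, name) tuples. The function name, the seed, and the sample people are hypothetical.

```python
import random

def sample_person_pairs(people, n_pairs, seed=0):
    """Draw n_pairs distinct (source, target) pairs of different persons."""
    max_pairs = len(people) * (len(people) - 1)
    if n_pairs > max_pairs:
        raise ValueError("not enough distinct ordered pairs to sample from")
    rng = random.Random(seed)  # fixed seed so a test set can be regenerated
    pairs = set()
    while len(pairs) < n_pairs:
        # sample() never returns the same person twice, so source != target.
        source, target = rng.sample(people, 2)
        pairs.add((source, target))
    return sorted(pairs)

# Hypothetical people list: (id, name) tuples.
people = [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave")]
for source, target in sample_person_pairs(people, n_pairs=3):
    print(source, "->", target)
```

For a same-community test set, `people` would be restricted to one of the people lists (for example SW); for a cross-field test set, the source and target could instead be drawn from two different lists.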
Topic model results for Arnetminer dataset[Download]
Aminer Author Name and ID
It consists of a mapping between the names and ids of authors in Arnetminer. The data is formatted as a two-column list: the first column is the Arnetminer id and the second column is the author name.
Aminer Topic Top 5000 Publications and Authors
It consists of the top 5000 publications of each topic in Arnetminer. The data is provided as 3 XML files, which contain the data for topics, publications, and authors, respectively.
ACTMaps Author Topic
It consists of the topic distribution for each author. The data is organized into 733,602 rows, one per author. Each row consists of columns separated by a blank space, and each column is a topic id and a weight separated by a ":".
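A row in the format described above could be read as in the sketch below. It assumes that a row holds only "topic_id:weight" columns (the source does not say whether an author id is also present), that topic ids are integers, and that weights are decimal numbers. The function names and the sample row are made up for illustration.

```python
def parse_topic_row(line):
    """Parse one row of space-separated 'topic_id:weight' columns into a dict."""
    distribution = {}
    for column in line.split():
        topic_id, weight = column.split(":", 1)
        distribution[int(topic_id)] = float(weight)
    return distribution

def top_topics(distribution, k=3):
    """Return the k (topic_id, weight) pairs with the largest weights."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Made-up row for illustration; real rows come from the downloaded file.
sample_row = "16:0.42 162:0.31 75:0.27"
author_topics = parse_topic_row(sample_row)
print(top_topics(author_topics, k=2))
```

Reading the whole file is then a matter of applying `parse_topic_row` to each line, with the row position identifying the author under the assumption stated above.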
Aminer FOAF Data Set
It consists of the FOAF data of authors in arnetminer.org. The data is organized in standard FOAF format.
This data set is used for studying name disambiguation in digital libraries. It contains 110 author names and their disambiguation results (ground truth). Each author name corresponds to a raw file in the "raw-data" folder and an answer file (ground truth) in the "Answer" folder. (The simple version does not contain "citation", "co-affiliation-occur", "homepage". Refer to our ICDM 2011 paper for the definition of these features.)
Web User Profiling[Download]
Credit to the team led by Professor Jibing Gong and Haopeng Zhang from YSU (Yanshan University) for labeling some of the data.
For email extraction, we labeled a dataset of around 2000 people for training and testing. The name list was selected randomly from AMiner. For each person in this name list, we used Google to search for and extract candidate email addresses. We used contact information in the AMiner system for most of the ground truth, and had human experts (without knowledge of our classification model) label and double-check the data.
For Gender inference, we offer a labeled xlsx file of around 2400 people from the AMiner system, with fields including name, organization, position and homepage.
We release the Aminer dataset for interested researchers. The dataset includes 57037 persons and 42230 affiliations harvested from Aminer. We have made some effort to disambiguate persons with the same name and to merge different spellings of the same address (some noise may remain). We also collected 722 curricula vitae from the Internet, which can be treated as real-world ground truth.
We have collected data from different social networking sites. The dataset consists of two collections of social networks, where the networks within a collection overlap with each other (i.e., they have users corresponding to the same real-world person).
SNS network collection
The SNS data collection consists of five popular online social networking sites: Twitter, LiveJournal, Flickr, Last.fm, and MySpace.
The ground truth mapping of the SNS network collection was originally collected by Perito et al. through the Google Profiles service. Please contact the original owner to obtain the data. Here, we provide a subset of the data for evaluation.
Twitter - Livejournal
Twitter - Flickr
Twitter - Lastfm
Twitter - MySpace
Livejournal - Flickr
Livejournal - Lastfm
Livejournal - MySpace
Flickr - Lastfm
Flickr - MySpace
Lastfm - MySpace
Academia network collection
The Academia data collection consists of three academic or professional social networks: ArnetMiner (AM), LinkedIn, and Videolectures.
The ground truth for the Academia dataset is obtained through a crowdsourcing service on ArnetMiner. On each researcher's ArnetMiner profile, users can fill in URLs linking to the researcher's external accounts. This service has been running online for more than one year, and more than 10,000 interlink records have been collected. Here, we provide a subset of the data for evaluation.
Open Academic Graph[Download]
The data set is for research purposes only. This version includes 166,192,182 papers from MAG and 154,771,162 papers from AMiner. We generated 64,639,608 linking (matching) relations between the two graphs. In the future, more linking results, such as for authors, will be published. The data set can be used as a unified large academic graph for studying citation networks, paper content, and other topics, and can also be used to study the integration of multiple academic graphs.
Name ambiguity has long been viewed as a challenging problem in many applications, such as scientific literature management, people search, and social network analysis. When we search a person name in these systems, many documents (e.g., papers, webpages) containing that person’s name may be returned. Which documents are about the person we care about? Although much research has been conducted, the problem remains largely unsolved, especially with the rapid growth of the people information available on the Web.
Science Knowledge Graph[Download]
SciKG is a rich knowledge graph designed for scientific purposes (currently covering computer science (CS)), consisting of concepts, experts, and papers. The concepts and their relationships are extracted from the ACM Computing Classification System, supplemented with a definition of each concept from sources such as Wikipedia. We further use AMiner to associate top-ranked experts and the most relevant papers with each concept. Each expert has a position, affiliation, research interests, and a link to AMiner (for further information if necessary), and each paper contains meta information such as title, authors, abstract, publication venue, and year.
Knowledge Graph for AI[Download]
130,750 scholars, 343,746 scholarly articles, 229,937 specialties from 103 conferences
AMiner Knowledge Graph[Download]
AMiner Knowledge Graph is a structured entity network extracted from AMiner. It is comprised of over 500,00 entities and about 290,000,000 links among them. The knowledge graph can be used as a benchmark to study knowledge graph construction and also used as an external resource for search/recommendation.
Knowledge Graph for Data Mining[Download]
Knowledge Graph for Knowledge Graph[Download]
Knowledge Graph for Knowledge Graph
Top 10000 Scholars' Trajectories[Download]
Trajectories of 9992 experts with the greatest h-index in AMiner since 1978
Knowledge Graph for Machine Learning[Download]
Knowledge Graph for Machine Learning
AMiner Paper Click[Download]
This dataset collects users' behaviour on AMiner from Aug. 2021 to Nov. 2021. We filtered out the sparsest records and kept users who clicked more than 10 papers. The data has been partitioned into train/validation/test files with a ratio of 8:1:1. In each line of each file, the first number is the user id, and the remaining numbers are the ids of the papers that this user clicked.
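The line format described above could be loaded as in the sketch below. It assumes the numbers on a line are separated by whitespace and that both user ids and paper ids are integers; the source does not state the delimiter. The function names and the sample lines are hypothetical.

```python
def parse_click_line(line):
    """Split one line into (user_id, [paper_id, ...])."""
    fields = [int(field) for field in line.split()]
    return fields[0], fields[1:]

def load_clicks(lines):
    """Build a {user_id: [paper_id, ...]} mapping from an iterable of lines."""
    clicks = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        user_id, paper_ids = parse_click_line(line)
        clicks.setdefault(user_id, []).extend(paper_ids)
    return clicks

# Made-up lines for illustration; real lines come from the train/validation/test files.
sample_lines = ["0 12 57 903", "1 57 44", ""]
clicks = load_clicks(sample_lines)
print(clicks)
print(sum(len(papers) for papers in clicks.values()), "clicks in total")
```

Because `load_clicks` accepts any iterable of lines, an open file object can be passed to it directly.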