OAG-Paper-TOT: This dataset collects Test-of-Time award data for papers from selected computer science conferences and journals. A Test-of-Time Award recognizes a paper that has had a major theoretical or applied influence several years after its publication. Awards with similar meanings include the Most Influential Award and the Hall of Fame Award, among others. To date, 1,063 computer science papers that received such awards by 2022 have been collected.
AuthPred-2017 and AuthPred-2022: These datasets are for predicting authors' future citation counts. The input is the collection of papers published no later than year 𝑦𝑟, together with the citation relationships between them; each paper has attributes such as title, authors, publication venue (conference or journal), and year. The goal is to predict each author's citation count in year 𝑦𝑟 + 𝛥𝑦𝑟.
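As an illustration of the AuthPred setup, the sketch below counts the citations each author has received from papers published up to a given year. This kind of count could serve as a baseline feature or as the target value. The record layout (`authors`, `year`) and the `(citing, cited)` edge format are assumptions made for this example and are not the benchmark's actual schema.

```python
from collections import defaultdict

def author_citations(papers, citations, up_to_year):
    """Count citations per author from citing papers published by up_to_year.

    papers:    dict paper_id -> {"authors": [...], "year": int}  (hypothetical layout)
    citations: iterable of (citing_id, cited_id) pairs           (hypothetical layout)
    """
    counts = defaultdict(int)
    for citing, cited in citations:
        if citing not in papers or cited not in papers:
            continue  # skip edges that point outside the collection
        if papers[citing]["year"] <= up_to_year:
            # every author of the cited paper receives one citation
            for author in papers[cited]["authors"]:
                counts[author] += 1
    return dict(counts)

# Toy data, invented for illustration only.
papers = {
    "p1": {"authors": ["alice"], "year": 2010},
    "p2": {"authors": ["bob"], "year": 2015},
    "p3": {"authors": ["alice", "bob"], "year": 2019},
}
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]

observed = author_citations(papers, citations, up_to_year=2017)  # known at yr
target = author_citations(papers, citations, up_to_year=2022)    # at yr + Δyr
```

Here `observed` plays the role of what is known at 𝑦𝑟, and `target` the quantity to be predicted at 𝑦𝑟 + 𝛥𝑦𝑟.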
Given a paper 𝑝 (including its full text) and its references, the goal is to find the most important references, called ref-sources, which largely inspired 𝑝 in terms of ideas or methods. A paper can have one or more ref-sources. For each reference of 𝑝, an importance score in the range [0, 1] is calculated.
Because labeling paper sources requires strong expertise, dozens of computer science graduate students were hired to mark the sources of papers in the fields they are familiar with. After collection and preprocessing, labeled data for 1,120 computer science papers were obtained.
We construct an academic QA dataset from real questions posted on question-and-answer (Q&A) forums. We retrieve question posts from StackExchange and Zhihu and extract the paper URLs that users cite in their answers. The papers are aligned with OAG and Semantic Scholar. The dataset contains 17,948 question-paper pairs, organized in a two-level hierarchy of 22 disciplines and 87 topics, where each topic belongs to a specific discipline.
Reviewer Recommendation (Graph Dataset)
This dataset collects real paper-submission matching relations from Frontiers, covering 210,069 reviewers and 225,478 papers.
AMiner Paper Click (Graph Dataset)
User-paper interactions on AMiner. This dataset collects users' behaviour on AMiner from Aug. 2021 to Nov. 2021. The sparsest records were filtered out, keeping only users who clicked more than 10 papers. The data are split into train, validation, and test files at a ratio of 8:1:1. In each file, every line starts with a user ID, and the remaining numbers are the IDs of the papers that user clicked.
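The line format described above can be read with a few lines of code. The following is a minimal sketch that assumes whitespace-separated integer IDs; the actual delimiter and ID type in the released files may differ, and the sample lines are invented for illustration.

```python
from typing import Dict, Iterable, List

def parse_click_lines(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Parse lines of the form '<user_id> <paper_id> <paper_id> ...'.

    Assumes whitespace-separated integer IDs (an assumption, not a
    documented property of the files).
    """
    clicks: Dict[int, List[int]] = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        user_id = int(fields[0])
        paper_ids = [int(x) for x in fields[1:]]
        # merge in case the same user appears on more than one line
        clicks.setdefault(user_id, []).extend(paper_ids)
    return clicks

# Invented sample lines in the described layout.
sample = ["0 12 7 33", "", "1 5 9"]
clicks = parse_click_lines(sample)
```

To read one of the split files, the same function can be applied to an open file handle, since iterating over a text file yields its lines.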
Concept Taxonomy Expansion (Graph Dataset)
Given an existing concept hierarchy tree (taxonomy) 𝑇0 and a set of new concepts 𝐶, the goal is to predict, for each new concept 𝑐 ∈ 𝐶, its hypernym pa(𝑐) ∈ 𝑇0, so as to expand the existing concept hierarchy tree. We provide three concept taxonomy datasets here.
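The expansion step itself can be sketched as follows. The taxonomy is stored as a child-to-parent map, and each new concept is attached under its predicted hypernym, which must already be a node of 𝑇0. The data layout, function name, and example concepts are assumptions for illustration; how the hypernym is predicted is left out entirely.

```python
def expand_taxonomy(parent_of, predicted_parent):
    """Attach each new concept c under its predicted hypernym pa(c).

    parent_of:        dict child -> parent for the existing tree T0
                      (the root maps to None); hypothetical layout
    predicted_parent: dict new concept c -> predicted pa(c)
    """
    existing_nodes = set(parent_of)
    expanded = dict(parent_of)  # leave the input taxonomy unchanged
    for concept, parent in predicted_parent.items():
        if parent not in existing_nodes:
            # the task requires pa(c) to be a node of T0
            raise ValueError(f"predicted hypernym {parent!r} is not in T0")
        expanded[concept] = parent
    return expanded

# Toy taxonomy, invented for illustration only.
t0 = {
    "science": None,
    "computer science": "science",
    "machine learning": "computer science",
}
expanded = expand_taxonomy(t0, {"deep learning": "machine learning"})
```

Checking membership against the original node set, rather than the growing one, keeps every new concept attached to a node of 𝑇0, as the task definition requires.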
Entity Tagging (Graph Dataset)
OPEDAC-2017: This dataset includes manually labeled research-interest labels for 11,357 scholars, along with these scholars' papers. Data from 6,000 scholars are used for training, and the remaining data form the test set.
DBLP-Paper-Topic: This dataset is based on the DBLP paper citation network. The topic of each paper is determined by its venue. There are 9 topics in total.
WhoIsWho provides the world's largest manually labeled benchmark, with over 1,000,000 papers, built using an interactive annotation process, as well as a regular leaderboard with comprehensive tasks: From-scratch Name Disambiguation, Real-time Name Disambiguation, and Incorrect Assignment Detection. The historical WhoIsWho contests have already attracted more than 3,000 researchers.
CCKS2021-En: This dataset comes from AMiner and is the English subset of the CCKS 2021 scholar profiling track. It includes 9,921 scholars.
ScholarXL: This dataset comes from the text descriptions on scholars' official homepages; each attribute is manually marked with its start and end positions in the text. It is intended for long-text extraction.
The dataset contains 2,000 AI 2000 scholars in 20 sub-domains of artificial intelligence.