Semi-Automatic LaTeX-Based Labeling of Mathematical Objects in PDF Documents - MOP Data Set.

Donald Beyette, Zelun Wang, Jason Lin, Jyh-Charn Liu

DocEng (2019)


Abstract

Mathematical objects (MOs) in PDF documents are paramount to understanding the ontology and mathematical essence of published science, technology, engineering, and mathematics (STEM) documents. As of now, Marmot is the only publicly available data set for optimizing and evaluating MO labeling models in PDF documents. Thus, this paper proposes a semi-automatic MO labeling algorithm that uses PDF documents and their corresponding LaTeX source files to generate a new data set consisting of MO bounding boxes (Bbox) in PDF documents, together with their LaTeX equation, topic, and subject. The first step in labeling each MO is to transform the LaTeX and PDF documents into a string format. Afterwards, a shortest unique string-matching technique is proposed to align PDF pages with LaTeX files. On each page, a similar shortest string-matching technique is employed to align each LaTeX MO with its PDF counterpart. Once an MO is located, the PDF and LaTeX MOs are normalized in order to match symbols between their LaTeX and PDF representations. A number of filtering rules are set to eliminate matches that are considered exceedingly inconsistent. Matches that pass these rules have their MOs highlighted for final manual inspection. A total of 1,802 pages in the high energy physics (hep-th) field were labeled.
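The abstract does not spell out the shortest unique string-matching step, so the following is only a minimal sketch of one plausible reading: grow a substring of the LaTeX-derived string until it occurs exactly once in the PDF-derived string, and use that position as an alignment anchor. The function names, the `min_len` parameter, and the sample strings are hypothetical and are not taken from the paper.

```python
from typing import Optional, Tuple


def count_occurrences(haystack: str, needle: str) -> int:
    """Count (possibly overlapping) occurrences of needle in haystack."""
    count, start = 0, 0
    while True:
        idx = haystack.find(needle, start)
        if idx == -1:
            return count
        count += 1
        start = idx + 1


def shortest_unique_anchor(
    source: str, start: int, target: str, min_len: int = 3
) -> Optional[Tuple[str, int]]:
    """Grow source[start:start+n] until it occurs exactly once in target.

    Returns (anchor, position_in_target), or None if no unique anchor exists.
    Assumption: both strings were already normalized the same way.
    """
    for n in range(min_len, len(source) - start + 1):
        candidate = source[start:start + n]
        hits = count_occurrences(target, candidate)
        if hits == 0:
            return None  # a longer candidate cannot match either
        if hits == 1:
            return candidate, target.find(candidate)
    return None


# Hypothetical strings standing in for LaTeX text and one PDF page's text.
latex_text = "the energy is given by"
pdf_page = "the mass is small and the energy is given by E"
result = shortest_unique_anchor(latex_text, 0, pdf_page)
```

In this sketch the anchor stops growing as soon as it becomes unique, which keeps the match short and therefore more tolerant of differences that appear later in the two strings. A real pipeline would presumably need fuzzier comparison than exact substring search.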


Keywords

Mathematical object, LaTeX, PDF, ground truth, semi-automatic labeling, ontology
