Extraction of Math Expressions from PDF Documents Based on Unsupervised Modeling of Fonts

2019 International Conference on Document Analysis and Recognition (ICDAR), 2019

Abstract
This paper proposes a multi-stage architecture for extracting math expressions (MEs) from PDF documents based on font analysis. The unsupervised algorithm starts with symbol-level analysis of the metadata of PDF objects, including font size, font name, and glyph name. Two subsequent stages use a set of spatial and semantic heuristics to merge multiple ME symbols into inline MEs and displayed MEs. The algorithm is tested on the Marmot dataset (amended with missing cases). For displayed MEs, the proposed method achieves 93.6% precision, 99.4% recall, and a 96.4% F1-score. For inline MEs, it achieves 92.2% precision, 91.9% recall, and a 92.1% F1-score. In addition, the algorithm takes an average of only 1.09 s to process a page, which is faster than existing methods.
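As a quick consistency check on the figures above, the F1-score can be recomputed as the harmonic mean of precision and recall. The following is a minimal sketch, not code from the paper: the function name `f1_score` and the `reported` dictionary are illustrative assumptions, and the inputs are the rounded percentages quoted in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit for both inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Percentages as quoted in the abstract (already rounded to one decimal).
# The dictionary layout is an assumption made for this illustration.
reported = {
    "displayed": {"precision": 93.6, "recall": 99.4, "f1": 96.4},
    "inline": {"precision": 92.2, "recall": 91.9, "f1": 92.1},
}

for kind, m in reported.items():
    computed = f1_score(m["precision"], m["recall"])
    print(f"{kind}: computed F1 = {computed:.2f}, reported F1 = {m['f1']}")
```

Both recomputed values agree with the reported F1-scores to within 0.1 percentage points. The inline value comes out slightly below 92.1, which is plausible when the precision and recall inputs are themselves rounded.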
Keywords
math expressions, PDF documents, unsupervised learning, likelihood ratio test