Extracting Figures and Captions from Scientific Publications.

Pengyuan Li, Xiangying Jiang, Hagit Shatkay

CIKM (2018)

Abstract

Figures and captions convey essential information in scientific publications. As such, there is a growing interest in mining published figures and in utilizing their respective captions as a source of knowledge. There is also much interest in image captioning systems that can automatically generate captions for images, whose training requires large datasets of image-caption pairs. Notably, the first fundamental step of obtaining figures and captions from publications is neither well-studied nor yet well-addressed. In this paper, we introduce a new and effective system for figure and caption extraction, PDFigCapX. Unlike current methods that extract figures by handling raw encoded contents of PDF documents, we separate text from graphical contents and utilize layout information to detect and disambiguate figures and captions. Files containing the figures and their associated captions are then produced as output to the end-user. We test PDFigCapX on both a previously used generic dataset and on two new sets of publications within the biomedical domain. Our experiments and results show a significant improvement in performance compared to the state-of-the-art, and demonstrate the effectiveness of our approach. Our system will be available for use at: https://www.eecis.udel.edu/~compbio/PDFigCapX.

Keywords

Data extraction, scientific document analysis, figure extraction, caption extraction, PDF parsing
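
The abstract describes separating text from graphical content and then using layout information to associate figures with captions. The following is a minimal sketch of that general idea, not the PDFigCapX algorithm itself: it assumes the page has already been split into graphical regions and text blocks with bounding boxes, and it pairs each caption-like text block with the nearest graphical region that shares its column. The `Box` type, the caption pattern, and the nearest-region heuristic are all illustrative assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


# Assumed caption pattern: "Figure 1", "Fig. 2", "fig 3", and similar.
CAPTION_RE = re.compile(r"^fig(ure)?\.?\s*\d+", re.IGNORECASE)


def is_caption(text):
    """Return True if a text block looks like the start of a figure caption."""
    return bool(CAPTION_RE.match(text.strip()))


def vertical_gap(a, b):
    """Vertical distance between two boxes; 0 if they overlap vertically."""
    return max(0.0, max(a.y0, b.y0) - min(a.y1, b.y1))


def horizontal_overlap(a, b):
    """Width of the horizontal overlap between two boxes; 0 if disjoint."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def pair_captions(graphics, text_blocks):
    """Pair each caption-like text block with its nearest graphical region.

    graphics:    list of Box, one per graphical region on the page.
    text_blocks: list of (text, Box) tuples, one per text block.
    Returns a list of (graphic_box, caption_text) pairs.
    """
    pairs = []
    unused = list(graphics)
    for text, tbox in text_blocks:
        if not is_caption(text):
            continue
        # Only consider regions in the same column as the caption.
        candidates = [g for g in unused if horizontal_overlap(g, tbox) > 0]
        if not candidates:
            continue
        best = min(candidates, key=lambda g: vertical_gap(g, tbox))
        unused.remove(best)  # each region gets at most one caption
        pairs.append((best, text))
    return pairs
```

A real system has to handle cases this sketch ignores, such as captions placed beside a figure, figures spanning two columns, and several graphical fragments that together form one figure.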
