Extracting Figures and Captions from Scientific Publications.


Figures and captions convey essential information in scientific publications. As such, there is a growing interest in mining published figures and in utilizing their respective captions as a source of knowledge. There is also much interest in image captioning systems that can automatically generate captions for images, whose training requires large datasets of image-caption pairs. Notably, the first fundamental step of obtaining figures and captions from publications is neither well-studied nor yet well-addressed. In this paper, we introduce a new and effective system for figure and caption extraction, PDFigCapX. Unlike current methods that extract figures by handling raw encoded contents of PDF documents, we separate text from graphical contents and utilize layout information to detect and disambiguate figures and captions. Files containing the figures and their associated captions are then produced as output to the end-user. We test PDFigCapX on both a previously used generic dataset and on two new sets of publications within the biomedical domain. Our experiments and results show a significant improvement in performance compared to the state-of-the-art, and demonstrate the effectiveness of our approach. Our system will be available for use at: https://www.eecis.udel.edu/~compbio/PDFigCapX.
Keywords: data extraction, scientific document analysis, figure extraction, caption extraction, PDF parsing