Automated generation of text handles from scanned images of scholarly articles for indexing in digital archive

Multimedia Tools and Applications (2022)

Abstract
Automated document categorization, retrieval, and recommendation have been studied extensively and have improved rapidly. These tasks are central to information retrieval and data extraction, and the process of organizing documents for storage in archives is gradually becoming fully automated. As the volume of scholarly articles grows rapidly, their categorization and indexing remain both a challenge and a real need. Automation is also needed for proper indexing and retrieval of older scholarly articles, thousands of which are available in libraries as print versions. In this paper, we propose a simple and robust method for generating text handles from scanned images of scholarly articles so that they can be managed efficiently in digital archives. We also propose a Delaunay triangulation based feature set for the associated categorization work. The proposed work is based mainly on the idea of tracking the locality of emphasized (italic) words. We primarily consider the articles' titles and reference pages when extracting the crucial information used to find handles. Italics are detected using Principal Component Analysis (PCA), applied to a selective subset of object boundary pixels that represent the vertical (column) edges. We show how efficiently the proposed method generates text handles for indexing scholarly articles.
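The abstract does not give the details of the italic detector, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a binarized glyph or stroke image (foreground = 1), takes the left-hand boundary pixels as the "vertical edge" subset, and uses the dominant PCA axis of those pixels to estimate slant. The helper names (`vertical_edge_pixels`, `slant_angle_deg`, `is_italic`, `make_stroke`) and the 8-degree threshold are hypothetical choices made for illustration.

```python
import numpy as np


def vertical_edge_pixels(img):
    """Return (x, y) coordinates of foreground pixels whose left neighbour
    is background, i.e. the left-hand vertical edge of each stroke."""
    fg = img.astype(bool)
    left = np.zeros_like(fg)
    left[:, 1:] = fg[:, :-1]          # left[y, x] is the pixel at (x - 1, y)
    ys, xs = np.nonzero(fg & ~left)
    return np.column_stack([xs, ys]).astype(float)


def slant_angle_deg(points):
    """Angle (degrees) between the first principal axis and the vertical.
    Positive values mean the stroke leans to the right, as in italics."""
    centered = points - points.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    vx, vy = eigvecs[:, np.argmax(eigvals)]
    if vy > 0:                               # orient the axis upward (image y grows downward)
        vx, vy = -vx, -vy
    return float(np.degrees(np.arctan2(vx, -vy)))


def is_italic(img, threshold_deg=8.0):
    """Hypothetical decision rule: slant above an assumed threshold."""
    return slant_angle_deg(vertical_edge_pixels(img)) > threshold_deg


def make_stroke(theta_deg, height=30, width=4, size=48):
    """Synthetic test image: one stroke sheared by theta_deg to the right."""
    img = np.zeros((size, size), dtype=np.uint8)
    shear = np.tan(np.radians(theta_deg))
    bottom = size - 8
    for k in range(height):                  # k counts rows upward from the baseline
        x0 = 10 + int(round(k * shear))
        img[bottom - k, x0:x0 + width] = 1
    return img


if __name__ == "__main__":
    for theta in (0, 15):
        stroke = make_stroke(theta)
        angle = slant_angle_deg(vertical_edge_pixels(stroke))
        print(f"sheared by {theta} deg -> estimated slant {angle:.1f} deg, "
              f"italic: {is_italic(stroke)}")
```

This sketch works on a single stroke. On a whole word the edge pixels spread horizontally and the first principal axis would follow the text line instead, so a practical detector would need to apply the estimate per stroke or per connected component and then aggregate.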
Keywords
Document indexing, Digital library, Scholarly article categorization, Document retrieval