A detailed library perspective on nearly unsupervised information extraction workflows in digital libraries

Hermann Kroll, Jan Pirklbauer, Florian Plötzky, Wolf-Tilo Balke

2022 ACM/IEEE Joint Conference on Digital Libraries (JCDL) (2023)

Abstract

Information extraction can support novel and effective access paths for digital libraries. Nevertheless, designing reliable extraction workflows can be cost-intensive in practice. On the one hand, suitable extraction methods rely on domain-specific training data. On the other hand, unsupervised and open extraction methods usually produce non-canonicalized extraction results. This paper extends our original work and addresses how digital libraries can handle such extractions and whether their quality is sufficient in practice. We focus on unsupervised extraction workflows and analyze them in case studies in three domains: encyclopedias (Wikipedia), pharmacy, and political sciences. As an extension, we analyze the extractions in more detail, verify our findings with a second extraction method, discuss another canonicalization method, and give an outlook on how non-English texts can be handled. Based on these analyses, we report on opportunities and limitations. Finally, we discuss best practices for unsupervised extraction workflows.

Keywords

Open information extraction, Extraction workflows, Digital libraries
