Information Extraction Workflow for Digitised Entry-based Documents

DARIAH Annual Event 2020 (2020)

The massive retro-digitisation of legacy paper resources in the last decade, along with the constant growth of compiled unstructured digital text, has created an imbalance: existing ad hoc techniques for exploiting such resources cannot keep pace with the large stream of emerging corpora. In this workshop we address this issue and present an exploratory workflow, implemented in two state-of-the-art infrastructures, for Information Extraction (IE) from documents with an entry-based structure and diverse content.

IE in Digital Humanities (DH) has always been a serious challenge for researchers dealing with modern or legacy text resources [1, 2]. GROBID-Dictionaries is a project launched to fill this gap by accelerating the modelling and structuring of resources in the field of lexicography. The first version of this machine learning infrastructure focused on structuring digitised dictionaries into TEI-compliant resources [3]. In GROBID-Dictionaries, the activation of cascading IE models follows an exploratory process based on the MATTER workflow [4]. Through a multistage annotation and curation process, a user of the tool gradually discovers the structure and the variation of the information in a target document.