How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures

Mohamed Khemakhem, Ioana Galleron, Geoffrey Williams, Laurent Romary, Pedro Ortiz Suárez

(2019)

Abstract

In the last decade, progress in OCR has triggered a massive trend towards the digitisation of legacy documents, with several Digital Humanities projects exploring means for structuring retro-digitised dictionaries. However, there is a lack of awareness of the impact of OCR quality on the information extraction process. In this work, we shed light on the relationship between these two steps through experiments carried out with a TEI-based system for automatic parsing of dictionaries. Our work concerns “the Basnage”, a complex dictionary resulting from the complete revision and enlargement in 1701 of the ‘Dictionnaire Universel’ of Abbé Furetière, initially published in 1690. In order to obtain an XML/TEI version of this work, we use GROBID-Dictionaries [1, 2], a machine learning system for cascade parsing and extraction of TEI structure in dictionaries. The tool’s models have been tested on different categories of entry-based documents with lexical and encyclopedic content. We used two differently OCRed versions of the first volume of the Basnage, following the process described in an earlier experiment [3], which relies on the power of iterative training of HTR models of the Transkribus framework:
