Making Two Vast Historical Manuscript Collections Searchable and Extracting Meaningful Textual Features Through Large-Scale Probabilistic Indexing

Alejandro Héctor Toselli, Verónica Romero-Gomez, Joan-Andreu Sánchez, Enrique Vidal-Ruiz

2019 International Conference on Document Analysis and Recognition (ICDAR) (2019)


Abstract

Textual access to large collections of digitized images remains infeasible because they usually lack transcripts. Transcribing such collections is, in turn, typically prohibitively expensive. However, probabilistic indices can enable textual access with only moderate resource demands. Besides allowing effortless information retrieval, we show that probabilistic indices can also be used to estimate textual features of the indexed but otherwise untranscribed collections, such as the number of running words and Zipf's curves. Complete probabilistic indices have recently been produced for two iconic large collections: "Bentham" (90K images) and "Spanish Golden Age Theater" (40K images). To show the impact of making these collections searchable, we provide access statistics gathered through their corresponding search interfaces. To the best of our knowledge, this is the first publication of large collections of untranscribed manuscripts that are now publicly available for effective and efficient textual access.
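As a rough illustration of how textual features might be estimated from a probabilistic index, the following is a minimal sketch and not the authors' implementation. It assumes the index can be flattened into hypothetical `(word, relevance_probability)` entries, and that summing those probabilities per word approximates the expected word frequency. The toy data and the function names are invented for the example.

```python
from collections import defaultdict


def expected_word_counts(index_entries):
    """Sum relevance probabilities per word to get expected frequencies.

    index_entries: iterable of (word, probability) pairs (assumed format).
    """
    counts = defaultdict(float)
    for word, prob in index_entries:
        counts[word] += prob
    return dict(counts)


def estimated_running_words(counts):
    """Estimate the total number of running words as the sum of expected counts."""
    return sum(counts.values())


def zipf_curve(counts):
    """Return (rank, expected frequency) pairs, most frequent word first."""
    ranked = sorted(counts.values(), reverse=True)
    return list(enumerate(ranked, start=1))


# Hypothetical toy index: each entry is one spotted word hypothesis.
entries = [
    ("the", 0.75), ("the", 0.5), ("law", 0.5),
    ("the", 0.75), ("law", 0.25), ("panopticon", 0.5),
]

counts = expected_word_counts(entries)
total = estimated_running_words(counts)
curve = zipf_curve(counts)
```

Plotting the `curve` pairs on log-log axes would give an estimated Zipf curve for the collection without requiring any transcripts.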


Keywords

search on large historical manuscript collections, probabilistic indexing and search, Zipf's law, keyword spotting, handwritten text
