Multi-Type-TD-TSR – Extracting Tables from Document Images Using a Multi-stage Pipeline for Table Detection and Table Structure Recognition: From OCR to Structured Table Representations

ADVANCES IN ARTIFICIAL INTELLIGENCE, KI 2021 (2021)

Abstract
As industries become increasingly data-driven, demand is growing rapidly for automated algorithms that convert images of scanned documents into machine-readable information. Beyond digitization, there is a push to automate processes that used to require manual inspection of documents. Although optical character recognition (OCR) technologies have largely solved the task of converting human-readable characters from images into text, the task of extracting tables has received less attention. Table recognition consists of two sub-tasks: table detection and table structure recognition. Most prior work focuses on one of the two tasks without offering an end-to-end solution or accounting for real-world conditions such as rotated images or noise artefacts. Because sufficiently large datasets are lacking, recent work shows a clear trend towards deep learning with transfer learning for table structure recognition. We present a multi-stage pipeline named Multi-Type-TD-TSR, which offers an end-to-end solution for table recognition. It uses state-of-the-art deep learning models and differentiates between three types of tables based on their borders. For table structure recognition we use a deterministic, non-data-driven algorithm that works on all three types. In addition, we present an algorithm for non-bordered tables and one for bordered tables, which together form the basis of our table structure recognition algorithm. We evaluate Multi-Type-TD-TSR on a self-annotated subset of the ICDAR 2019 table structure recognition dataset [5] and achieve a new state of the art. Source code is available at https://github.com/Psarpei/Multi-Type-TD-TSR.
Keywords
Multi-Type-TD-TSR, Table detection, Table structure recognition, Multi-stage pipeline, OCR
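
The abstract mentions a deterministic, non-data-driven approach to table structure recognition but does not spell out its steps. As a rough illustration of what such an approach can look like for a non-bordered table, the sketch below uses projection profiles: it sums ink pixels along rows and columns of a binary image and treats empty gaps as row and column separators. This is a swapped-in textbook technique, not the authors' algorithm; the names `runs_of_ink` and `segment_cells` and the toy image are hypothetical and exist only for this example.

```python
from typing import List, Tuple

Grid = List[List[int]]  # 1 = ink (text pixel), 0 = background
Box = Tuple[int, int, int, int]  # (top, bottom, left, right), half-open ranges


def runs_of_ink(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where the profile is non-zero."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i  # a run of ink begins here
        elif not value and start is not None:
            runs.append((start, i))  # the run ended at the first empty index
            start = None
    if start is not None:
        runs.append((start, len(profile)))  # run reaches the image border
    return runs


def segment_cells(image: Grid) -> List[List[Box]]:
    """Split a binary table image into a grid of cell boxes.

    Assumption (not from the source): rows and columns are separated by
    completely empty pixel rows and columns, with no skew or noise.
    """
    row_profile = [sum(row) for row in image]        # ink per pixel row
    col_profile = [sum(col) for col in zip(*image)]  # ink per pixel column
    rows = runs_of_ink(row_profile)
    cols = runs_of_ink(col_profile)
    return [[(t, b, l, r) for (l, r) in cols] for (t, b) in rows]


# Toy 4x6 "image": two text rows and two text columns separated by whitespace.
toy = [
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1],
]
cells = segment_cells(toy)
```

On real scans this simple scheme breaks down with rotation, noise, or cells that span several rows or columns, which are the kinds of real-world conditions the abstract says an end-to-end pipeline has to handle.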