Training Full-Page Handwritten Text Recognition Models without Annotated Line Breaks
2019 International Conference on Document Analysis and Recognition (ICDAR)(2019)
摘要
Training Handwritten Text Recognition (HTR) models typically requires large amounts of labeled data which often are line or page images with corresponding line-level ground truth (GT) transcriptions. Many digital collections have page-level transcriptions for each image, but the transcription is unformatted, i.e., line breaks are not annotated. Can we train lined-based HTR models using such data? In this work, we present a novel alignment technique for segmenting page-level GT text into text lines during HTR model training. This text segmentation problem is formulated as an optimization problem to minimize the cost of aligning predicted lines with the GT text. Using both simulated and HTR model predictions, we show that the alignment method identifies line breaks accurately, even when the predicted lines have high character error rates (CER). We removed the GT line breaks from the ICDAR-2017 READ dataset and trained a HTR model using the proposed alignment method to predict line breaks on-the-fly. This model achieves comparable CER w.r.t. to the same model trained with the GT line breaks. Additionally, we downloaded an online digital collection of 50K English journal pages (not curated for HTR research) whose transcriptions do not contain line breaks, and achieve 11% CER.
更多查看译文
关键词
Handwriting Recognition,Deep Learning,Text Alignment
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络