CREPE: Coordinate-Aware End-to-End Document Parser
arXiv (2024)
Abstract
In this study, we formulate an OCR-free sequence generation model for visual
document understanding (VDU). Our model not only parses text from document
images but also extracts the spatial coordinates of the text using a
multi-head architecture. Named Coordinate-aware End-to-end Document Parser
(CREPE), our method integrates these capabilities by introducing a special
token for OCR text and token-triggered coordinate decoding. We also propose a
weakly supervised framework for cost-efficient training, requiring only
parsing annotations without high-cost coordinate annotations. Our
experimental evaluations demonstrate CREPE's state-of-the-art performance on
document parsing tasks. Beyond that, CREPE's adaptability is further
highlighted by its successful use in other document understanding tasks such
as layout analysis and document visual question answering. CREPE's abilities,
including OCR and semantic parsing, not only mitigate error propagation
issues in existing OCR-dependent methods but also significantly enhance the
functionality of sequence generation models, ushering in a new era for
document understanding studies.
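
The idea of token-triggered coordinate decoding can be illustrated with a minimal sketch. It is not the authors' implementation: the token names `<ocr>`, `</ocr>`, and `</s>`, the function `decode_with_coordinates`, and the scripted stand-in model are all assumptions made for illustration. The sketch only shows the control flow in which a text head emits tokens one at a time and a separate coordinate head is invoked on the decoder state whenever the closing OCR token appears.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical special tokens; the abstract only says "a special token for OCR text".
OCR_START = "<ocr>"
OCR_END = "</ocr>"
EOS = "</s>"

Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1), normalized to [0, 1]
Hidden = Sequence[float]                  # stand-in for a decoder hidden state
StepFn = Callable[[List[str]], Tuple[str, Hidden]]


def decode_with_coordinates(
    step_fn: StepFn,
    coord_head: Callable[[Hidden], Box],
    max_steps: int = 64,
) -> Tuple[List[str], List[Tuple[str, Box]]]:
    """Greedy decoding loop: the text head emits tokens, and the coordinate
    head runs only when the closing OCR token is emitted."""
    tokens: List[str] = []
    spans: List[Tuple[str, Box]] = []
    current: Optional[List[str]] = None   # tokens of the OCR span being built
    for _ in range(max_steps):
        token, hidden = step_fn(tokens)
        if token == EOS:
            break
        tokens.append(token)
        if token == OCR_START:
            current = []
        elif token == OCR_END and current is not None:
            # Trigger: decode a box from the state at the closing token.
            spans.append((" ".join(current), coord_head(hidden)))
            current = None
        elif current is not None:
            current.append(token)
    return tokens, spans


def scripted_model(script: List[Tuple[str, Hidden]]) -> StepFn:
    """Toy stand-in for a decoder: replays a fixed (token, hidden) script."""
    def step_fn(prefix: List[str]) -> Tuple[str, Hidden]:
        return script[len(prefix)]
    return step_fn


def toy_coord_head(hidden: Hidden) -> Box:
    """Toy coordinate head: treats the hidden state as a box and clamps it."""
    x0, y0, x1, y1 = (min(1.0, max(0.0, v)) for v in hidden)
    return (x0, y0, x1, y1)


ZERO = (0.0, 0.0, 0.0, 0.0)
SCRIPT = [
    (OCR_START, ZERO), ("Total", ZERO), (OCR_END, (0.1, 0.2, 0.3, 0.25)),
    (OCR_START, ZERO), ("42.00", ZERO), (OCR_END, (0.6, 0.2, 0.8, 0.25)),
    (EOS, ZERO),
]

tokens, spans = decode_with_coordinates(scripted_model(SCRIPT), toy_coord_head)
for text, box in spans:
    print(text, box)
```

In a real model, `step_fn` would be an image-conditioned decoder step and `coord_head` a learned regression or classification head. Both are replaced here by toy functions so that the loop can run on its own.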