Read, Spell and Repeat: Scene Text Recognition with Vision-Language Circular Refinement

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

Abstract
Scene Text Recognition (STR) has long been considered an important yet challenging task in computer vision. Recent work has shown that language information helps with visually difficult images, such as those with occlusion or blurring. However, relying on language information sometimes leads to over-correction. For out-of-vocabulary samples (e.g. "hou" and "0x4a"), some methods are biased toward the language side and over-correct the prediction (e.g. "hou" becomes "hot"). This imbalance between vision and language limits the use of such models in practical scenarios, yet it rarely occurs in human reading. To address this issue, we rethink the human recognition process and propose a model that operates in the order "Read, Spell and Repeat", refining its recognition circularly with vision and language information. With this mechanism, our model integrates vision and language information more effectively, achieving higher accuracy with fewer parameters than the baseline and performance competitive with SOTA methods on the standard benchmarks.
Keywords
Scene Text Recognition, Neural Network, Deep Learning