AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting
European Conference on Computer Vision, pp. 457–473 (2020)
Scene text spotting aims to detect and recognize the entire word or sentence with multiple characters in natural images. It is still challenging because ambiguity often occurs when the spacing between characters is large or the characters are evenly spread in multiple rows and columns, making many visually plausible groupings of the characters […]
- Text analysis in unconstrained scene images like text detection and text recognition is important in many applications, such as document recognition, license plate recognition, and visual question answering based on texts.
- This work addresses one of the important challenges, which is reducing the ambiguous bounding box proposals in scene text detection.
- These ambiguous proposals widely occur when the spacing of the characters of a word is large or multiple text lines are juxtaposed in different rows or columns in an image.
- As shown in Fig. 1(c), these vision-based text detectors are insufficient to detect text lines correctly in ambiguous samples.
- In the re-scoring step, we propose a language module (LM) that learns linguistic representation to re-score the candidate text lines and eliminate ambiguity, so that text lines corresponding to natural language receive higher scores than those that do not.
- Linguistic representation is utilized in scene text detection to deal with the problem of text detection ambiguity
- LM can effectively lower the scores of incorrect text lines while improving the scores of correct proposals.
- Extensive experiments demonstrate the advantages of our method, especially in scenarios of text detection ambiguity
- Fig. 3 shows the overall architecture of AE TextSpotter, which consists of two vision-based modules and one language-based module, namely, the text detection module (TDM), the character-based recognition module (CRM), and the language module (LM)
- Among these modules, TDM and CRM aim to detect the bounding boxes and recognize the content of candidate text lines; and LM is applied to lower the scores of incorrect text lines by utilizing linguistic features, which is the key module to remove ambiguous samples.
- The authors carefully select a set of extremely ambiguous samples from the IC19-ReCTS dataset, where the approach surpasses other methods by more than 4%.
- On TDA-ReCTS, the model with LM obtains an F-measure of 81.39% and a 1-NED of 51.32%, significantly surpassing the model without LM by 3.46% and 3.57%, respectively.
- As shown in Table 5, AE TextSpotter achieves an F-measure of 91.80% and a 1-NED of 71.81%, surpassing other methods.
- As demonstrated in previous experiments, the proposed AE TextSpotter works well in most cases, including scenarios of text detection ambiguity.
- The authors proposed a novel text spotter, termed AE TextSpotter, which introduces linguistic representation to eliminate ambiguity in text detection.
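The scoring flow summarized above (visual modules propose and recognize candidate text lines, then the LM re-scores them) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the blending weight, threshold, and all names are hypothetical.

```python
# Hypothetical sketch of the AE TextSpotter scoring flow: each candidate
# text line carries a visual score (from the TDM) and a linguistic score
# (from the LM); blending the two lets linguistically plausible candidates
# outrank visually similar but implausible groupings of the same characters.

def rescore(candidates, weight=0.5):
    """Blend visual and linguistic scores per candidate.
    `weight` is a made-up hyper-parameter for illustration."""
    return [
        {**c, "score": (1 - weight) * c["visual"] + weight * c["linguistic"]}
        for c in candidates
    ]

def filter_candidates(candidates, threshold=0.6):
    """Keep candidates whose blended score clears the threshold."""
    return [c for c in rescore(candidates) if c["score"] >= threshold]

# Two groupings of the same characters with equal visual scores: one reads
# as natural language, the other does not; only the former survives.
cands = [
    {"text": "OPEN HOUSE", "visual": 0.9, "linguistic": 0.8},
    {"text": "OH PO EU NS E", "visual": 0.9, "linguistic": 0.1},
]
kept = filter_candidates(cands)
```

This captures the key idea of the re-scoring step: ambiguity is resolved not by vision alone but by how well each candidate reads as language.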
- Table1: The proportion of text lines with the problem of text detection ambiguity
- Table2: The recall of TDM and the number of candidate text lines per image under different post-processing thresholds
- Table3: The time cost per image and 1-NED of different recognizers
- Table4: The single-scale results on TDA-ReCTS. “P”, “R”, “F” and “1-NED” mean the precision, recall, F-measure, and normalized edit distance [32], respectively
- Table5: The single-scale results on the IC19-ReCTS test set. “P”, “R”, “F” and “1-NED” represent the precision, recall, F-measure, and normalized edit distance, respectively. “*” denotes the methods in competition [32], which use extra datasets, multi-scale testing, and model ensemble. “800×” means that the short side of input images is scaled to 800
- Table6: The time cost of all modules in AE TextSpotter
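The 1-NED figures quoted above are one minus the edit distance between the predicted and ground-truth strings, normalized (assuming the usual ReCTS convention) by the longer string's length. A minimal self-contained sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a single-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            cur = row[j]
            row[j] = min(row[j] + 1,         # delete from a
                         row[j - 1] + 1,     # insert into a
                         prev + (ca != cb))  # substitute
            prev = cur
    return row[-1]

def one_minus_ned(pred: str, gt: str) -> float:
    """1-NED: 1.0 is a perfect match, 0.0 means nothing matches."""
    if not pred and not gt:
        return 1.0
    return 1.0 - edit_distance(pred, gt) / max(len(pred), len(gt))
```

For example, `one_minus_ned("abc", "abd")` is 2/3: one substitution over a length-3 string.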
- Scene text detection has been a research hotspot in computer vision for a long period, and methods based on deep learning have become the mainstream. Tian et al. and Liao et al. successfully adopted object detection frameworks for text detection and achieved good performance on horizontal text. After that, many works [33,24,4,14,17] took the orientation of text lines into consideration and made it possible to detect arbitrarily oriented text lines. Recently, curved text detection has attracted increasing attention, and segmentation-based methods [20,12,30,31] have achieved excellent performance on curved text benchmarks. These methods raise text detection performance to a high level, but none of them can deal with the ambiguity problem. In this work, we introduce linguistic features in the text detection module to solve the text detection ambiguity problem.
- This work is supported by the Natural Science Foundation of China under Grant 61672273 and Grant 61832008, the Science Foundation for Distinguished Young Scholars of Jiangsu under Grant BK20160021, and the Scientific Foundation of State Grid Corporation of China (Research on Ice-wind Disaster Feature Recognition and Prediction by Few-shot Machine Learning in Transmission Lines).
- Chunhua Shen and his employer received no financial support for the research, authorship and publication of this paper
- Cheng, Z., Bai, F., Xu, Y., Zheng, G., Pu, S., Zhou, S.: Focusing attention: Towards accurate text recognition in natural images. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5076–5084 (2017)
- Cheng, Z., Xu, Y., Bai, F., Niu, Y., Pu, S., Zhou, S.: Aon: Towards arbitrarily-oriented text recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5571–5579 (2018)
- Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
- Deng, D., Liu, H., Li, X., Cai, D.: Pixellink: Detecting scene text via instance segmentation. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
- Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
- Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
- Feng, W., He, W., Yin, F., Zhang, X.Y., Liu, C.L.: Textdragon: An end-to-end framework for arbitrary shaped text spotting. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9076–9085 (2019)
- Graves, A., Fernandez, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning. pp. 369–376. ACM (2006)
- He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)
- He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
- Li, H., Wang, P., Shen, C.: Towards end-to-end text spotting with convolutional recurrent neural networks. arXiv preprint arXiv:1707.03985 (2017)
- Li, X., Wang, W., Hou, W., Liu, R.Z., Lu, T., Yang, J.: Shape robust text detection with progressive scale expansion network. arXiv preprint arXiv:1806.02559 (2018)
- Liao, M., Lyu, P., He, M., Yao, C., Wu, W., Bai, X.: Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)
- Liao, M., Shi, B., Bai, X.: Textboxes++: A single-shot oriented scene text detector. IEEE transactions on image processing 27(8), 3676–3690 (2018)
- Liao, M., Shi, B., Bai, X., Wang, X., Liu, W.: Textboxes: A fast text detector with a single deep neural network. In: AAAI. pp. 4161–4167 (2017)
- Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2117–2125 (2017)
- Liu, J., Liu, X., Sheng, J., Liang, D., Li, X., Liu, Q.: Pyramid mask text detector. arXiv preprint arXiv:1903.11800 (2019)
- Liu, W., Chen, C., Wong, K.Y.K., Su, Z., Han, J.: Star-net: A spatial attention residue network for scene text recognition. In: BMVC. vol. 2, p. 7 (2016)
- Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., Yan, J.: Fots: Fast oriented text spotting with a unified network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5676–5685 (2018)
- Long, S., Ruan, J., Zhang, W., He, X., Wu, W., Yao, C.: Textsnake: A flexible representation for detecting text of arbitrary shapes. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 20–36 (2018)
- Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. pp. 8024–8035 (2019)
- Qin, S., Bissacco, A., Raptis, M., Fujii, Y., Xiao, Y.: Towards unconstrained endto-end text spotting. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4704–4714 (2019)
- Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. pp. 91–99 (2015)
- Shi, B., Bai, X., Belongie, S.: Detecting oriented text in natural images by linking segments. arXiv preprint arXiv:1703.06520 (2017)
- Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence 39(11), 2298–2304 (2016)
- Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., Bai, X.: Aster: An attentional scene text recognizer with flexible rectification. IEEE transactions on pattern analysis and machine intelligence (2018)
- Tian, Z., Huang, W., He, T., He, P., Qiao, Y.: Detecting text in natural image with connectionist text proposal network. In: European Conference on Computer Vision. pp. 56–72. Springer (2016)
- Wang, J., Hu, X.: Gated recurrent convolution neural network for ocr. In: Advances in Neural Information Processing Systems. pp. 335–344 (2017)
- Wang, W., Xie, E., Li, X., Hou, W., Lu, T., Yu, G., Shao, S.: Shape robust text detection with progressive scale expansion network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9336–9345 (2019)
- Wang, W., Xie, E., Song, X., Zang, Y., Wang, W., Lu, T., Yu, G., Shen, C.: Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In: Proceedings of the IEEE International Conference on Computer Vision (2019)
- Xie, E., Zang, Y., Shao, S., Yu, G., Yao, C., Li, G.: Scene text detection with supervised pyramid context network. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 9038–9045 (2019)
- Zhang, R., Zhou, Y., Jiang, Q., Song, Q., Li, N., Zhou, K., Wang, L., Wang, D., Liao, M., Yang, M., et al.: Icdar 2019 robust reading challenge on reading chinese text on signboard. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 1577–1581. IEEE (2019)
- Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., Liang, J.: East: An efficient and accurate scene text detector. arXiv preprint arXiv:1704.03155 (2017)