An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11 (2017): 2298-2304

Abstract

Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed.

Introduction
  • The community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks.
  • Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label.
  • Recognition of such objects can be naturally cast as a sequence recognition problem.
  • Another unique property of sequence-like objects is that their lengths may vary drastically.
  • The most popular deep models, such as DCNN [25, 26], cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions and are incapable of producing a variable-length label sequence; a minimal sketch of how CRNN sidesteps this follows this list.
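
To make the variable-length point concrete, here is a minimal, hypothetical PyTorch sketch of the CRNN idea (the layer sizes are illustrative, not the paper's exact Table 1 configuration): a CNN collapses the image height so each remaining column of the feature map becomes one frame of a feature sequence, a bidirectional LSTM models that sequence, and a per-frame linear layer scores the characters plus a CTC blank. Because a wider image yields more columns, the output sequence length varies with the input, and no fully connected layer over the whole feature map is needed, which is also why CRNN stays compact.

```python
# A minimal sketch of the CRNN idea, assuming PyTorch; layer sizes here are
# hypothetical, not the exact configuration of Table 1 in the paper.
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Convolutional layers: pool the height down to 1 so that each
        # remaining column corresponds to a receptive field on the input.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height, keep width
        )
        # Recurrent layers: a bidirectional LSTM over the column sequence.
        self.rnn = nn.LSTM(256, 128, bidirectional=True, batch_first=True)
        # Per-frame projection to class scores (characters + CTC blank).
        self.fc = nn.Linear(2 * 128, num_classes)

    def forward(self, images):                   # images: (B, 1, H, W)
        feats = self.cnn(images)                 # (B, 256, 1, W')
        seq = feats.squeeze(2).permute(0, 2, 1)  # (B, W', 256): one frame/column
        out, _ = self.rnn(seq)                   # (B, W', 256)
        return self.fc(out)                      # (B, W', num_classes)

model = TinyCRNN(num_classes=37)                 # 26 letters + 10 digits + blank
logits = model(torch.randn(2, 1, 32, 100))       # wider input -> longer sequence
```
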
Highlights
  • The community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks.
  • We have presented a novel neural network architecture, called Convolutional Recurrent Neural Network (CRNN), which integrates the advantages of both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN)
  • As CRNN abandons the fully connected layers used in conventional neural networks, it results in a much more compact and efficient model.
  • The experiments on the scene text recognition benchmarks demonstrate that CRNN achieves superior or highly competitive performance compared with conventional methods as well as other CNN- and RNN-based algorithms.
  • CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN.
Methods
  • To evaluate the effectiveness of the proposed CRNN model, the authors conducted experiments on standard benchmarks for scene text recognition and musical score recognition, which are both challenging vision tasks.
  • The dataset contains 8 million training images and their corresponding ground-truth words.
  • Such images are generated by a synthetic text engine and are highly realistic.
  • Even though the CRNN model is trained purely on synthetic text data, it works well on real images from standard text recognition benchmarks; a sketch of one such training step follows this list.
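
Below is a hedged sketch of one CTC training step, continuing the TinyCRNN sketch from the introduction. The real setup (the roughly 8-million-image synthetic dataset, per-image normalization, ADADELTA with the paper's hyper-parameters) differs in detail; the batch and labels here are stand-ins. PyTorch's nn.CTCLoss expects log-probabilities shaped (T, B, C).

```python
# One CTC training step, continuing the TinyCRNN sketch above (hypothetical
# batch and labels; the paper trains on ~8M synthetic word images).
import torch
import torch.nn.functional as F

model = TinyCRNN(num_classes=37)            # from the sketch in the introduction
ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adadelta(model.parameters())   # the paper uses ADADELTA

images = torch.randn(2, 1, 32, 100)         # stand-in batch of word images
targets = torch.tensor([3, 1, 20, 2, 9])    # concatenated label indices (non-blank)
target_lengths = torch.tensor([3, 2])       # lengths of the two label sequences

logits = model(images)                                       # (B, T, C)
log_probs = F.log_softmax(logits, dim=2).permute(1, 0, 2)    # (T, B, C)
input_lengths = torch.full((2,), log_probs.size(0), dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
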
Results
  • In the unconstrained-lexicon case, the method achieves the best performance on SVT, yet still falls behind some approaches [8, 22] on IC03 and IC13.
Conclusion
  • The authors have presented a novel neural network architecture, called Convolutional Recurrent Neural Network (CRNN), which integrates the advantages of both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).
  • The experiments on the scene text recognition benchmarks demonstrate that CRNN achieves superior or highly competitive performance compared with conventional methods as well as other CNN- and RNN-based algorithms.
  • This confirms the advantages of the proposed algorithm.
  • CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN.
Tables
  • Table 1: Network configuration summary. The first row is the top layer. ‘k’, ‘s’ and ‘p’ stand for kernel size, stride and padding size, respectively.
  • Table 2: Recognition accuracies (%) on four datasets. In the second row, “50”, “1k”, “50k” and “Full” denote the lexicon used, and “None” denotes recognition without a lexicon. (*[22] is not lexicon-free in the strict sense, as its outputs are constrained to a 90k dictionary.)
  • Table 3: Comparison among various methods. Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions; a one-line helper for this count is sketched after this list).
  • Table 4: Comparison of pitch recognition accuracies, among CRNN and two commercial OMR systems, on the three collected datasets.
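
The model-size attribute referenced in the Table 3 entry is just a trainable-parameter count; assuming PyTorch, a small helper reporting it in the table's “M” unit could look like this:

```python
import torch.nn as nn

def model_size_m(model: nn.Module) -> float:
    """Trainable parameters in millions, the 'M' unit used in Table 3."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```
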
Funding
  • This work was primarily supported by the National Natural Science Foundation of China (NSFC) (No. 61222308)
Study Subjects and Analysis
public datasets: 4
Comparative Evaluation. All the recognition accuracies on the above four public datasets, obtained by the proposed CRNN model and recent state-of-the-art techniques, including the approaches based on deep models [23, 22, 21], are shown in Table 2. In the constrained-lexicon cases, our method consistently outperforms most state-of-the-art approaches, and on average beats the best text reader proposed in [22].
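
A simplified sketch of what lexicon-constrained transcription can look like, assuming the TinyCRNN outputs above: decode greedily without a lexicon (collapse repeated labels, then drop CTC blanks), and snap the result to the nearest lexicon word by edit distance. The paper additionally accelerates large-lexicon search with a BK-tree [9]; that index is omitted here.

```python
# Hedged sketch of lexicon-constrained decoding over per-frame logits (T, C).
import itertools
import torch

ALPHABET = "-abcdefghijklmnopqrstuvwxyz0123456789"  # index 0 is the CTC blank

def greedy_decode(frame_logits):
    """Best-path decoding: collapse repeats, then remove blanks."""
    best = frame_logits.argmax(dim=1).tolist()
    collapsed = [k for k, _ in itertools.groupby(best)]
    return "".join(ALPHABET[k] for k in collapsed if k != 0)

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lexicon_decode(frame_logits, lexicon):
    """Pick the lexicon word closest to the lexicon-free decoding."""
    raw = greedy_decode(frame_logits)
    return min(lexicon, key=lambda w: edit_distance(raw, w))

word = lexicon_decode(torch.randn(25, 37), ["hello", "help", "hold"])
```
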

datasets: 3
The collected images are augmented to 265k training samples by being rotated, scaled and corrupted with noise, and by replacing their backgrounds with natural images. For testing, we create three datasets: 1) “Clean”, which contains 260 images collected from [2]. Examples are shown in Fig. 5.a; 2) “Synthesized”, which is created from “Clean”, using the augmentation strategy mentioned above
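
A hedged sketch of the rotation/scaling/noise part of that augmentation, assuming torchvision; the transforms and magnitudes below are stand-ins, and the natural-image background replacement is omitted.

```python
# Illustrative augmentation pipeline (hypothetical magnitudes, torchvision).
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=3),                       # small rotations
    transforms.RandomResizedCrop((64, 512), scale=(0.8, 1.0)),  # rescaling
    transforms.ToTensor(),
    transforms.Lambda(lambda x: (x + 0.05 * torch.randn_like(x)).clamp(0, 1)),
])
# aug = augment(pil_image)  # pil_image: a PIL.Image of a score fragment
```
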

samples: 200
Examples are shown in Fig. 5.a; 2) “Synthesized”, which is created from “Clean”, using the augmentation strategy mentioned above. It contains 200 samples, some of which are shown in Fig. 5.b; 3) “Real-World”, which contains 200 images of score fragments taken from music books with a phone camera. Examples are shown in Fig. 5.c.

References
  • [1] http://hunspell.sourceforge.net/
  • [2] https://musescore.com/sheetmusic
  • [3] http://www.capella.de/us/index.
  • [4] http://www.sibelius.com/products/photoscore/ultimate.html
  • [5] J. Almazan, A. Gordo, A. Fornes, and E. Valveny. Word spotting and recognition with embedded attributes. PAMI, 36(12):2552–2566, 2014.
  • [6] O. Alsharif and J. Pineau. End-to-end text recognition with hybrid HMM maxout models. In ICLR, 2014.
  • [7] Y. Bengio, P. Y. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. NN, 5(2):157–166, 1994.
  • [8] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. PhotoOCR: Reading text in uncontrolled conditions. In ICCV, 2013.
  • [9] W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. Commun. ACM, 16(4):230–236, 1973.
  • [10] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
  • [11] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. JMLR, 3:115–143, 2002.
  • [12] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [13] V. Goel, A. Mishra, K. Alahari, and C. V. Jawahar. Whole is greater than sum of parts: Recognizing scene text words. In ICDAR, 2013.
  • [14] A. Gordo. Supervised mid-level features for word image representation. In CVPR, 2015.
  • [15] A. Graves, S. Fernandez, F. J. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006.
  • [16] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. PAMI, 31(5):855–868, 2009.
  • [17] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
  • [18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [20] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In NIPS Deep Learning Workshop, 2014.
  • [21] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Deep structured output learning for unconstrained text recognition. In ICLR, 2015.
  • [22] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. IJCV (accepted), 2015.
  • [23] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, 2014.
  • [24] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. Almazan, and L. de las Heras. ICDAR 2013 robust reading competition. In ICDAR, 2013.
  • [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [27] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, K. Ashida, H. Nagai, M. Okamoto, H. Yamamoto, H. Miyao, J. Zhu, W. Ou, C. Wolf, J. Jolion, L. Todoran, M. Worring, and X. Lin. ICDAR 2003 robust reading competitions: Entries, results, and future directions. IJDAR, 7(2-3):105–122, 2005.
  • [28] A. Mishra, K. Alahari, and C. V. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
  • [29] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. S. Marcal, C. Guedes, and J. S. Cardoso. Optical music recognition: State-of-the-art and open issues. IJMIR, 1(3):173–190, 2012.
  • [30] J. A. Rodríguez-Serrano, A. Gordo, and F. Perronnin. Label embedding: A frugal baseline for text recognition. IJCV, 113(3):193–207, 2015.
  • [31] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. In Neurocomputing: Foundations of Research, pages 696–699. MIT Press, 1988.
  • [32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [33] B. Su and S. Lu. Accurate scene text recognition based on recurrent neural network. In ACCV, 2014.
  • [34] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, 2011.
  • [35] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng. End-to-end text recognition with convolutional neural networks. In ICPR, 2012.
  • [36] C. Yao, X. Bai, B. Shi, and W. Liu. Strokelets: A learned multi-scale representation for scene text recognition. In CVPR, 2014.
  • [37] M. D. Zeiler. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701, 2012.