Convolutional neural networks applied to house numbers digit classification

International Conference on Pattern Recognition (ICPR), 2012, Pages 3288–3291. arXiv:1204.3968.

Abstract:

We classify digits of real-world house numbers using convolutional neural networks (ConvNets). ConvNets are hierarchical feature-learning neural networks whose structure is biologically inspired. Unlike many popular vision approaches that are hand-designed, ConvNets can automatically learn a unique set of features optimized for a given task.

Introduction
  • Character recognition in documents can be considered a solved task for computer vision, whether handwritten or typed.
  • ConvNets learn features all the way from pixels to the classifier.
  • The SVHN classification task consists of 32×32 cropped digit samples from street-level images.
  • The strength of ConvNets was previously shown, among others, in a traffic sign classification challenge [13], where two independent teams obtained the best performance against various other approaches using ConvNets [11, 2].
Highlights
  • Character recognition in documents can be considered a solved task for computer vision, whether handwritten or typed
  • [8] recently introduced a new digit classification dataset of house numbers extracted from street-level images
  • We use the traditional ConvNet architecture augmented with different pooling methods and with multi-stage features [11]
  • The performance of different pooling methods on the validation set can be seen in Figure 5
  • We show that using multi-stage features gives only a slight increase in performance, compared to the performance increase seen in other vision applications
Results
  • The authors use the traditional ConvNet architecture augmented with different pooling methods and with multi-stage features [11].
  • The ConvNet architecture is composed of repeatedly stacked feature stages.
  • Multi-Stage features (MS) are obtained by branching out outputs of all stages into the classifier (Figure 3).
  • They provide richer representations compared to Single-Stage features (SS) by adding complementary information such as local textures and fine details lost by higher levels.
  • MS features have consistently improved performance in other work [4, 11, 9] and in this work as well (Figure 4).
  • The authors observe minimal gains on this dataset compared to other types of objects such as pedestrians and traffic signs (Table 1).
  • The ConvNet has 2 stages of feature extraction and a two-layer non-linear classifier.
  • The output to the classifier includes inputs from the first layer, which provides local features/motifs to reinforce the global features.
  • The regularization constant and learning rate decay were tuned on the validation set.
  • The authors compare Lp-pooling for p = 1, 2, 4, 8, 12, 16, 32, and ∞ on the validation set, and use the best-performing pooling for the final test.
  • The performance of different pooling methods on the validation set can be seen in Figure 5.
  • Max-pooling, which corresponds to p = ∞, yielded a validation error rate of 7.57%.
  • The authors' experiments demonstrate a clear advantage of Lp-pooling with 1 < p < ∞ on this dataset, both in validation (Figure 5) and in test (average pooling is 3.58 points worse than L2-pooling in Table 2).
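The Lp-pooling operation compared in the bullets above can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code: it pools non-overlapping windows of a single 2D feature map, so that p = 1 reduces to average pooling and p → ∞ approaches max-pooling. The window size, the example input, and the absence of any spatial weighting are simplifying assumptions.

```python
import numpy as np

def lp_pool(x, p, size=2):
    """Non-overlapping Lp-pooling over size x size windows of a 2D map.

    y = (mean of |x|**p over each window) ** (1/p)
    p = 1 gives average pooling; p -> infinity approaches max-pooling.
    """
    h, w = x.shape
    h, w = h - h % size, w - w % size           # crop to a multiple of the window
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    if np.isinf(p):
        return np.abs(blocks).max(axis=(1, 3))  # limit case: max-pooling
    return (np.abs(blocks) ** p).mean(axis=(1, 3)) ** (1.0 / p)

# Toy 4x4 feature map, pooled down to 2x2.
fmap = np.array([[1., 2., 3., 4.],
                 [5., 6., 7., 8.],
                 [9., 1., 2., 3.],
                 [4., 5., 6., 7.]])

avg = lp_pool(fmap, p=1)       # average pooling
l2  = lp_pool(fmap, p=2)       # L2-pooling
mx  = lp_pool(fmap, p=np.inf)  # max-pooling
```

Intermediate values of p interpolate between averaging (which blurs strong activations) and the max (which keeps only the strongest one), which is consistent with the paper's finding that 1 < p < ∞ works best on this dataset.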
Conclusion
  • With L4 pooling, the authors obtain a state-of-the-art performance on the test set with an accuracy of 94.85% compared to the previous best of 90.6% (Table 2).
  • The authors show that using multi-stage features gives only a slight increase in performance, compared to the performance increase seen in other vision applications.
  • It is important to note that the approach is trained in a fully supervised manner, whereas the previous best methods rely on unsupervised feature learning.
  • In future work, the authors plan to run experiments with unsupervised learning, to measure how much of the accuracy improvement can be attributed to supervision.
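The multi-stage-feature idea mentioned in the conclusion (branching the outputs of all stages into the classifier) can be illustrated with a toy sketch. Nothing here is the authors' code: the linear-plus-ReLU stage, the shapes, and the random weights are stand-ins for the real convolution/pooling stages, chosen only to show how single-stage and multi-stage classifier inputs differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def stage(x, w):
    """One toy feature stage: linear map + ReLU (stands in for conv + pooling)."""
    return np.maximum(w @ x, 0.0)

x  = rng.standard_normal(32)        # input features (e.g. flattened pixels)
w1 = rng.standard_normal((16, 32))  # stage-1 weights
w2 = rng.standard_normal((8, 16))   # stage-2 weights

s1 = stage(x, w1)                   # first-stage features (local motifs, textures)
s2 = stage(s1, w2)                  # second-stage features (more global shapes)

single_stage = s2                        # SS: classifier sees only the top stage
multi_stage  = np.concatenate([s1, s2])  # MS: classifier also sees stage-1 output
```

The multi-stage vector carries the fine local detail that would otherwise be lost after the second stage, which is why MS features help most for multi-scale, textured objects.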
Tables
  • Table 1: Error-rate improvements of multi-stage features over single-stage features for different object detection and classification tasks. Improvements are significant for multi-scale and textured objects such as traffic signs and pedestrians, but minimal for house numbers.
  • Table 2: Performance reported by [8], with the additional supervised ConvNet achieving state-of-the-art accuracy of 94.85%.
Reference
  • Y. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in vision algorithms. In Proc. International Conference on Machine Learning, 2010.
  • D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. A committee of neural networks for traffic sign classification. In International Joint Conference on Neural Networks, pages 1918–1921, 2011.
  • T. E. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. In Proceedings of the International Conference on Computer Vision Theory and Applications, Lisbon, Portugal, February 2009.
  • J. Fan, W. Xu, Y. Wu, and Y. Gong. Human tracking using convolutional neural networks. IEEE Transactions on Neural Networks, 21(10):1610–1623, 2010.
  • A. Hyvärinen and U. Köster. Complex cell pooling and the statistics of natural images. Network: Computation in Neural Systems, 2005.
  • K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proc. International Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
  • Y. LeCun and C. Cortes. The MNIST database of handwritten digits.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • P. Sermanet, K. Kavukcuoglu, and Y. LeCun. Traffic signs and pedestrians vision with multi-scale convolutional networks. In Snowbird Machine Learning Workshop, 2011.
  • P. Sermanet, K. Kavukcuoglu, and Y. LeCun. EBLearn: Open-source energy-based learning in C++. In Proc. International Conference on Tools with Artificial Intelligence. IEEE, 2009.
  • P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. In Proceedings of the International Joint Conference on Neural Networks, 2011.
  • E. P. Simoncelli and D. J. Heeger. A model of neuronal responses in visual area MT, 1997.
  • J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In IEEE International Joint Conference on Neural Networks, pages 1453–1460, 2011.
  • T. Yamaguchi, Y. Nakano, M. Maruyama, H. Miyao, and T. Hananoi. Digit classification on signboards for telephone number recognition. In ICDAR, pages 359–363, 2003.
  • J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.