Building high-level features using large scale unsupervised learning

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pages 8595-8598.

Cited by: 2047 | DOI: https://doi.org/10.1109/ICASSP.2013.6639343
Other links: academic.microsoft.com | arxiv.org

Abstract:

We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a deep sparse autoencoder on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200×20…
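The paper's model is a large network built from sparse autoencoder modules (with pooling and local contrast normalization). As a much smaller illustration of the core objective only, here is a single-layer sparse autoencoder loss in NumPy: reconstruction error plus a sparsity penalty on the hidden code. All names, sizes, and the specific penalty are illustrative, not the paper's exact formulation.

```python
import numpy as np

def sparse_autoencoder_loss(W, X, sparsity_weight=0.1):
    """Reconstruction loss plus an L1 sparsity penalty on the hidden code.

    W: (n_hidden, n_visible) weight matrix (tied encoder/decoder weights).
    X: (n_examples, n_visible) data matrix.
    """
    H = np.maximum(X @ W.T, 0.0)   # hidden activations (rectified encoder)
    X_hat = H @ W                  # linear decoder with tied weights
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(H), axis=1))
    return recon + sparsity_weight * sparsity

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))        # 100 random 8x8 "image patches"
W = 0.01 * rng.normal(size=(32, 64))  # 32 hidden units
loss = sparse_autoencoder_loss(W, X)
```

Minimizing a loss of this shape over W (by SGD) is what drives the hidden units toward sparse, reusable features; the paper stacks and scales this idea up massively.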

Introduction
  • The focus of this work is to build high-level, class-specific feature detectors from unlabeled images.
  • The authors would like to understand if it is possible to build a face detector from only unlabeled images.
  • Approaches that make use of inexpensive unlabeled data are often preferred, but they have not been shown to work well for building high-level features.
Highlights
  • The focus of this work is to build high-level, class-specific feature detectors from unlabeled images.
  • We would like to understand if it is possible to build a face detector from only unlabeled images. This approach is inspired by the neuroscientific conjecture that there exist highly class-specific neurons in the human brain, generally and informally known as "grandmother neurons." The extent of class-specificity of neurons in the brain is an area of active investigation, but current experimental evidence suggests the possibility that some neurons in the temporal cortex are highly selective for object categories such as faces or hands (Desimone et al., 1984), and perhaps even specific people (Quiroga et al., 2005).
  • Contemporary computer vision methodology typically emphasizes the role of labeled data to obtain these class-specific feature detectors
  • To build a face detector, one needs a large collection of images labeled as containing faces, often with a bounding box around the face
  • Our implementation scales to a cluster with thousands of machines thanks to model parallelism and asynchronous SGD
  • On ImageNet with 20K categories, our method achieved a 70% relative improvement over the highest other result of which we are aware (including unpublished results known to the authors of Weston et al. (2011)).
  • Our work shows that it is possible to train neurons to be selective for high-level concepts using entirely unlabeled data
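The asynchronous SGD mentioned above lets many workers update shared parameters without waiting on one another, at the cost of each gradient being computed against a possibly stale parameter snapshot. A single-process toy sketch of that idea, with threads standing in for machines and a trivial quadratic objective (the real system also shards the model itself across machines, which this sketch does not show):

```python
import threading
import numpy as np

# Shared parameters, updated by several workers without locking: each
# worker computes a gradient against a (possibly stale) snapshot and
# applies it directly, as in asynchronous SGD.
params = np.zeros(4)
target = np.full(4, 3.0)  # toy objective: minimize ||params - target||^2

def worker(n_steps=200, lr=0.01):
    for _ in range(n_steps):
        snapshot = params.copy()   # may be stale by the time we update
        grad = 2.0 * (snapshot - target)
        params[:] -= lr * grad     # unsynchronized, racy update

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Despite the races, params ends up close to target: stale gradients
# still point roughly the right way, which is why async SGD works.
```

The design trade-off is throughput versus gradient freshness: workers never block on each other, and in practice the slightly stale gradients still drive convergence.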
Methods
  • The authors describe the analysis of the learned representations in recognizing faces (“the face detector”) and present control experiments to understand invariance properties of the face detector.
  • Results for other concepts are also presented.
  • Test set: the test set consists of 37,000 images sampled from two datasets, the Labeled Faces in the Wild dataset (Huang et al., 2007) and the ImageNet dataset (Deng et al., 2009).
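The control experiments probe whether the face neuron's response survives transformations (shifts, scaling) of the test image. A toy sketch of the translation part of such a test, with a hand-built correlation "neuron" (maximum template response over all placements) standing in for the learned one; the detector and images here are invented for illustration:

```python
import numpy as np

def neuron_response(image, template):
    """Toy stand-in for a learned neuron: best template correlation over
    all placements, which is what gives it tolerance to translation."""
    th, tw = template.shape
    H, W = image.shape
    best = -np.inf
    for i in range(H - th + 1):
        for j in range(W - tw + 1):
            patch = image[i:i + th, j:j + tw]
            best = max(best, float(np.sum(patch * template)))
    return best

# A bright square "face" on a dark background, tested at several offsets.
template = np.ones((4, 4))
responses = []
for offset in range(0, 8, 2):
    img = np.zeros((16, 16))
    img[offset:offset + 4, offset:offset + 4] = 1.0
    responses.append(neuron_response(img, template))
# A translation-tolerant detector responds identically at every offset.
```

The same loop, with the image rescaled or rotated instead of shifted, gives the other invariance probes.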
Results
  • On ImageNet with 20,000 categories, the authors achieved 15.8% accuracy, a relative improvement of 70% over the state-of-the-art.
  • To check the proportion of faces in the dataset, the authors ran an OpenCV face detector on 60×60 patches randomly sampled from the dataset.
  • This experiment shows that patches detected as faces by the OpenCV face detector account for less than 3% of the 100,000 sampled patches.
  • On ImageNet with 20K categories, the method achieved a 70% relative improvement over the highest other result of which the authors are aware (including unpublished results known to the authors of Weston et al. (2011)).
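The patch-sampling check above is straightforward to reproduce in outline: sample random 60×60 patches from frames, run a face detector on each, and report the detection rate. The sketch below uses synthetic frames and a placeholder predicate where the authors used OpenCV's Haar-cascade face detector; both the data and the predicate are stand-ins for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def looks_like_face(patch):
    """Placeholder predicate; the paper used OpenCV's face detector here.
    Any boolean detector plugs into the same estimation loop."""
    return patch.mean() > 0.9  # arbitrary threshold for this sketch

frames = rng.random(size=(50, 200, 200))  # toy stand-in for video frames
n_patches, hits = 1000, 0
for _ in range(n_patches):
    f = rng.integers(len(frames))
    y, x = rng.integers(0, 200 - 60, size=2)
    patch = frames[f, y:y + 60, x:x + 60]
    hits += looks_like_face(patch)
face_fraction = hits / n_patches
```

With a real detector and real frames, `face_fraction` is the quantity the authors report as being under 3%, which rules out the concern that the face neuron simply memorized an abundance of face patches.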
Conclusion
  • The authors simulated high-level class-specific neurons using unlabeled data.
  • The authors obtained neurons that function as detectors for faces, human bodies, and cat faces by training on random frames of YouTube videos.
  • These neurons naturally capture complex invariances such as out-of-plane rotation and scale invariance.
Tables
  • Table1: Summary of numerical comparisons between our algorithm and other baselines. Top: Our algorithm vs. simple baselines. Here, the first three columns are results for methods that require no training: random guess, random weights (of the network at initialization, without any training), and the best linear filters selected from 100,000 examples sampled from the training set. The last three columns are results for methods with training: the best neuron in the first layer, the best neuron in the highest layer after training, and the best neuron in the network when the contrast normalization layers are removed. Bottom: Our algorithm vs. autoencoders and K-means.
  • Table2: Summary of classification accuracies for our method and other state-of-the-art baselines on ImageNet, compared on two dataset versions: 2009 (∼9M images, ∼10K categories) and 2011 (∼16M images, ∼20K categories).
Funding
  • Finds that the same network is sensitive to other high-level concepts such as cat faces and human bodies
  • Addresses this problem by scaling up the core components involved in training deep networks: the dataset, the model, and the computational resources
  • Achieved 15.8% accuracy, a relative improvement of 70% over the state-of-the-art
  • Ran an OpenCV face detector on 60×60 patches randomly sampled from the dataset; this experiment shows that patches detected as faces account for less than 3% of the 100,000 sampled patches
Reference
  • Bengio, Y. and LeCun, Y. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, 2007.
  • Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In NIPS, 2007.
  • Berkes, P. and Wiskott, L. Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 2005.
  • Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.
  • Coates, A., Lee, H., and Ng, A. Y. An analysis of single-layer networks in unsupervised feature learning. In AISTATS 14, 2011.
  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
  • Deng, J., Berg, A., Li, K., and Fei-Fei, L. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.
  • Desimone, R., Albright, T., Gross, C., and Bruce, C. Stimulus-selective properties of inferior temporal neurons in the macaque. The Journal of Neuroscience, 1984.
  • DiCarlo, J. J., Zoccolan, D., and Rust, N. C. How does the brain solve visual object recognition? Neuron, 2012.
  • Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of deep networks. Technical report, University of Montreal, 2009.
  • Fukushima, K. and Miyake, S. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 1982.
  • Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.
  • Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
  • Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
  • Hubel, D. H. and Wiesel, T. N. Receptive fields of single neurons in the cat's visual cortex. Journal of Physiology, 1959.
  • Hyvarinen, A., Hurri, J., and Hoyer, P. O. Natural Image Statistics. Springer, 2009.
  • Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
  • Keller, C., Enzweiler, M., and Gavrila, D. M. A new benchmark for stereo-based pedestrian detection. In Proc. of the IEEE Intelligent Vehicles Symposium, 2009.
  • Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P. W., and Ng, A. Y. Tiled convolutional neural networks. In NIPS, 2010.
  • Le, Q. V., Karpenko, A., Ngiam, J., and Ng, A. Y. ICA with reconstruction cost for efficient overcomplete feature learning. In NIPS, 2011a.
  • Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. Y. On optimization methods for deep learning. In ICML, 2011b.
  • LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • Lee, H., Battle, A., Raina, R., and Ng, A. Y. Efficient sparse coding algorithms. In NIPS, 2007.
  • Lee, H., Ekanadham, C., and Ng, A. Y. Sparse deep belief net model for visual area V2. In NIPS, 2008.
  • Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
  • Lyu, S. and Simoncelli, E. P. Nonlinear image representation using divisive normalization. In CVPR, 2008.
  • Olshausen, B. and Field, D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.
  • Pakkenberg, B., Pelvig, D., Marner, L., Bundgaard, M. J., Gundersen, H. J. G., Nyengaard, J. R., and Regeur, L. Aging and the human neocortex. Experimental Gerontology, 2003.
  • Pinto, N., Cox, D. D., and DiCarlo, J. J. Why is real-world visual object recognition hard? PLoS Computational Biology, 2008.
  • Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. Invariant visual representation by single neurons in the human brain. Nature, 2005.
  • Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. Self-taught learning: Transfer learning from unlabeled data. In ICML, 2007.
  • Raina, R., Madhavan, A., and Ng, A. Y. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.
  • Ranzato, M., Huang, F. J., Boureau, Y., and LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
  • Riesenhuber, M. and Poggio, T. Hierarchical models of object recognition in cortex. Nature Neuroscience, 1999.
  • Sanchez, J. and Perronnin, F. High-dimensional signature compression for large-scale image classification. In CVPR, 2011.
  • Sermanet, P. and LeCun, Y. Traffic sign recognition with multi-scale convolutional neural networks. In IJCNN, 2011.
  • Weston, J., Bengio, S., and Usunier, N. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, 2011.
  • Zhang, W., Sun, J., and Tang, X. Cat head detection - how to effectively exploit shape and texture features. In ECCV, 2008.