Do Better ImageNet Models Transfer Better?
Computer Vision and Pattern Recognition (CVPR), 2019.
Abstract:
Transfer learning has become a cornerstone of computer vision with the advent of ImageNet features, yet little work has been done to evaluate the performance of ImageNet architectures across different datasets. An implicit hypothesis in modern computer vision research is that models that perform better on ImageNet necessarily perform bett...
Introduction
- The last decade of computer vision research has pursued academic benchmarks as a measure of progress.
- An implicit assumption behind this progress is that network architectures that perform better on ImageNet necessarily perform better on other vision tasks.
- Another assumption is that bet-.
Highlights
- The last decade of computer vision research has pursued academic benchmarks as a measure of progress
- Network architectures measured against ImageNet have fueled much progress in computer vision research across a broad array of problems, including transferring to new datasets [17, 56], object detection [32], image segmentation [27, 7] and perceptual metrics of images [35]
- We evaluated models on 12 image classification datasets ranging in training set size from 2,040 to 75,750 images (20 to 5,000 images per class; Table 1)
- On the datasets we examine, we outperform all such methods by finetuning state-of-the-art convolutional neural networks (Supp
Methods
- Much of the analysis in this work requires comparing accuracies across datasets of differing difficulty.
- When fitting linear models to accuracy values across multiple datasets, the authors consider effects of model and dataset to be additive.
- In this context, using untransformed accuracy as a dependent variable is problematic: The meaning of a 1% additive increase in accuracy is different if it is relative to a base accuracy of 50% vs 99%.
- The authors take the mean and standard error of the adjusted accuracy across datasets, and multiply the latter by a correction factor
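The points above can be sketched in a few lines. This is only an illustration under stated assumptions: the bullets do not name the transform, so the sketch assumes the log-odds (logit) of accuracy; it assumes "adjusted accuracy" means removing each dataset's mean and adding back the grand mean; and it assumes the correction factor is the sqrt(M / (M - 1)) of Morey (2008), with M taken to be the number of models. The function names and the toy accuracies are hypothetical, not values from the paper.

```python
import math
from statistics import mean, stdev


def logit(p):
    """Log-odds of an accuracy p in (0, 1). Assumed transform, see lead-in."""
    return math.log(p / (1.0 - p))


def adjusted_scores(acc):
    """acc maps dataset -> model -> accuracy in (0, 1).

    Returns logit accuracies with each dataset's mean removed and the grand
    mean added back, so that dataset difficulty no longer dominates.
    """
    z = {d: {m: logit(p) for m, p in row.items()} for d, row in acc.items()}
    grand = mean(v for row in z.values() for v in row.values())
    return {
        d: {m: v - mean(row.values()) + grand for m, v in row.items()}
        for d, row in z.items()
    }


def mean_and_corrected_se(acc):
    """Per-model mean and corrected standard error across datasets."""
    adj = adjusted_scores(acc)
    datasets = list(adj)
    models = list(adj[datasets[0]])
    if len(datasets) < 2 or len(models) < 2:
        raise ValueError("need at least two datasets and two models")
    # Assumed Morey (2008) style correction, M = number of models.
    correction = math.sqrt(len(models) / (len(models) - 1))
    result = {}
    for m in models:
        vals = [adj[d][m] for d in datasets]
        se = stdev(vals) / math.sqrt(len(vals))
        result[m] = (mean(vals), se * correction)
    return result


# A one-point gain means different things at different base accuracies:
gain_mid = logit(0.51) - logit(0.50)
gain_high = logit(0.99) - logit(0.98)
print(f"log-odds gain 50%->51%: {gain_mid:.3f}, 98%->99%: {gain_high:.3f}")

# Hypothetical accuracies for two models on three datasets.
toy_acc = {
    "dataset_1": {"model_a": 0.60, "model_b": 0.70},
    "dataset_2": {"model_a": 0.90, "model_b": 0.95},
    "dataset_3": {"model_a": 0.80, "model_b": 0.85},
}
for model, (mu, se) in mean_and_corrected_se(toy_acc).items():
    print(f"{model}: adjusted logit accuracy {mu:.3f} +/- {se:.3f}")
```

On the log-odds scale, the same one-point gain is much larger near the ceiling than near 50%, which is the reason for not modelling raw accuracy additively.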
Results
- The authors examined 16 modern networks ranging in ImageNet (ILSVRC 2012 validation) top-1 accuracy from 71.6% to 80.8%.
- Appendix A.3 provides training hyperparameters along with further details of each network, including the ImageNet top-1 accuracy, parameter count, dimension of the penultimate layer, input image size, and performance of retrained models.
- The authors rescaled images to the same image size as was used for ImageNet training
Conclusion
- On the question of whether the computer vision community has overfit to ImageNet, the authors' results suggest the answer is no: they find a strong correlation between ImageNet top-1 accuracy and transfer accuracy, suggesting that better ImageNet architectures are capable of learning better, transferable representations.
- However, a number of widely-used regularizers that improve ImageNet performance do not produce better representations.
- These regularizers are harmful to the penultimate layer feature space, and have mixed effects when networks are fine-tuned.
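To make the notion of a correlation between ImageNet accuracy and transfer accuracy concrete, the following sketch computes a Pearson correlation on log-odds-transformed accuracies. It is an illustration only: the use of the logit scale is an assumption, `pearson_r` is a hypothetical helper, and apart from the 71.6% and 80.8% endpoints quoted in Results, every number is a made-up placeholder rather than a result from the paper.

```python
import math


def logit(p):
    """Log-odds of an accuracy p in (0, 1)."""
    return math.log(p / (1.0 - p))


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Placeholder values: ImageNet top-1 and transfer accuracy for three
# hypothetical networks (only 0.716 and 0.808 appear in the text above).
imagenet_top1 = [0.716, 0.750, 0.808]
transfer_acc = [0.850, 0.880, 0.910]

r = pearson_r([logit(p) for p in imagenet_top1],
              [logit(p) for p in transfer_acc])
print(f"correlation on the log-odds scale: {r:.3f}")
```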
Tables
- Table 1: Datasets examined in transfer learning
Related work
- ImageNet follows in a succession of progressively larger and more realistic benchmark datasets for computer vision. Each successive dataset was designed to address perceived issues with the size and content of previous datasets. Torralba and Efros [69] showed that many early datasets were heavily biased, with classifiers trained to recognize or classify objects on those datasets possessing almost no ability to generalize to images from other datasets.
Early work using convolutional neural networks (CNNs) for transfer learning extracted fixed features from ImageNet-trained networks and used these features to train SVMs and logistic regression classifiers for new tasks [17, 56, 6]. These features could outperform hand-engineered features even for tasks very distinct from ImageNet classification [17, 56]. Following this work, several studies compared the performance of AlexNet-like CNNs of varying levels of computational complexity in a transfer learning setting with no fine-tuning. Chatfield et al. [6] found that, out of three networks, the two more computationally expensive networks performed better on PASCAL VOC. Similar work concluded that deeper networks produce higher accuracy across many transfer tasks, but wider networks produce lower accuracy [2]. More recent evaluation efforts have investigated transfer from modern CNNs to medical image datasets [51], and transfer of sentence embeddings to language tasks [12].
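The fixed-feature setting described above can be sketched as training a logistic regression classifier on top of frozen feature vectors. The example below is a minimal sketch, not the procedure of any cited work: the feature vectors are hypothetical stand-ins for activations extracted from an ImageNet-trained network, the task is reduced to two classes, and plain batch gradient descent is used as a simple substitute for whatever optimizer a given study employed.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(features, labels, lr=0.5, epochs=500, l2=1e-3):
    """Binary logistic regression on fixed features via gradient descent.

    The network that produced `features` is never updated; only the
    linear classifier (w, b) is learned.
    """
    n, dim = len(features), len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Average gradient plus L2 penalty on the weights.
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Predicted class (0 or 1) for one feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Hypothetical 2-D "features"; real penultimate-layer features would have
# hundreds or thousands of dimensions.
toy_features = [[2.0, 0.1], [1.8, 0.3], [0.2, 1.9], [0.1, 2.2]]
toy_labels = [1, 1, 0, 0]

w, b = train_logreg(toy_features, toy_labels)
print([predict(w, b, x) for x in toy_features])
```

Fine-tuning differs from this setting in that the network weights themselves are also updated on the new task.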
References
- Pulkit Agrawal, Ross B. Girshick, and Jitendra Malik. Analyzing the performance of multilayer neural networks for object recognition. In European Conference on Computer Vision (ECCV), 2014.
- H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. Factors of transferability for a generic convnet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1790–1802, Sept 2016.
- Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346–359, 2008.
- Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L Alexander, David W Jacobs, and Peter N Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2019–2026. IEEE, 2014.
- Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 — mining discriminative components with random forests. In European Conference on Computer Vision (ECCV), pages 446–461.
- Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: delving deep into convolutional nets. In British Machine Vision Conference, 2014.
- Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
- Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2019.
- Brian Chu, Vashisht Madhavan, Oscar Beijbom, Judy Hoffman, and Trevor Darrell. Best practices for fine-tuning visual classifiers to new domains. In Gang Hua and Hervé Jégou, editors, Computer Vision – ECCV 2016 Workshops, pages 435–442, Cham, 2016. Springer International Publishing.
- Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3606–3613. IEEE, 2014.
- Mircea Cimpoi, Subhransu Maji, and Andrea Vedaldi. Deep filter banks for texture recognition and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3828–3836. IEEE, 2015.
- Alexis Conneau and Douwe Kiela. Senteval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449, 2018.
- Yin Cui, Feng Zhou, Jiang Wang, Xiao Liu, Yuanqing Lin, and Serge Belongie. Kernel pooling for convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
- Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893. IEEE, 2005.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655, 2014.
- Nanqing Dong and Eric P Xing. Domain adaption in one-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 573–588.
- Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
- Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop on Generative-Model Based Vision, 2004.
- Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017.
- Blair Hanley Frank. Google Brain chief: Deep learning takes at least 100,000 examples. In VentureBeat. https://venturebeat.com/2017/10/23/google-brain-chiefsays-100000-examples-is-enough-data-for-deep-learning/, 2017.
- Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 317–326, 2016.
- Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- Sam Gross and Michael Wilber. Training and investigating residual nets. In The Torch Blog. http://torch.ch/blog/2016/02/04/resnets.html, 2016.
- Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- Luis Herranz, Shuqiang Jiang, and Xiangyang Li. Scene recognition with CNNs: objects, scales and dataset bias. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 571–579, 2016.
- Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
- Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Mi-Young Huh, Pulkit Agrawal, and Alexei A. Efros. What makes ImageNet good for transfer learning? CoRR, abs/1608.08614, 2016.
- Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
- Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), pages 694–711.
- Jonathan Krause, Jia Deng, Michael Stark, and Li Fei-Fei. Collecting a large-scale dataset of fine-grained cars. In Second Workshop on Fine-Grained Visual Categorization, 2013.
- Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
- Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
- Zhichao Li, Yi Yang, Xiao Liu, Feng Zhou, Shilei Wen, and Wei Xu. Dynamic computational time for visual attention. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1199–1209, 2017.
- Tsung-Yu Lin and Subhransu Maji. Visualizing and understanding deep texture representations. In IEEE International Conference on Computer Vision (ICCV), pages 2791–2799, 2016.
- Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. In IEEE International Conference on Computer Vision (ICCV), pages 1449–1457, 2015.
- Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.
- David G Lowe. Object recognition from local scale-invariant features. In IEEE International Conference on Computer Vision, volume 2, pages 1150–1157.
- Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
- Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In European Conference on Computer Vision (ECCV), pages 181–196, 2018.
- S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
- Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
- Richard D. Morey. Confidence intervals from normalized data: A correction to cousineau (2005). Tutorials in Quantitative Methods for Psychology, 4(2):61–64, 2008.
- Romain Mormont, Pierre Geurts, and Raphaël Marée. Comparison of deep transfer learning strategies for digital pathology. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2262–2271, 2018.
- Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Computer Vision, Graphics & Image Processing, 2008. ICVGIP’08. Sixth Indian Conference on, pages 722–729. IEEE, 2008.
- Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3498–3505. IEEE, 2012.
- Yuxin Peng, Xiangteng He, and Junjie Zhao. Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing, 27(3):1487–1500, 2018.
- Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Machine Learning, 2016.
- Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 512–519. IEEE, 2014.
- Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, Dec 2015.
- Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
- Tyler Scott, Karl Ridgeway, and Michael C Mozer. Adapted deep embeddings: A synthesis of methods for k-shot inductive transfer learning. In Advances in Neural Information Processing Systems, pages 76–85, 2018.
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.
- Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4080–4090, 2017.
- Yang Song, Fan Zhang, Qing Li, Heng Huang, Lauren J O’Donnell, and Weidong Cai. Locally-transferred fisher vectors for texture classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4912–4920, 2017.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 843–852. IEEE, 2017.
- Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017.
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
- Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.
- Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1521–1528. IEEE, 2011.
- Twan van Laarhoven. L2 regularization versus batch and weight normalization. CoRR, abs/1706.05350, 2017.
- Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
- Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3485–3492. IEEE, 2010.
- Hantao Yao, Shiliang Zhang, Yongdong Zhang, Jintao Li, and Qi Tian. Coarse-to-fine description for fine-grained visual categorization. IEEE Transactions on Image Processing, 25(10):4858–4872, 2016.
- Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
- Amir R Zamir, Alexander Sax, William Shen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3712–3722, 2018.
- Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization. In International Conference on Learning Representations, 2019.
- Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
- Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8697–8710, 2018.