
Understanding and Exploring the Network with Stochastic Architectures

NeurIPS 2020


Abstract

There is an emerging trend to train a network with stochastic architectures to enable various architectures to be plugged and played during inference. However, the existing investigation is highly entangled with neural architecture search (NAS), limiting its widespread use across scenarios. In this work, we decouple the training of a netw…


Introduction
  • Deep neural networks (DNNs) are the de facto methods to model complex data in a wide spectrum of practical scenarios [12, 36, 38, 40].
  • Based on a trained NSA, the authors can make predictions on validation data with diverse architectures, seen or even unseen during training, owing to its high compatibility with various architectures.
  • The authors calculate the test accuracy of 200 randomly sampled architectures under the NSA-i models trained with architecture spaces of various sizes (a hedged sketch of how such per-architecture accuracies feed the later AUC analysis follows this list).
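The per-architecture accuracies gathered this way are what the later AUC analysis (Results, Table 1) is built on: accuracy is used as a score for telling seen architectures apart from unseen ones. Below is a minimal, hedged sketch of that computation; the accuracy values are synthetic placeholders, not numbers from the paper.

```python
# Hedged sketch: AUC for "seen vs. unseen architecture" given per-architecture
# validation accuracy. The accuracies below are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
acc_seen = rng.normal(loc=0.94, scale=0.01, size=100)    # architectures seen in training
acc_unseen = rng.normal(loc=0.93, scale=0.01, size=100)  # architectures never sampled

scores = np.concatenate([acc_seen, acc_unseen])          # accuracy acts as the score
is_seen = np.concatenate([np.ones(100), np.zeros(100)])  # 1 = seen architecture

# An AUC near 0.5 means accuracy alone cannot tell seen and unseen apart, i.e.,
# the shared weights generalize to architectures not used during training.
print("AUC:", roc_auc_score(is_seen, scores))
```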
Highlights
  • Deep neural networks (DNNs) are the de facto methods to model complex data in a wide spectrum of practical scenarios [12, 36, 38, 40]
  • Recent research even permits us to train a network without a fixed architecture [3, 45, 42, 1, 11]: at every training iteration, an architecture sample is randomly drawn from an architecture distribution and used to guide the training of the network weights, which is known as the weight-sharing technique in neural architecture search (NAS); a minimal sketch of this scheme follows this list
  • Though the weight-sharing network with stochastic architectures is promising, its usage is closely entangled with NAS, where it serves to relieve the burden of training thousands of networks
  • Given the potential of the network with stochastic architectures (NSA) to unleash the predictive capacity of diverse architectures during inference, we apply NSA to a variety of tasks ranging from ensemble learning and uncertainty estimation to semi-supervised learning, which is unexplored in previous works
  • We observe two issues of NSA, training/test disparity and mode collapse, which are ignored by previous works, and propose two novel approaches to address them
  • We further provide valuable insights on how to train an NSA, hopefully benefiting NAS
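The weight-sharing mechanism described above can be made concrete with a short PyTorch sketch. This is a minimal illustration under assumed choices (a toy three-op candidate set per layer, a uniform architecture distribution, random data), not the paper's implementation.

```python
# Minimal sketch of training a network with stochastic architectures:
# one set of shared weights, a fresh architecture sampled every iteration.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedLayer(nn.Module):
    """A layer with several candidate ops; each op is instantiated once,
    so its weights are shared across all sampled architectures."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=5, padding=2),
            nn.Identity(),
        ])

    def forward(self, x, choice):
        return F.relu(self.ops[choice](x))

class TinyNSA(nn.Module):
    def __init__(self, channels=16, depth=4, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.layers = nn.ModuleList([MixedLayer(channels) for _ in range(depth)])
        self.head = nn.Linear(channels, num_classes)

    def sample_arch(self):
        # uniform architecture distribution: one op index per layer
        return [random.randrange(len(layer.ops)) for layer in self.layers]

    def forward(self, x, arch):
        x = self.stem(x)
        for layer, choice in zip(self.layers, arch):
            x = layer(x, choice)
        x = x.mean(dim=(2, 3))  # global average pooling
        return self.head(x)

model = TinyNSA()
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
for step in range(20):                    # toy loop on random data
    x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    arch = model.sample_arch()            # a new architecture every iteration
    loss = F.cross_entropy(model(x, arch), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```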
Results
  • The authors report the average accuracy of the seen architectures and the unseen ones, w.r.t. the training space size S.
  • These results validate the generalization capacity of NSA, perhaps because the shared weights learn common structures of the architectures.
  • Given the potential of NSA to unleash the predictive capacity of diverse architectures during inference, the authors apply NSA to a variety of tasks ranging from ensemble learning and uncertainty estimation to semi-supervised learning, which is unexplored in previous works.
  • As shown in Sec. 4.1, there is evidence that ensembling the predictions from different architectures does boost validation performance, consistent with common knowledge [18], so the authors continue evaluating this technique on the more expressive NSA-id models, which use conditional BNs.
  • The authors implement a further baseline, Average of individuals, in which they individually train 5 networks with the 5 architectures used by NSA-id and report their average result, to show the average performance of the used architectures rather than their ensemble, since comparing that ensemble to NSA-id would be unfair given the 5× training cost (the sketch after this list contrasts the two quantities).
  • The authors attribute this to the fact that different architectures offer relatively diverse predictions in NSA-id, alleviating over-confidence, while MC dropout is known to suffer from mode collapse [18] and cannot benefit much from prediction ensembling.
  • The authors implement two baselines: (i) WRN-28-10† with the Π model [17], which works the same as NSA-id except that dropout is used to produce two predictions per input; (ii) WRN-28-10† trained with only labeled data.
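A hedged sketch of the two quantities contrasted above: the ensemble of K predictions (averaged softmax probabilities) versus the average accuracy of the individual predictors. Synthetic logits stand in for the outputs of different sampled architectures; nothing here comes from the paper's code.

```python
# Synthetic illustration: ensembling K sets of predictions by averaging
# softmax probabilities vs. averaging the individual accuracies.
import numpy as np

rng = np.random.default_rng(0)
num_classes, n, K = 10, 1000, 5
labels = rng.integers(0, num_classes, size=n)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Each "architecture" sees the true signal plus its own noise.
clean = 3.0 * np.eye(num_classes)[labels]
logits = clean[None, :, :] + rng.normal(scale=2.5, size=(K, n, num_classes))
probs = softmax(logits)                                   # shape (K, n, C)

average_of_individuals = (probs.argmax(-1) == labels).mean(axis=1).mean()
ensemble = (probs.mean(axis=0).argmax(-1) == labels).mean()

print("average of individuals:", average_of_individuals)  # typically lower
print("ensemble of K samples: ", ensemble)
```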
Conclusion
  • It is common to train a network with stochastic architectures to enable the evaluation of ample architectures given shared weights.
  • The authors observe two issues of NSA, training/test disparity and mode collapse, which are ignored by previous works, and propose two novel approaches to address them.
  • This work manages to understand a wide range of properties of the network with stochastic architectures (NSA) and applies it to several challenging tasks to fully exploit its potential.
Tables
  • Table 1: The change of the AUC, which measures the differentiability between the seen architectures and unseen ones given the validation accuracy, w.r.t. the training space size S. We also report the average accuracy of the seen and unseen architectures for reference
  • Table 2: Comparison of NSA-id, using an ensemble of 5 different architectures for prediction, against a range of competing baselines, in terms of test error and ECE (a minimal ECE sketch follows these tables). ENAS and DARTS adopt parameter-efficient separable convolutions and apply re-training to obtain their results
  • Table 3: Comparison between NSA-id and MC dropout in terms of the quality of uncertainty estimates. PGDa-b-c denotes the PGD adversary with perturbation budget a/255, b steps, and step size c/255
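For readers unfamiliar with the ECE reported in Table 2, here is a minimal sketch of expected calibration error with equal-width confidence bins; the 15-bin choice and the toy data are assumptions, not details taken from the paper.

```python
# Minimal expected calibration error (ECE) with equal-width confidence bins.
import numpy as np

def ece(probs, labels, n_bins=15):
    """probs: (n, C) softmax outputs; labels: (n,) integer targets."""
    conf = probs.max(axis=1)                  # confidence = top-1 probability
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # |accuracy - confidence| in the bin, weighted by its share of samples
            err += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return err

# Toy check: labels drawn from the predicted distribution are well calibrated.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=2000)
labels = np.array([rng.choice(10, p=p) for p in probs])
print("ECE:", ece(probs, labels))
```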
Related work
  • Randomizing certain parts of DNNs is usually indispensable to prevent the trained model from overfitting, over-confidence, and co-adaptation [35, 39, 19, 8, 21]. But existing stochastic regularizations are commonly applied locally to the network weights or the hidden feature maps, which is argued to be less effective than globally regularizing the behaviour of the model [4], as done in NSA. Besides, these stochastic regularizations are usually turned off in the inference phase, while NSA predicts with stochastic architectures and benefits from such stochasticity. A more principled approach to including stochasticity is Bayesian Neural Networks (BNNs) [24, 28, 9, 2, 23, 7], which place uncertainty on the network weights and measure predictive uncertainty via Bayes' theorem. But BNNs are known to suffer from mode collapse [18] and training challenges [50], and hence are not widely used in practice.
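MC dropout [7], which the Results and Table 3 use as the uncertainty baseline, keeps dropout active at test time and averages several stochastic forward passes. The following is a standard, minimal sketch of that procedure (toy model, random inputs), not the paper's experimental setup.

```python
# Standard MC dropout inference sketch: average T stochastic forward passes
# and use predictive entropy as an uncertainty score.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 256), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(256, 10),
)

@torch.no_grad()
def mc_dropout_predict(model, x, T=20):
    model.train()  # keeps Dropout stochastic at test time (no BatchNorm in this toy model)
    probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(T)])
    mean = probs.mean(dim=0)                              # predictive distribution
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)
    return mean, entropy

x = torch.randn(4, 3, 32, 32)
mean_probs, uncertainty = mc_dropout_predict(model, x)
print(mean_probs.shape, uncertainty)
```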

    In neural architecture search (NAS) [52, 53, 31, 30, 22, 45, 3, 44, 37], tremendous efforts have been devoted to discovering performant architectures in a broad yet structured architecture space. For computationally feasible search, it is common to train a network with stochastic architectures to enable the evaluation of ample architectures given shared weights. But almost all NAS works neglect to analyze the properties of such a network, e.g., its convergence, training stability, and generalization to unseen architectures, which are of central importance in NAS. In this work, we uncover these unexamined aspects and provide novel insights into how to train a better NSA for NAS.
Funding
  • This work was supported by the National Key Research and Development Program of China (No.2017YFA0700904), NSFC Projects (Nos. 61620106010, U19B2034, U1811461), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua-Huawei Joint Research Program, a grant from Tsinghua Institute for Guo Qiang, Tiangong Institute for Intelligent Computing, and the NVIDIA NVAIL Program with GPU/DGX Acceleration
Study subjects and analysis
Aforementioned individuals: 5
As a note, the ensemble of the 5 aforementioned individuals yields a striking 2.36% error rate on CIFAR-10, confirming that weight sharing is a main cause of mode collapse.

References
  • Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning, pages 550–559, 2018.
  • Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
  • Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
  • Zhijie Deng, Yucen Luo, Jun Zhu, and Bo Zhang. Dbsn: Measuring uncertainty through bayesian learning of deep neural network structures. arXiv preprint arXiv:1911.09804, 2019.
  • Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 5(1): 17–60, 1960.
  • Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
  • Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pages 10727–10737, 2018.
  • Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
  • Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1321–1330. JMLR.org, 2017.
  • Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
  • Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
  • Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
  • Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.
  • Stefan Lee, Senthil Purushwalkam, Michael Cogswell, David Crandall, and Dhruv Batra. Why m heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314, 2015.
  • Senwei Liang, Yuehaw Khoo, and Haizhao Yang. Drop-activation: Implicit parameter reduction and harmonic regularization. arXiv preprint arXiv:1811.05850, 2018.
  • Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
  • Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2218–2227. JMLR.org, 2017.
  • David JC MacKay. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992.
  • Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018.
  • Takeru Miyato and Masanori Koyama. cgans with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.
  • Jerome L Myers, Arnold Well, and Robert Frederick Lorch. Research design and statistical analysis. Routledge, 2010.
  • Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
  • Tianyu Pang, Kun Xu, Chao Du, Ning Chen, and Jun Zhu. Improving adversarial robustness via promoting ensemble diversity. arXiv preprint arXiv:1901.08846, 2019.
  • Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
  • Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2902–2911. JMLR.org, 2017.
  • Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
  • Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Lewis Smith and Yarin Gal. Understanding measures of uncertainty for adversarial example detection. arXiv preprint arXiv:1803.08533, 2018.
  • Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1): 1929–1958, 2014.
  • Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2820–2828, 2019.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
  • Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1235–1244, 2015.
  • Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791, 2020.
  • Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.
  • Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
  • Saining Xie, Alexander Kirillov, Ross Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1284–1293, 2019.
  • Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural architecture search. arXiv preprint arXiv:1812.09926, 2018.
  • Jiahui Yu and Thomas S. Huang. Universally slimmable networks and improved training techniques. In 2019 IEEE/CVF International Conference on Computer Vision, pages 1803–1811. IEEE.
  • Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas S. Huang. Slimmable neural networks. In 7th International Conference on Learning Representations. OpenReview.net, 2019.
  • Kun Yuan, Quanquan Li, Yucong Zhou, Jing Shao, and Junjie Yan. Diving into optimization of topology in neural networks, 2020. URL https://openreview.net/forum?id=HyetFnEFDS.
  • Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. arXiv preprint arXiv:1712.02390, 2017.
  • Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6848–6856, 2018.
  • Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
  • Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.
Authors
Zhijie Deng
Shifeng Zhang