BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models

European Conference on Computer Vision (ECCV), pp. 702-717 (2020)


Abstract

Neural architecture search (NAS) has shown promising results discovering models that are both accurate and fast. For NAS, training a one-shot model has become a popular strategy to rank the relative quality of different architectures (child models) using a single set of shared weights. However, while one-shot model weights can effectively…

Introduction
  • Designing network architectures that are both accurate and efficient is crucial for deep learning on edge devices.
  • While early NAS methods were prohibitively expensive for most practitioners, recent efficient NAS methods based on weight sharing reduce search costs by orders of magnitude [2,21,24,34]
  • These methods work by training a super-network and identifying a path through the network – a subset of its operations – which gives the best possible accuracy while satisfying a user-specified latency constraint for a specific hardware device.
  • The advantage of this approach is that the super-network can be trained once and then used to rank many different candidate architectures from a user-defined search space (a toy ranking sketch follows this list)
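The snippet below is a minimal, hypothetical sketch of such a weight-sharing ranking loop, not the authors' implementation: the names `ChildConfig`, `estimate_flops`, and `eval_shared_accuracy`, as well as the toy scoring and FLOPs formulas, are illustrative stand-ins for evaluating candidate child models with shared super-network weights under a resource budget.

```python
# Hypothetical sketch of weight-sharing NAS ranking under a FLOPs budget.
# All helper names and formulas here are illustrative, not the authors' API.
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildConfig:
    kernel_sizes: tuple   # per-layer depthwise kernel size (3 or 5)
    width_mult: float     # channel multiplier
    resolution: int       # input image resolution


def estimate_flops(cfg: ChildConfig) -> float:
    """Toy FLOPs proxy: grows with width^2, resolution^2 and mean kernel area."""
    kernel_area = sum(k * k for k in cfg.kernel_sizes) / len(cfg.kernel_sizes)
    return cfg.width_mult ** 2 * (cfg.resolution / 224) ** 2 * kernel_area * 100e6


def eval_shared_accuracy(cfg: ChildConfig) -> float:
    """Stand-in for running the child with the shared super-network weights
    on a held-out set; here just a monotone toy score."""
    return 0.5 + 0.1 * cfg.width_mult + 0.0005 * cfg.resolution


def search(budget_flops: float, num_candidates: int = 1000) -> ChildConfig:
    candidates = [
        ChildConfig(
            kernel_sizes=tuple(random.choice([3, 5]) for _ in range(20)),
            width_mult=random.choice([0.5, 0.75, 1.0, 1.25]),
            resolution=random.choice([192, 224, 256, 288]),
        )
        for _ in range(num_candidates)
    ]
    feasible = [c for c in candidates if estimate_flops(c) <= budget_flops]
    if not feasible:
        raise ValueError("no candidate satisfies the budget")
    # Rank feasible child models by their shared-weight accuracy proxy.
    return max(feasible, key=eval_shared_accuracy)


print(search(budget_flops=400e6))
```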
Highlights
  • Designing network architectures that are both accurate and efficient is crucial for deep learning on edge devices
  • While early Neural Architecture Search (NAS) methods were prohibitively expensive for most practitioners, recent efficient NAS methods based on weight sharing reduce search costs by orders of magnitude [2,21,24,34]
  • We propose several techniques to bridge the gap between the distinct initialization and learning dynamics across small and big child models with shared parameters
  • Our preliminary results show that this leads to ∼ 0.3% improvement on average top-1 accuracy for child models compared with sampling different patches
  • Following previous resource-aware NAS methods [5, 15, 31, 32, 33], our network architectures consist of a stack of inverted bottleneck residual blocks (MBConv) [28]
  • For small-sized models, our BigNASModel-S achieves 76.5% accuracy under only 240 MFLOPs, which is 1.3% better than MobileNetV3 at similar FLOPs and 0.5% better than ResNet-50 [11] with 17× fewer FLOPs
  • We presented a novel paradigm for neural architecture search by training a single-stage model, from which high-quality child models of different sizes can be induced for instant deployment without retraining or finetuning
Methods
  • The authors first present the details of the search space, followed by the main results compared with the previous state of the art in terms of both accuracy and efficiency.
  • For weight sharing along the kernel-size dimension in the inverted residual blocks, a 3 × 3 depthwise kernel is defined to be the center of a 5 × 5 depthwise kernel (see the slicing sketch after this list).
  • Both kernel sizes and channel numbers can be adjusted layer-wise.
  • The input resolution is a network-wise configuration, and the number of layers is a stage-wise configuration in the search space
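Below is a minimal sketch, assuming a NumPy weight layout of shape (kernel_h, kernel_w, channels), of how such center-based kernel sharing can be expressed; the function name and shapes are illustrative, not the authors' code.

```python
# Illustrative only: center-slicing a shared 5x5 depthwise kernel so that the
# 3x3 child kernel reuses the central weights of the 5x5 kernel.
import numpy as np


def center_slice(shared_kernel: np.ndarray, size: int) -> np.ndarray:
    """Return the centered (size x size) sub-kernel of a (K, K, channels) kernel."""
    full = shared_kernel.shape[0]
    start = (full - size) // 2          # e.g. (5 - 3) // 2 = 1
    return shared_kernel[start:start + size, start:start + size, :]


shared = np.random.randn(5, 5, 32)      # one shared depthwise kernel, 32 channels
k3 = center_slice(shared, 3)            # 3x3 child reuses the 9 central weights
k5 = center_slice(shared, 5)            # 5x5 child uses all 25 weights

# The 3x3 slice is a view into the same underlying parameters.
assert k3.shape == (3, 3, 32) and np.shares_memory(k3, shared)

# Channel numbers can be shared analogously by keeping the leading channels.
k3_slim = k3[:, :, :16]                 # a narrower child of the same layer
```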
Results
  • [Flattened results table: model-family comparison at roughly 400M, 600M and 1000M FLOPs.]
  • Baselines compared against BigNASModel-S: MobileNetV1 0.5×, MobileNetV2 0.75×, AutoSlim-MobileNetV2, MobileNetV3 1.0×, MnasNet-A1, Once-For-All, and Once-For-All (finetuned).
  • Baselines compared against BigNASModel-M: NASNet-B, MobileNetV2 1.3×, MobileNetV3 1.25×, MnasNet-A3, and EfficientNet-B0.
  • Baselines compared against BigNASModel-L: MobileNetV1 1.0×, NASNet-A, DARTS, and EfficientNet-B1.
Conclusion
  • The authors presented a novel paradigm for neural architecture search by training a single-stage model, from which high-quality child models of different sizes can be induced for instant deployment without retraining or finetuning.
  • The authors obtain a family of BigNASModels as slices in a big pretrained single-stage model.
  • These slices simultaneously surpass all state-of-the-art ImageNet classification models ranging from 200 MFLOPs to 1 GFLOPs. The authors hope this work can serve to further simplify and scale up neural architecture search
Tables
  • Table 1: MobileNetV2-based search space
  • Table 2: Analysis on child models sampled from BigNASModel. We compare the ImageNet validation performance of (1) the child model directly sampled from BigNASModel without finetuning (w/o Finetuning) and (2) the child model finetuned with various constant learning rates (w/ Finetuning at different lr). A blue subscript indicates a performance improvement, while a red subscript indicates degradation
  • Table 3: Analysis on training child architectures from scratch. We compare the ImageNet validation performance of (1) the child model directly sampled from BigNASModel without finetuning (w/o Finetuning), (2) child architectures trained from scratch without distillation (FromScratch w/o distill), and (3) child architectures trained from scratch with two distillation methods A [14] and B [35] (FromScratch w/ distill (A)/(B))
Related Work
  • Earlier NAS methods [19,20,26,37,38] train thousands of candidate architectures from scratch (on a smaller proxy task) and use their validation performance as the feedback to an algorithm that learns to focus on the most promising regions in the search space. More recent works have sought to amortize the cost by training a single over-parameterized one-shot model. Each architecture in the search space uses only a subset of the operations in the one-shot model; these child models can be efficiently ranked by using the shared weights to estimate their relative accuracies [2, 3, 5, 21, 24, 33, 34].

    As a complementary direction, resource-aware NAS methods are proposed to simultaneously maximize prediction accuracy and minimize resource requirements such as latency, FLOPs, or memory footprints [4, 9, 30, 31, 33, 34].

    All the aforementioned approaches require two-stage training: Once the best architectures have been identified (either through the proxy tasks or using a one-shot model), they have to be retrained from scratch to obtain a final model with higher accuracy. In most of these existing works, a single search experiment only targets a single resource budget or a narrow range of resource budgets at a time.
Findings
  • Our discovered model family, the BigNASModels, achieves top-1 accuracies ranging from 76.5% to 80.9%, surpassing state-of-the-art models in this range, including EfficientNets and Once-for-All networks, without extra retraining or post-processing
  • With the proposed techniques, we are able to train a high-quality single-stage model on ImageNet and obtain a family of child models that simultaneously surpass all the state-of-the-art models in the range of 200 to 1000 MFLOPs, including EfficientNets B0-B2 (1.6% more accurate under 400 MFLOPs), without retraining or finetuning the child models upon the completion of search
  • One of our child models achieves 80.9% top-1 accuracy at 1G FLOPs (four times less computation than a ResNet50)
  • Our preliminary results show that this leads to ∼ 0.3% improvement on average top-1 accuracy for child models compared with sampling different patches
  • The training started to work when we reduced the learning rate to 30% of its original value, but this configuration led to much worse results (∼ 1.0% top-1 accuracy drop on ImageNet)
  • For small-sized models, our BigNASModel-S achieves 76.5% accuracy under only 240 MFLOPs, which is 1.3% better than MobileNetV3 at similar FLOPs and 0.5% better than ResNet-50 [11] with 17× fewer FLOPs
  • For medium-sized models, our BigNASModel-M achieves 1.6% better accuracy than EfficientNet B0
  • For large-sized models, even when ImageNet classification accuracy saturates, our BigNASModel-L still has 0.6% improvement compared with EfficientNet B2
  • The single-stage model is able to converge when we reduce the learning rate to 30% of its original value
  • In comparison, finetuning a Once-for-All child model for 25 epochs improves the top-1 accuracy from 64.0% to 64.4% [4]
Study Subjects and Analysis
Sandwich Rule. In each training step, given a mini-batch of data, the sandwich rule [35] samples the smallest child model, the biggest (full) child model and N randomly sampled child models (N = 2 in our experiments). It then aggregates the gradients from all sampled child models before updating the weights of the single-stage model
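As a rough illustration, here is a minimal PyTorch-style sketch of one sandwich-rule training step under simplifying assumptions: child models are modeled only as width slices of two shared weight matrices, the tensor names and shapes are hypothetical, and other BigNAS training techniques (e.g. distillation) are omitted.

```python
# Minimal sketch of the sandwich rule (illustrative only, not the authors'
# implementation): per step, run the smallest child, the biggest child and
# N random children, accumulate their gradients, then update once.
import random
import torch
import torch.nn.functional as F

full_width = 64
widths = [16, 24, 32, 48, 64]                           # candidate child widths
w1 = torch.randn(full_width, 32, requires_grad=True)    # shared first-layer weights
w2 = torch.randn(10, full_width, requires_grad=True)    # shared classifier weights
opt = torch.optim.SGD([w1, w2], lr=0.1)


def child_forward(x, width):
    """A 'child model' here is just the first `width` units of each shared weight."""
    h = F.relu(x @ w1[:width].t())
    return h @ w2[:, :width].t()


def train_step(x, y, n_random=2):
    opt.zero_grad()
    sampled = [min(widths), max(widths)] + random.sample(widths, n_random)
    for width in sampled:                 # gradients accumulate across children
        loss = F.cross_entropy(child_forward(x, width), y)
        loss.backward()
    opt.step()                            # single update of the shared weights


train_step(torch.randn(8, 32), torch.randint(0, 10, (8,)))
```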

References
  • [1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015), https://www.tensorflow.org/
  • [2] Bender, G., Kindermans, P.J., Zoph, B., Vasudevan, V., Le, Q.: Understanding and simplifying one-shot architecture search. In: International Conference on Machine Learning. pp. 549–558 (2018)
  • [3] Brock, A., Lim, T., Ritchie, J., Weston, N.: SMASH: One-shot model architecture search through hypernetworks. In: International Conference on Learning Representations (2018), https://openreview.net/forum?id=rydeCEhs-
  • [4] Cai, H., Gan, C., Han, S.: Once for all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791 (2019)
  • [5] Cai, H., Zhu, L., Han, S.: ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332 (2018)
  • [6] Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501 (2018)
  • [7] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 248–255 (2009)
  • [8] Goyal, P., Dollar, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
  • [9] Guo, Z., Zhang, X., Mu, H., Heng, W., Liu, Z., Wei, Y., Sun, J.: Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420 (2019)
  • [10] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1026–1034 (2015)
  • [11] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
  • [12] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision. pp. 630–645. Springer (2016)
  • [13] He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., Li, M.: Bag of tricks for image classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 558–567 (2019)
  • [14] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
  • [15] Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al.: Searching for MobileNetV3. arXiv preprint arXiv:1905.02244 (2019)
  • [16] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
  • [17] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132–7141 (2018)
  • [18] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
  • [19] Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.J., Fei-Fei, L., Yuille, A., Huang, J., Murphy, K.: Progressive neural architecture search. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 19–34 (2018)
  • [20] Liu, H., Simonyan, K., Vinyals, O., Fernando, C., Kavukcuoglu, K.: Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436 (2017)
  • [21] Liu, H., Simonyan, K., Yang, Y.: DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018)
  • [22] Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
  • [23] Ma, N., Zhang, X., Zheng, H.T., Sun, J.: ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 116–131 (2018)
  • [24] Pham, H., Guan, M.Y., Zoph, B., Le, Q.V., Dean, J.: Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268 (2018)
  • [25] Ramachandran, P., Zoph, B., Le, Q.V.: Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017)
  • [26] Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548 (2018)
  • [27] [Author list anonymized for double-blind review]: Can weight sharing outperform random architecture search? An investigation with TuNAS. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
  • [28] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381 (2018)
  • [29] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  • [30] Stamoulis, D., Ding, R., Wang, D., Lymberopoulos, D., Priyantha, B., Liu, J., Marculescu, D.: Single-Path NAS: Designing hardware-efficient ConvNets in less than 4 hours. arXiv preprint arXiv:1904.02877 (2019)
  • [31] Tan, M., Chen, B., Pang, R., Vasudevan, V., Le, Q.V.: MnasNet: Platform-aware neural architecture search for mobile. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • [32] Tan, M., Le, Q.V.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: Proceedings of the International Conference on Machine Learning (ICML) (2019)
  • [33] Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., Keutzer, K.: FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10734–10742 (2019)
  • [34] Yu, J., Huang, T.: Network slimming by slimmable networks: Towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728 (2019)
  • [35] Yu, J., Huang, T.: Universally slimmable networks and improved training techniques. arXiv preprint arXiv:1903.05134 (2019)
  • [36] Yu, J., Yang, L., Xu, N., Yang, J., Huang, T.: Slimmable neural networks. arXiv preprint arXiv:1812.08928 (2018)
  • [37] Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016)
  • [38] Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8697–8710 (2018)