# Progressive Neural Architecture Search

European Conference on Computer Vision (ECCV), 2018.

Abstract:

We propose a new method for learning the structure of convolutional neural networks (CNNs) that is more efficient than recent state-of-the-art methods based on reinforcement learning and evolutionary algorithms. Our approach uses a sequential model-based optimization (SMBO) strategy, in which we search for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space.

Introduction

- There has been a lot of recent interest in automatically learning good neural net architectures.
- When using evolutionary algorithms (EA), each neural network structure is encoded as a string, and random mutations and recombinations of the strings are performed during the search process; each string is trained and evaluated on a validation set, and the top performing models generate “children”.
- When using reinforcement learning (RL), the agent performs a sequence of actions that specifies the structure of the model; this model is trained and its validation performance is returned as the reward, which is used to update the RNN controller.

Highlights

- Let PNASNet-5 denote the best convolutional neural network we discovered on CIFAR using Progressive Neural Architecture Search, visualized in Fig. 1
- After we have selected the cell structure, we try various N and F values such that the number of model parameters is around 3M, train each for 300 epochs using an initial learning rate of 0.025 with cosine decay, and pick the best combination based on the validation set
- There are many possible directions for future work, including: the use of better surrogate predictors, such as Gaussian processes with string kernels; the use of model-based early stopping, such as [3], so we can stop the training of “unpromising” models before reaching E1 epochs; the use of “warm starting”, to initialize the training of a larger b + 1-sized model from its smaller parent; the use of Bayesian optimization, in which we use an acquisition function, such as expected improvement or upper confidence bound, to rank the candidate models, rather than greedily picking the top K; adaptively varying the number of models K evaluated at each step; the automatic exploration of speed-accuracy tradeoffs, etc

Methods

- 4.1 Progressive Neural Architecture Search

- Many previous approaches directly search in the space of full cells, or worse, full CNNs.
- In [35], a fixed-length binary string encoding of the CNN architecture is defined and used in model evolution/mutation.
- While this is a more direct approach, the authors argue that it is difficult to directly navigate in an exponentially large search space, especially at the beginning where there is no knowledge of what makes a good model.
- The authors use the predictor to evaluate all the candidate cells and pick the K most promising ones (the predictor takes negligible time to train and apply).
- The authors add these to the queue and repeat the process until they find cells with a sufficient number B of blocks.
- See Algorithm 1 for the pseudocode, and Fig. 2 for an illustration
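The progressive SMBO loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `expand`, `train_and_eval`, `fit_predictor`, and `predict` are hypothetical stand-ins for the search-space expansion, child-model training, and surrogate predictor.

```python
def progressive_search(B, K, expand, train_and_eval, fit_predictor, predict):
    """Sketch of PNAS's progressive search: grow cells block by block,
    using a learned surrogate to rank candidates at each step."""
    # b = 1: the space is small enough to train every cell directly.
    queue = expand(())                       # all 1-block cells
    scores = [train_and_eval(c) for c in queue]
    history = list(zip(queue, scores))
    predictor = fit_predictor(history)       # learn accuracy from (cell, score) pairs

    for b in range(2, B + 1):
        # Expand each surviving cell by one block...
        candidates = [c2 for c in queue for c2 in expand(c)]
        # ...rank all candidates with the cheap surrogate predictor...
        ranked = sorted(candidates, key=lambda c: predict(predictor, c),
                        reverse=True)
        queue = ranked[:K]                   # keep only the K most promising
        # ...train just those K, and refit the predictor on the new data.
        scores = [train_and_eval(c) for c in queue]
        history += list(zip(queue, scores))
        predictor = fit_predictor(history)

    # Return the best of the K cells trained at the final size B
    # (as in the paper, only full-size cells are candidates for selection).
    return max(history[-K:], key=lambda t: t[1])[0]
```

The key economy is visible in the loop: only K cells per level are ever trained, while the (much larger) set of expanded candidates is scored by the surrogate alone.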

Results

- 5.1 Experimental Details

- The authors' experimental setting follows [41]; in particular, they conduct most of the experiments on CIFAR-10 [19].
- After selecting the cell structure, the authors try various N and F values such that the number of model parameters is around 3M, train each for 300 epochs using an initial learning rate of 0.025 with cosine decay, and pick the best combination based on the validation set.
- Input image size is 331 × 331
- In both experiments the authors use the RMSProp optimizer, label smoothing of 0.1, an auxiliary classifier located at 2/3 of the maximum depth weighted by 0.4, weight decay of 4e-5, and dropout of 0.5 in the final softmax layer.
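The cosine-decay schedule used above (initial learning rate 0.025 annealed over 300 epochs, as in SGDR [22]) can be written as a small helper; the function name is ours:

```python
import math

def cosine_decay_lr(epoch, total_epochs=300, lr_init=0.025, lr_min=0.0):
    """Cosine-annealed learning rate: starts at lr_init and decays
    smoothly to lr_min as epoch goes from 0 to total_epochs."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cos)
```

For example, the rate starts at 0.025, passes through 0.0125 at the halfway point (epoch 150), and reaches 0 at epoch 300.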

Conclusion

- The main contribution of this work is to show how the authors can accelerate the search for good CNN structures by using progressive search through the space of increasingly complex graphs, combined with a learned prediction function to efficiently identify the most promising models to explore.
- There are many possible directions for future work, including: the use of better surrogate predictors, such as Gaussian processes with string kernels; the use of model-based early stopping, such as [3], so the authors can stop the training of “unpromising” models before reaching E1 epochs; the use of “warm starting”, to initialize the training of a larger b + 1-sized model from its smaller parent; the use of Bayesian optimization, in which the authors use an acquisition function, such as expected improvement or upper confidence bound, to rank the candidate models, rather than greedily picking the top K; adaptively varying the number of models K evaluated at each step; the automatic exploration of speed-accuracy tradeoffs, etc

Tables

- Table1: Spearman rank correlations of different predictors on the training set, ρb, and when extrapolating to unseen larger models, ρb+1. See text for details
- Table2: Relative efficiency of PNAS (using MLP-ensemble predictor) and NAS under the same search space. B is the size of the cell, “Top” is the number of top models we pick, “Accuracy” is their average validation accuracy, “# PNAS” is the number of models evaluated by PNAS, “# NAS” is the number of models evaluated by NAS to achieve the desired accuracy. Speedup measured by number of examples is greater than speedup in terms of number of models, because NAS has an additional reranking stage, that trains the top 250 models for 300 epochs each before picking the best one. We see that PNAS is up to 5 times faster in terms of the number of models it trains and evaluates
- Table3: Performance of different CNNs on CIFAR test set. All model comparisons employ a comparable number of parameters and exclude cutout data augmentation [9]. “Error” is the top-1 misclassification rate on the CIFAR-10 test set. (Error rates have the form μ ± σ, where μ is the average over multiple trials and σ is the standard deviation. In PNAS we use 15 trials.) “Params” is the number of model parameters. “Cost” is the total number of examples processed through SGD (M1E1 + M2E2) before the architecture search terminates. The number of filters F for NASNet-{B, C} cannot be determined (hence N/A), and the actual E1, E2 may be larger than the values in this table (hence the range in cost), according to the original authors
- Table4: ImageNet classification results in the Mobile setting
- Table5: ImageNet classification results in the Large setting
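The Spearman rank correlation reported in Table 1 compares the ranking a surrogate predictor induces against the true validation-accuracy ranking; it is the Pearson correlation of the rank values. A standalone sketch (assuming no ties, which keeps it simple):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Assumes all values within each list are distinct (no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because only ranks matter, ρ = 1 for any monotonically increasing relationship between predicted and true accuracy, which is exactly the property needed when the predictor is used only to pick the top K candidates.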

Related work

- Our paper is based on the “neural architecture search” (NAS) method proposed in [40,41]. In the original paper [40], they use the REINFORCE algorithm [34] to estimate the parameters of a recurrent neural network (RNN), which represents a policy to generate a sequence of symbols (actions) specifying the structure of the CNN; the reward function is the classification accuracy on the validation set of a CNN generated from this sequence. [41] extended this by using a more structured search space, in which the CNN was defined in terms of a series of stacked “cells”. (They also replaced REINFORCE with proximal policy optimization (PPO) [29].) This method was able to learn CNNs which outperformed almost all previous methods in terms of accuracy vs speed on image classification (using CIFAR-10 [19] and ImageNet [8]) and object detection (using COCO [20]).
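The REINFORCE update at the heart of [40] scales the log-probability gradient of each sampled action by its reward. A minimal single-step version with a softmax policy over discrete architecture choices (a toy sketch, not the paper's RNN controller; `reward_fn` stands in for training a child network and measuring validation accuracy):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, reward_fn, lr=0.5, baseline=0.0):
    """One REINFORCE update: sample an action from the softmax policy,
    observe its reward, and increase the log-probability of that action
    in proportion to (reward - baseline)."""
    probs = softmax(logits)
    action = random.choices(range(len(logits)), weights=probs)[0]
    r = reward_fn(action)
    # d log pi(action) / d logit_i  =  1[i == action] - probs[i]
    new_logits = [
        logit + lr * (r - baseline) * ((1.0 if i == action else 0.0) - probs[i])
        for i, logit in enumerate(logits)
    ]
    return new_logits, r
```

Repeating this step concentrates probability mass on the highest-reward action, which is the mechanism the NAS controller relies on, with sequences of actions and a learned RNN in place of a single softmax.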

There are several other papers that use RL to learn network structures. [39] use the same model search space as NAS, but replace policy gradient with Q-learning. [2] also use Q-learning, but without exploiting cell structure. [5] use policy gradient to train an RNN, but the actions are now to widen an existing layer, or to deepen the network by adding an extra layer. This requires specifying an initial model and then gradually learning how to transform it. The same approach, of applying “network morphisms” to modify a network, was used in [12], but in the context of hill-climbing search, rather than RL. [26] use parameter sharing among child models to substantially accelerate the search process.

An alternative to RL is to use evolutionary algorithms (EA; “neuroevolution” [32]). Early work (e.g., [33]) used EA to learn both the structure and the parameters of the network, but more recent methods, such as [21,24,27,28,35], just use EA to search the structures, and use SGD to estimate the parameters.

Reference

- 1. Baisero, A., Pokorny, F.T., Ek, C.H.: On a family of decomposable kernels on sequences. CoRR abs/1501.06284 (2015)
- 2. Baker, B., Gupta, O., Naik, N., Raskar, R.: Designing neural network architectures using reinforcement learning. In: ICLR (2017)
- 3. Baker, B., Gupta, O., Raskar, R., Naik, N.: Accelerating neural architecture search using performance prediction. CoRR abs/1705.10823 (2017)
- 4. Brock, A., Lim, T., Ritchie, J.M., Weston, N.: SMASH: one-shot model architecture search through hypernetworks. In: ICLR (2018)
- 5. Cai, H., Chen, T., Zhang, W., Yu, Y., Wang, J.: Efficient architecture search by network transformation. In: AAAI (2018)
- 6. Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. In: NIPS (2017)
- 7. Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., Yang, S.: AdaNet: adaptive structural learning of artificial neural networks. In: ICML (2017)
- 8. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)
- 9. Devries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. CoRR abs/1708.04552 (2017)
- 10. Domhan, T., Springenberg, J.T., Hutter, F.: Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In: IJCAI (2015)
- 11. Dong, J.D., Cheng, A.C., Juan, D.C., Wei, W., Sun, M.: PPP-Net: platform-aware progressive search for pareto-optimal neural architectures. In: ICLR Workshop (2018)
- 12. Elsken, T., Metzen, J.H., Hutter, F.: Simple and efficient architecture search for convolutional neural networks. CoRR abs/1711.04528 (2017)
- 13. Grosse, R.B., Salakhutdinov, R., Freeman, W.T., Tenenbaum, J.B.: Exploiting compositionality to explore a large space of model structures. In: UAI (2012)
- 14. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861 (2017)
- 15. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. CoRR abs/1709.01507 (2017)
- 16. Huang, F., Ash, J.T., Langford, J., Schapire, R.E.: Learning deep ResNet blocks sequentially using boosting theory. CoRR abs/1706.04964 (2017)
- 17. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: Coello, C.A.C. (ed.) LION 2011. LNCS, vol. 6683, pp. 507–523. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25566-3_40
- 18. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
- 19. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009)
- 20. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
- 21. Liu, H., Simonyan, K., Vinyals, O., Fernando, C., Kavukcuoglu, K.: Hierarchical representations for efficient architecture search. In: ICLR (2018)
- 22. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. In: ICLR (2017)
- 23. Mendoza, H., Klein, A., Feurer, M., Springenberg, J.T., Hutter, F.: Towards automatically-tuned neural networks. In: ICML Workshop on AutoML, pp. 58–65 (2016)
- 24. Miikkulainen, R., et al.: Evolving deep neural networks. CoRR abs/1703.00548 (2017)
- 25. Negrinho, R., Gordon, G.J.: DeepArchitect: automatically designing and training deep architectures. CoRR abs/1704.08792 (2017)
- 26. Pham, H., Guan, M.Y., Zoph, B., Le, Q.V., Dean, J.: Efficient neural architecture search via parameter sharing. CoRR abs/1802.03268 (2018)
- 27. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. CoRR abs/1802.01548 (2018)
- 28. Real, E., et al.: Large-scale evolution of image classifiers. In: ICML (2017)
- 29. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. CoRR abs/1707.06347 (2017)
- 30. Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., de Freitas, N.: Taking the human out of the loop: a review of Bayesian optimization. Proc. IEEE 104(1), 148–175 (2016)
- 31. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: NIPS (2012)
- 32. Stanley, K.O.: Neuroevolution: a different kind of deep learning, July 2017
- 33. Stanley, K.O., Miikkulainen, R.: Evolving neural networks through augmenting topologies. Evol. Comput. 10(2), 99–127 (2002)
- 34. Williams, R.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 229–256 (1992)
- 35. Xie, L., Yuille, A.L.: Genetic CNN. In: ICCV (2017)
- 36. Xie, S., Girshick, R.B., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)
- 37. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. CoRR abs/1707.01083 (2017)
- 38. Zhang, X., Li, Z., Loy, C.C., Lin, D.: PolyNet: a pursuit of structural diversity in very deep networks. In: CVPR (2017)
- 39. Zhong, Z., Yan, J., Liu, C.L.: Practical network blocks design with Q-learning. In: AAAI (2018)
- 40. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. In: ICLR (2017)
- 41. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: CVPR (2018)
