We proposed Once for All, a new methodology that decouples model training from architecture search for efficient deep learning deployment under a large number of deployment scenarios
Once for All: Train One Network and Specialize it for Efficient Deployment
International Conference on Learning Representations (ICLR), 2020
We address the challenging problem of efficient deep learning model deployment across many devices, where the goal is to design neural network architectures that can fit diverse hardware platform constraints: from the cloud to the edge. Most of the traditional approaches either manually design or use neural architecture search (NAS) to find a specialized neural network and train it from scratch for each case, which is computationally prohibitive.
- Deep Neural Networks (DNNs) deliver state-of-the-art accuracy in many machine learning applications.
- Designing specialized DNNs for every scenario is engineer-expensive and computationally expensive, whether done by human experts or by neural architecture search (NAS), since such methods must repeat the network design process and retrain the designed network from scratch for each case.
- This makes them unable to handle the vast number of hardware devices (23.14 billion IoT devices as of 2018) and highly dynamic deployment environments.
- This paper introduces a new solution to tackle this challenge – designing a once-for-all network that can be directly deployed under diverse architectural configurations, amortizing the training cost
- once-for-all network consistently improves the trade-off between accuracy and latency by a significant margin, especially on GPUs which have more parallelism
- It reveals the insight that using the same model for different deployment scenarios with only the width multiplier modified has a limited impact on efficiency improvement: the accuracy drops quickly as the latency constraint gets tighter
- We proposed Once for All (OFA), a new methodology that decouples model training from architecture search for efficient deep learning deployment under a large number of deployment scenarios
- To prevent sub-networks of different sizes from interfering with one another, we proposed a progressive shrinking algorithm that enables a large number of sub-networks to achieve the same level of accuracy as training them independently.
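The shrinking schedule can be sketched as follows. The stage ordering and the elastic dimensions (kernel size, depth, width expansion ratio) follow the paper's description; the helper names, value sets, and the bare training loop are illustrative assumptions, not the paper's implementation (which also distills knowledge from the full network into the sampled sub-networks):

```python
import random

# Progressive shrinking: start from the full network, then progressively
# enable smaller kernel sizes, then shallower depths, then narrower widths,
# sampling sub-networks at each stage so that small and large sub-networks
# do not interfere with each other's weights.
STAGES = [
    {"kernel": [7],       "depth": [4],       "width": [6]},        # full network
    {"kernel": [7, 5, 3], "depth": [4],       "width": [6]},        # elastic kernel
    {"kernel": [7, 5, 3], "depth": [4, 3, 2], "width": [6]},        # elastic depth
    {"kernel": [7, 5, 3], "depth": [4, 3, 2], "width": [6, 4, 3]},  # elastic width
]

def sample_subnetwork(stage, num_units=5):
    """Sample one architectural configuration allowed at this stage."""
    return [
        {"kernel": random.choice(stage["kernel"]),
         "depth": random.choice(stage["depth"]),
         "width": random.choice(stage["width"])}
        for _ in range(num_units)
    ]

def progressive_shrinking(train_step, steps_per_stage=4):
    """Run the four stages in order; `train_step` is a stand-in for one
    gradient update on the sampled sub-network's shared weights."""
    for stage in STAGES:
        for _ in range(steps_per_stage):
            config = sample_subnetwork(stage)
            train_step(config)
```

Because later stages still sample the large settings, the full network keeps being trained even while smaller sub-networks are introduced.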
- The training objective is min over W_o of the summed validation loss L_val(C(W_o, arch_i)) across architectures, where C(W_o, arch_i) denotes a selection scheme that selects part of the model from the once-for-all network W_o to form a sub-network with architectural configuration arch_i.
- The overall training objective is to optimize Wo to make each supported sub-network maintain the same level of accuracy as independently training a network with the same architectural configuration.
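A minimal sketch of the selection scheme C(W_o, arch_i) for one convolutional layer: the sub-network reuses a slice of the once-for-all weights rather than separate parameters. The array shapes and the centered-crop rule here are illustrative assumptions; the paper additionally learns kernel transformation matrices and sorts channels by importance, which this sketch omits:

```python
import numpy as np

def select_subnetwork_weights(full_kernel, k, out_channels):
    """Slice a sub-network's conv weights out of the once-for-all weights.

    full_kernel: array of shape (C_out_max, C_in, K_max, K_max), K_max = 7.
    Returns the centered k x k crop of the first `out_channels` filters,
    so every sub-network shares weights with the full network.
    """
    K_max = full_kernel.shape[-1]
    start = (K_max - k) // 2
    return full_kernel[:out_channels, :, start:start + k, start:start + k]
```

For example, slicing a (6, 3, 7, 7) weight tensor with k = 3 and 4 output channels yields a (4, 3, 3, 3) sub-network kernel without any new parameters.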
- Each unit consists of a sequence of layers where only the first layer has stride 2 if the feature map size decreases (Sandler et al, 2018).
- Figure 5 summarizes the results of OFA under different FLOPs and Pixel1 latency constraints.
- Figure 6 reports detailed comparisons between OFA and MobileNetV3 on 6 mobile devices.
- OFA can produce the entire trade-off curves with many points over a wide range of latency constraints by training only once.
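Tracing that trade-off curve amounts to searching the trained once-for-all network under each latency constraint. The paper uses an evolutionary search guided by accuracy and latency predictors; the sketch below approximates it with random sampling and stand-in predictor functions (all names here are illustrative):

```python
import random

def random_config(num_units=5):
    """Sample a per-unit (kernel, depth, width) configuration."""
    return tuple(
        (random.choice([3, 5, 7]),   # kernel size
         random.choice([2, 3, 4]),   # depth (layers per unit)
         random.choice([3, 4, 6]))   # width expansion ratio
        for _ in range(num_units)
    )

def search(predict_accuracy, predict_latency, latency_limit, trials=1000):
    """Return the sampled config with the best predicted accuracy among
    those meeting the latency constraint; no retraining is needed,
    since the weights come from the once-for-all network."""
    best, best_acc = None, -1.0
    for _ in range(trials):
        cfg = random_config()
        if predict_latency(cfg) <= latency_limit:
            acc = predict_accuracy(cfg)
            if acc > best_acc:
                best, best_acc = cfg, acc
    return best
```

Sweeping `latency_limit` over a range of values and calling `search` once per value yields one point of the accuracy-latency curve per constraint.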
- The profiling results are summarized in Figure 8, along with roofline models (Williams et al, 2009) for the two FPGA platforms.
- We proposed Once for All (OFA), a new methodology that decouples model training from architecture search for efficient deep learning deployment under a large number of deployment scenarios.
- Unlike previous approaches that design and train a neural network for each deployment scenario, we specialize sub-networks of a single once-for-all network for each platform, reported (a) on the Xilinx ZU9EG FPGA and (b) on the Xilinx ZU3EG FPGA.
- [Figure: specialized sub-network for the ZU3EG FPGA (4.1 ms latency, R = 164), composed of a 3x3 Conv stem, a sequence of MBConv blocks (MB1–MB6) with 3x3 kernels, pooling, and an FC layer]
- Table 1: ImageNet top-1 accuracy (%) of sub-networks under resolution 224 × 224. "(D = d, W = w, K = k)" denotes a sub-network with d layers in each unit, where each layer has a width expansion ratio w and kernel size k. "Mbv3-L" denotes "MobileNetV3-Large".
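These three elastic dimensions make the supported sub-network count easy to verify. Assuming 5 units, depths {2, 3, 4}, and 3 kernel sizes × 3 width ratios chosen independently per layer (the settings quoted in the paper), each unit contributes the sum over depths d of 9^d configurations:

```python
# Count distinct sub-networks: 5 units; each unit picks a depth d in {2,3,4};
# each of its d layers independently picks one of 3 kernel sizes and
# 3 width expansion ratios (9 combinations per layer).
per_unit = sum(9 ** d for d in (2, 3, 4))   # 81 + 729 + 6561 = 7371 per unit
total = per_unit ** 5
print(f"{total:.1e}")   # prints 2.2e+19
```

This reproduces the roughly 2 × 10^19 sub-networks the once-for-all network supports, all sharing one set of weights.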
- Table 2: Comparison with SOTA hardware-aware NAS methods on the Pixel 1 phone. OFA decouples model training from architecture search. The search cost and training cost both stay constant as the number of deployment scenarios grows. "#25" denotes that the specialized sub-networks are fine-tuned for 25 epochs after grabbing weights from the once-for-all network. "CO2e" denotes CO2 emission, calculated based on Strubell et al (2019). AWS cost is calculated based on the price of on-demand P3.16xlarge instances.
- Efficient Deep Learning. Many efficient neural network architectures are proposed to improve the hardware efficiency, such as SqueezeNet (Iandola et al, 2016), MobileNets (Howard et al, 2017; Sandler et al, 2018), ShuffleNets (Ma et al, 2018; Zhang et al, 2018), etc. Orthogonal to architecting efficient neural networks, model compression (Han et al, 2016) is another very effective technique
for efficient deep learning, including network pruning that removes redundant units (Han et al, 2015) or redundant channels (He et al, 2018; Liu et al, 2017), and quantization that reduces the bit width for the weights and activations (Han et al, 2016; Courbariaux et al, 2015; Zhu et al, 2017).
Neural Architecture Search. Neural architecture search (NAS) focuses on automating the architecture design process (Zoph & Le, 2017; Zoph et al, 2018; Real et al, 2019; Cai et al, 2018a; Liu et al, 2019). Early NAS methods (Zoph et al, 2018; Real et al, 2019; Cai et al, 2018b) search for high-accuracy architectures without taking hardware efficiency into consideration. Therefore, the produced architectures (e.g., NASNet, AmoebaNet) are not efficient for inference. Recent hardware-aware NAS methods (Cai et al, 2019; Tan et al, 2019; Wu et al, 2019) directly incorporate the hardware feedback into architecture search. As a result, they are able to improve inference efficiency. However, given new inference hardware platforms, these methods need to repeat the architecture search process and retrain the model, leading to prohibitive GPU hours, dollars, and CO2 emission. They are not scalable to a large number of deployment scenarios. The individually trained models do not share any weight, leading to a large total model size and high downloading bandwidth.
- We thank NSF Career Award #1943349, MIT-IBM Watson AI Lab, Google-Daydream Research Award, Samsung, Intel, Xilinx, SONY, AWS Machine Learning Research Award for supporting this research
- Anubhav Ashok, Nicholas Rhinehart, Fares Beainy, and Kris M Kitani. N2N learning: Network to network compression via policy gradient reinforcement learning. In ICLR, 2018.
- Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by network transformation. In AAAI, 2018a.
- Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. Path-level network transformation for efficient architecture search. In ICML, 2018b.
- Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019. URL https://arxiv.org/pdf/1812.00332.pdf.
- Brian Cheung, Alex Terekhov, Yubei Chen, Pulkit Agrawal, and Bruno Olshausen. Superposition of many models into one. In NeurIPS, 2019.
- Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In NeurIPS, 2015.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420, 2019.
- Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NeurIPS, 2015.
- Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In ICCV, 2019.
- Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
- Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Q Weinberger. Multi-scale dense networks for resource efficient image classification. In ICLR, 2018.
- Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
- Jason Kuen, Xiangfei Kong, Zhe Lin, Gang Wang, Jianxiong Yin, Simon See, and Yap-Peng Tan. Stochastic downsampling for cost-adjustable inference and improved regularization in convolutional networks. In CVPR, 2018.
- Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In NeurIPS, 2017.
- Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, 2018.
- Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.
- Lanlan Liu and Jia Deng. Dynamic deep neural networks: Optimizing accuracy-efficiency trade-offs by selective execution. In AAAI, 2018.
- Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
- Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In ECCV, 2018.
- Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In AAAI, 2019.
- Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
- Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. In ACL, 2019.
- Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. In CVPR, 2019.
- Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E Gonzalez. SkipNet: Learning dynamic routing in convolutional networks. In ECCV, 2018.
- Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Technical report, Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, 2009.
- Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, 2019.
- Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. BlockDrop: Dynamic inference paths in residual networks. In CVPR, 2018.
- Jiahui Yu and Thomas Huang. AutoSlim: Towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728, 2019a.
- Jiahui Yu and Thomas Huang. Universally slimmable networks and improved training techniques. In ICCV, 2019b.
- Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. In ICLR, 2019.
- Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
- Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. In ICLR, 2017.
- Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
- Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.