DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices

National Conference on Artificial Intelligence (AAAI), 2018.

An acceleration framework, DeepRebirth, is proposed to speed up neural networks with satisfactory accuracy by regenerating new tensor layers that absorb non-tensor layers and their neighboring units.

Abstract:

Deploying deep neural networks on mobile devices is a challenging task. Current model compression methods such as matrix decomposition effectively reduce the deployed model size, but still cannot satisfy real-time processing requirements. This paper first discovers that the major obstacle is the excessive execution time of non-tensor layers…

Introduction
  • Recent years have witnessed the breakthrough of deep learning techniques for many computer vision tasks, e.g., image classification (Krizhevsky, Sutskever, and Hinton 2012; Szegedy et al. 2014), object detection and tracking (Ren et al. 2015; Yu et al. 2016; Du et al. 2017), video understanding (Donahue et al. 2015; Li et al. 2017), content generation (Goodfellow et al. 2014; Zhang, Song, and Qi 2017), disease diagnosis (Shen, Wu, and Suk; Zhang et al. 2017) and privacy image analytics (Tran, Kong, and Liu 2016).
  • More and more mobile applications adopt deep learning techniques to provide accurate, intelligent and effective services.
  • The execution speed of deep learning models on mobile devices becomes a bottleneck for deployment of many applications due to limited computing resources.
  • The authors focus on improving the execution efficiency of deep learning models on mobile devices, which is a highly desirable capability.
  • The authors define execution efficiency in terms of model inference speed, energy cost and run-time memory consumption.
  • An effective solution is expected to incur minimal accuracy loss while leveraging widely used deep neural network architectures and supporting model acceleration across different types of layers.
Highlights
  • Recent years have witnessed the breakthrough of deep learning techniques for many computer vision tasks, e.g., image classification (Krizhevsky, Sutskever, and Hinton 2012; Szegedy et al. 2014), object detection and tracking (Ren et al. 2015; Yu et al. 2016; Du et al. 2017), video understanding (Donahue et al. 2015; Li et al. 2017), content generation (Goodfellow et al. 2014; Zhang, Song, and Qi 2017), disease diagnosis (Shen, Wu, and Suk; Zhang et al. 2017) and privacy image analytics (Tran, Kong, and Liu 2016)
  • The execution speed of deep learning models on mobile devices becomes a bottleneck for deployment of many applications due to limited computing resources
  • We focus on improving the execution efficiency of deep learning models on mobile devices, which is a highly desirable capability
  • This paper proposes DeepRebirth, a new deep learning model acceleration framework that significantly reduces the execution time on non-tensor layers
  • An acceleration framework, DeepRebirth, is proposed to speed up neural networks with satisfactory accuracy by regenerating new tensor layers that absorb non-tensor layers and their neighboring units
  • DeepRebirth achieves state-of-the-art speed-up on popular deep learning models with negligible accuracy loss, enabling GoogLeNet to run 3x-5x faster when processing a single image with only a 0.4% drop in Top-5 accuracy on ImageNet, without any weight compression method
  • By applying DeepRebirth to different deep learning architectures, we obtain significant speed-up on different processors, which will readily facilitate the deployment of deep learning models on mobile devices
Methods
  • The streamline slimming regenerates a new tensor layer by merging non-tensor layers with their bottom tensor units in the feed-forward structure (a closed-form special case is sketched after this list).
  • To absorb a non-tensor branch into a tensor branch, the authors recreate a new tensor layer by fusing the non-tensor branch and a tensor unit with relatively small latency, so that it outputs the feature maps originally generated by the non-tensor branch.
  • As shown in Figure 4, the authors re-learn a new tensor layer “inception 3a” by merging the 3 × 3 pooling branch with the 5 × 5 convolution branch at the same level, and the number of feature maps obtained by the 5 × 5 convolution is increased from 32 to 64
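
As a concrete illustration of merging a non-tensor layer into its bottom tensor unit: a "BatchNorm" (plus "Scale") layer can be folded into the preceding convolution in closed form at inference time (this is the absorption step mentioned for ResNet in Table 8). DeepRebirth's streamline slimming goes further and re-learns a merged tensor layer for non-tensor layers such as LRN and pooling, which have no closed form; the sketch below covers only the BatchNorm case, and the function name and shapes are illustrative rather than taken from the paper.

```python
import numpy as np

def fold_batchnorm_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold an inference-time BatchNorm (+ Scale) into the preceding conv.

    W: conv weights, shape (C_out, C_in, kH, kW); b: conv bias, shape (C_out,).
    gamma, beta: learned scale/shift; mean, var: running statistics.
    Returns (W', b') such that conv(x, W') + b' == BN(conv(x, W) + b).
    """
    scale = gamma / np.sqrt(var + eps)           # per-output-channel factor
    W_folded = W * scale[:, None, None, None]    # rescale each output filter
    b_folded = (b - mean) * scale + beta         # fold shift into the bias
    return W_folded, b_folded

# Quick check against an explicit BatchNorm on random data (1x1 conv case).
C_out, C_in = 4, 3
W = np.random.randn(C_out, C_in, 1, 1)
b = np.random.randn(C_out)
gamma, beta = np.random.rand(C_out), np.random.randn(C_out)
mean, var = np.random.randn(C_out), np.random.rand(C_out)
x = np.random.randn(C_in)
y = W[:, :, 0, 0] @ x + b                        # conv output at one pixel
bn = gamma * (y - mean) / np.sqrt(var + 1e-5) + beta
Wf, bf = fold_batchnorm_into_conv(W, b, gamma, beta, mean, var)
assert np.allclose(Wf[:, :, 0, 0] @ x + bf, bn)
```

Because this folding is exactly equivalent, it needs no fine-tuning; the re-learned merges for pooling and LRN do require fine-tuning to recover accuracy.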
Results
  • To evaluate the performance of DeepRebirth, the authors performed a comprehensive evaluation on top of GoogLeNet, AlexNet and ResNet.
  • The authors apply the proposed DeepRebirth optimization scheme to accelerate GoogLeNet; the optimized model is denoted "GoogLeNet-Slim".
  • After non-tensor layer optimization, the authors further apply the Tucker decomposition approach (Kim et al. 2015) to reduce the model size by 50%, denoted "GoogLeNet-Slim-Tucker" (a sketch of the decomposition follows this list).
  • In addition, the authors directly apply the Tucker decomposition method to compress the original GoogLeNet; this is denoted "GoogLeNet-Tucker".
  • The authors have 4 variations of GoogLeNet to compare, namely GoogLeNet, GoogLeNet-Slim, GoogLeNet-Tucker and GoogLeNet-Slim-Tucker.
  • The authors compare with SqueezeNet (Iandola et al. 2016), a state-of-the-art compact neural network with only 1.2M learnable parameters
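
For context on the compression step, Tucker decomposition (Kim et al. 2015) factorizes a convolution kernel along its input- and output-channel modes, so one layer can be replaced by a 1x1 convolution, a smaller kH x kW convolution, and another 1x1 convolution. Below is a minimal NumPy sketch of this Tucker-2 factorization via truncated HOSVD; the ranks here are illustrative, and Kim et al. additionally select ranks with variational Bayesian matrix factorization and fine-tune afterwards, which this sketch omits.

```python
import numpy as np

def tucker2_conv_kernel(W, r_out, r_in):
    """Tucker-2 decomposition of a conv kernel W (C_out, C_in, kH, kW)
    via truncated HOSVD on the output- and input-channel modes."""
    C_out, C_in, kH, kW = W.shape
    # Mode-1 unfolding (output channels) and truncated SVD.
    U_out = np.linalg.svd(W.reshape(C_out, -1), full_matrices=False)[0][:, :r_out]
    # Mode-2 unfolding (input channels) and truncated SVD.
    W2 = W.transpose(1, 0, 2, 3).reshape(C_in, -1)
    U_in = np.linalg.svd(W2, full_matrices=False)[0][:, :r_in]
    # Core tensor: project W onto the two factor subspaces.
    core = np.einsum('oikl,or,is->rskl', W, U_out, U_in)
    # U_in.T acts as a 1x1 conv, core as the kHxkW conv, U_out as a 1x1 conv.
    return U_out, core, U_in

W = np.random.randn(64, 32, 5, 5)                 # e.g., a 5x5 conv layer
U_out, core, U_in = tucker2_conv_kernel(W, r_out=32, r_in=16)
W_hat = np.einsum('rskl,or,is->oikl', core, U_out, U_in)
print('relative error:', np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

The printed relative error indicates how much the chosen ranks distort the kernel before fine-tuning; smaller ranks give a smaller, faster model at the cost of a larger reconstruction error.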
Conclusion
  • An acceleration framework, DeepRebirth, is proposed to speed up neural networks with satisfactory accuracy by regenerating new tensor layers that absorb non-tensor layers and their neighboring units.
  • DeepRebirth is compatible with state-of-the-art deep models such as GoogLeNet and ResNet, on which most parameter-weight compression methods fail.
  • The authors plan to integrate DeepRebirth with other state-of-the-art tensor-layer compression methods and to extend the evaluation to heterogeneous mobile processors such as mobile GPUs and DSPs. The authors envision that understanding the characteristics of these different chips will help design better algorithms and further improve model execution efficiency
Tables
  • Table1: Compare DeepRebirth with Existing Acceleration Methods on CPU of Samsung Galaxy S5 Mobile Device
  • Table2: Percentage of Forwarding Time on Non-tensor Layers
  • Table3: GoogLeNet Accuracy on Slimming Each Layer
  • Table4: Layer breakdown of GoogLeNet forwarding time cost
  • Table5: Execution time using different methods (including SqueezeNet) on different processors
  • Table6: Storage, Energy and Runtime-Memory Comparison
  • Table7: AlexNet Result (Accuracy vs. Speed vs. Energy cost)
  • Table8: ResNet (conv1-res2a) Result (Accuracy vs. Speed up). For each step, we absorb the "BatchNorm" and "Scale" layers into the bottom convolution layer
Related work
  • Reducing the model size and accelerating the running speed are two general ways to facilitate the deployment of deep learning models on mobile devices. Many efforts have been spent on reducing the model size. In particular, most works focus on optimizing tensor layers to reduce the model size, due to the high redundancy in the learned parameters of a given deep model's tensor layers. Vanhoucke et al. (Vanhoucke, Senior, and Mao 2011) proposed a fixed-point implementation with 8-bit integer activations to reduce the parameter storage of the deep neural network, while (Gong et al. 2014) applied vector quantization to compress deep convnets. These approaches, however, mainly focus on compressing the fully connected layers without considering the convolutional layers. To reduce the parameter size, Denton et al. (Denton et al. 2014) applied the low-rank approximation approach to compress neural networks with linear structures. Afterwards, hashing functions, which have been widely adopted to improve the efficiency of traditional computer vision tasks (Wang, Kumar, and Chang 2010; Du, Abd-Almageed, and Doermann 2013), were utilized to reduce model sizes by randomly grouping connection weights (Chen et al. 2015). More recently, Han et al. (Han, Mao, and Dally 2016) proposed to effectively reduce model size and achieve speed-up by a combination of pruning, Huffman coding and quantization. However, the benefits can only be achieved by running the compressed model on a specialized processor (Han et al. 2016).
Funding
  • As observed in the experiments, DeepRebirth achieves more than 3x speed-up and 2.5x run-time memory saving on GoogLeNet, with only a 0.4% drop in top-5 accuracy on ImageNet
  • Furthermore, by combining with other model compression techniques, DeepRebirth achieves an average inference time of 106.3 ms on the CPU of a Samsung Galaxy S5 with 86.5% top-5 accuracy, 14% faster than SqueezeNet, which reaches a top-5 accuracy of only 80.5%
  • In the original paper, the authors reported a small 0.24% accuracy loss at a compression rate of 31.9%
  • For our model at the same 31.9% compression rate, we likewise observe only a small 0.31% accuracy loss
  • DeepRebirth achieves state-of-the-art speed-up on popular deep learning models with negligible accuracy loss, enabling GoogLeNet to run 3x-5x faster when processing a single image with only a 0.4% drop in Top-5 accuracy on ImageNet, without any weight compression method
  • DeepRebirth achieves around 106.3 ms for processing a single image with Top-5 accuracy up to 86.5%
  • ResNet-50 has abandoned the "LRN" layers by introducing batch normalization, but the findings remain valid, as batch normalization takes up more than 25% of the inference time on an ARM CPU and more than 40% on an Intel x86 CPU (in Caffe (Jia et al. 2014), it is decomposed into a "BatchNorm" layer followed by a "Scale" layer, as shown in Figure 2c)
  • Directly applying the Tucker decomposition method (GoogLeNet-Tucker) to reduce the GoogLeNet weights by half drops the top-5 accuracy to 85.7%
  • The Tucker decomposition method further reduces the computation by around 50% at the cost of around 2% accuracy loss
  • Comparing the proposed approach with SqueezeNet (Iandola et al. 2016), our optimization obtains faster speed on all mobile devices with much higher accuracy (the Top-5 accuracy for SqueezeNet is 80%), as listed in Table 5
  • We illustrate the result in Table 7: by applying slimming to the first two layers, the model forwarding time of AlexNet is reduced from 445 ms to 274 ms on the Samsung Galaxy S5, while the Top-5 accuracy drops slightly from 80.03% to 79.57%
Reference
  • Arora, S.; Bhaskara, A.; Ge, R.; and Ma, T. 2013. Provable bounds for learning some deep representations. CoRR abs/1310.6343.
  • Bottou, L. 2012. Stochastic Gradient Descent Tricks, volume 7700. Springer. 430–445.
  • Chen, W.; Wilson, J. T.; Tyree, S.; Weinberger, K. Q.; and Chen, Y. 2015. Compressing neural networks with the hashing trick. CoRR abs/1504.04788.
  • Denton, E. L.; Zaremba, W.; Bruna, J.; LeCun, Y.; and Fergus, R. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, 1269–1277.
  • Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; and Darrell, T. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2625–2634.
  • Du, X.; Abd-Almageed, W.; and Doermann, D. S. 2013. Large-scale signature matching using multi-stage hashing. In 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, August 25-28, 2013, 976–980.
  • Du, X.; El-Khamy, M.; Lee, J.; and Davis, L. S. 2017. Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection. In 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, Santa Rosa, CA, USA, March 24-31, 2017, 953–961.
  • Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2010). Society for Artificial Intelligence and Statistics.
  • Gong, Y.; Liu, L.; Yang, M.; and Bourdev, L. 2014. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115.
  • Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
  • Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M. A.; and Dally, W. J. 2016. EIE: Efficient inference engine on compressed deep neural network. International Conference on Computer Architecture (ISCA).
  • Han, S.; Mao, H.; and Dally, W. J. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations (ICLR).
  • He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
  • Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv e-prints.
  • Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861.
  • Iandola, F. N.; Moskewicz, M. W.; Ashraf, K.; Han, S.; Dally, W. J.; and Keutzer, K. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv:1602.07360.
  • Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 448–456.
  • Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
  • Kim, Y.; Park, E.; Yoo, S.; Choi, T.; Yang, L.; and Shin, D. 2015. Compression of deep convolutional neural networks for fast and low power mobile applications. CoRR abs/1511.06530.
  • Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.
  • Li, W.; Wen, L.; Chang, M.-C.; Nam Lim, S.; and Lyu, S. 2017. Adaptive RNN tree for large-scale human action recognition. In The IEEE International Conference on Computer Vision (ICCV).
  • Razavian, A. S.; Azizpour, H.; Sullivan, J.; and Carlsson, S. 2014. CNN features off-the-shelf: An astounding baseline for recognition. CoRR abs/1403.6382.
  • Ren, S.; He, K.; Girshick, R. B.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR abs/1506.01497.
  • Springenberg, J. T.; Dosovitskiy, A.; Brox, T.; and Riedmiller, M. A. 2014. Striving for simplicity: The all convolutional net. CoRR abs/1412.6806.
  • Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S. E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2014. Going deeper with convolutions. CoRR abs/1409.4842.
  • Tran, L.; Kong, D.; and Liu, J. 2016. Privacy-CNH: A framework to detect photo privacy with convolutional neural network using hierarchical features. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12–17, 2016, Phoenix, Arizona, USA, 1317–1323.
  • Vanhoucke, V.; Senior, A.; and Mao, M. Z. 2011. Improving the speed of neural networks on CPUs.
  • Wang, J.; Kumar, S.; and Chang, S.-F. 2010. Semi-supervised hashing for scalable image retrieval. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 3424–3431. IEEE.
  • Xianyi, Z.; Qian, W.; and Chothia, Z. 2014. OpenBLAS. URL: http://xianyi.github.io/OpenBLAS.
  • Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? CoRR abs/1411.1792.
  • Yu, F.; Li, W.; Li, Q.; Liu, Y.; Shi, X.; and Yan, J. 2016. POI: Multiple object tracking with high performance detection and appearance feature. In ECCV Workshops.
  • Yu, X.; Liu, T.; Wang, X.; and Tao, D. 2017. On compressing deep models by low rank and sparse decomposition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zhang, J.; Li, Q.; Caselli, R. J.; Ye, J.; and Wang, Y. 2017. Multi-task dictionary learning based convolutional neural network for computer aided diagnosis with longitudinal images. CoRR abs/1709.00042.
  • Zhang, Z.; Song, Y.; and Qi, H. 2017. Age progression/regression by conditional adversarial autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).