Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention

INTERSPEECH, pp. 17-21, 2016.

Abstract:

In this paper, we propose a deep convolutional neural network (CNN) with layer-wise context expansion and location-based attention, for large vocabulary speech recognition. In our model each higher layer uses information from broader contexts, along both the time and frequency dimensions, than its immediate lower layer. We show that both ...
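
As a rough illustration of the location-based attention mentioned in the abstract, the sketch below weights the frames of a fixed context window with a softmax over one learned score per position. This is a minimal PyTorch sketch and not the paper's exact formulation; the window size, feature dimension, and class name are assumptions for illustration only.

    # Minimal sketch of location-based attention over a fixed context window.
    # One learnable score per window position; frames are combined with the
    # softmax of those scores. Illustrative only, not the paper's exact model.
    import torch
    import torch.nn as nn

    class LocationAttention(nn.Module):
        def __init__(self, window=5):
            super().__init__()
            self.scores = nn.Parameter(torch.zeros(window))  # one score per position

        def forward(self, frames):                  # frames: (batch, window, feat_dim)
            weights = torch.softmax(self.scores, dim=0)      # (window,)
            return (weights[None, :, None] * frames).sum(dim=1)

    att = LocationAttention(window=5)
    context = torch.randn(2, 5, 40)                 # 2 utterances, 5 frames, 40 features
    print(att(context).shape)                       # torch.Size([2, 40])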

Introduction
  • Since 2010, when deep neural networks (DNNs) were successfully applied to large vocabulary speech recognition (LVSR) tasks [1, 2, 3] and delivered significant accuracy improvements over the prior state of the art, various deep learning models have been developed to further improve the performance of speech recognition systems.
  • The majority of these new models are variations and/or combinations of recurrent neural networks (RNNs) and convolutional neural networks (CNNs) [4].
  • A pooling layer is often considered important for tolerating translational variance.
Highlights
  • Since 2010, when deep neural networks (DNNs) were successfully applied to large vocabulary speech recognition (LVSR) tasks [1, 2, 3] and delivered significant accuracy improvements over the prior state of the art, various deep learning models have been developed to further improve the performance of speech recognition systems.
  • We propose a deep convolutional neural network that operates along both the frequency and time dimensions.
  • We propose the LAyer-wise Context Expansion and Attention (LACEA) model, shown in Figure 1, after noticing that the length of useful context is limited for phoneme-state recognition (a minimal illustrative sketch follows this list).
  • We propose a novel deep convolutional neural network for large vocabulary speech recognition.
  • No pooling is used in our model, since pooling can instead be implemented with a convolution operation.
  • Our work indicates that the full potential of convolutional neural networks has yet to be fully explored.
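
The following is a minimal sketch of the layer-wise context expansion idea, assuming a stack of small time-frequency convolutions in which a stride of 2 along time roughly doubles the context covered by each successive layer. The layer count, channel width, and input shape are illustrative assumptions, not the paper's configuration.

    # Sketch: each successive strided convolution sees a broader time context.
    # Illustrative configuration only; not the paper's exact architecture.
    import torch
    import torch.nn as nn

    class ContextExpandingStack(nn.Module):
        def __init__(self, channels=32, num_layers=4):
            super().__init__()
            layers, in_ch = [], 1                   # one input plane, e.g. log-mel features
            for _ in range(num_layers):
                # 3x3 kernel, stride 2 along time: the receptive field (context)
                # of each layer grows relative to the layer below it.
                layers += [nn.Conv2d(in_ch, channels, kernel_size=3,
                                     stride=(2, 1), padding=1),
                           nn.ReLU()]
                in_ch = channels
            self.stack = nn.Sequential(*layers)

        def forward(self, x):                       # x: (batch, 1, time, freq)
            return self.stack(x)

    feats = torch.randn(1, 1, 64, 40)               # 64 frames, 40 frequency bands
    print(ContextExpandingStack()(feats).shape)     # torch.Size([1, 32, 4, 40])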
Methods
  • The authors' models were built using the Computational Network Toolkit (CNTK) [31].
  • The experiments were carried out on a GPU cluster optimized for CNTK for rapid, no-hassle deep learning model training and evaluation.
  • Each GPU machine in the cluster contains four K40 GPU cards.
  • To speed up the experiments, the authors exploited the 1-bit quantized SGD algorithm [32] built into CNTK and ran all experiments on eight GPUs across two machines (a minimal sketch of the quantization idea follows this list).
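
Below is a minimal numpy sketch of the 1-bit gradient quantization with error feedback idea from [32]: each gradient element is reduced to its sign, and the quantization error is carried into the next minibatch. The reconstruction values used here (per-call means of the positive and negative entries) are a simplification, not CNTK's exact scheme.

    # Sketch of 1-bit SGD quantization with error feedback (simplified).
    import numpy as np

    def one_bit_quantize(grad, residual):
        """Quantize a gradient to one bit per element; the quantization error
        is returned so it can be added to the next minibatch's gradient."""
        g = grad + residual                         # fold in error from the last step
        positive = g >= 0                           # the single transmitted bit
        pos_val = g[positive].mean() if positive.any() else 0.0
        neg_val = g[~positive].mean() if (~positive).any() else 0.0
        quantized = np.where(positive, pos_val, neg_val)
        return quantized, g - quantized             # (decoded gradient, new residual)

    rng = np.random.default_rng(0)
    residual = np.zeros((4, 4))
    for _ in range(3):                              # residual accumulates across steps
        grad = rng.normal(size=(4, 4))
        update, residual = one_bit_quantize(grad, residual)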
Conclusion
  • The authors employed layer-wise context expansion and attention for more powerful modeling, and jump connections for better convergence (a minimal sketch of a jump-connection block follows this list).
  • No pooling is used in the model, since pooling can instead be implemented with a convolution operation.
  • The authors' work indicates that the full potential of CNNs has yet to be fully explored.
  • This is only an initial attempt at building very deep CNNs; there are many dimensions to explore.
  • By combining the model with existing techniques such as sequence discriminative training, further improvements should be achievable.
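
The sketch below shows a convolutional block with a jump (skip) connection, where downsampling is done by a strided convolution instead of a pooling layer, in line with the bullets above. The channel counts, kernel sizes, and the 1x1 shortcut are illustrative assumptions, not the authors' exact block.

    # Sketch: strided convolution replaces pooling; a jump connection adds a
    # shortcut from the block input to its output to ease optimization.
    import torch
    import torch.nn as nn

    class JumpBlock(nn.Module):
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.down = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                                  stride=(2, 1), padding=1)   # plays the pooling role
            self.conv = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
            self.jump = nn.Conv2d(in_ch, out_ch, kernel_size=1,
                                  stride=(2, 1))              # match the main path's shape
            self.relu = nn.ReLU()

        def forward(self, x):
            y = self.relu(self.down(x))
            y = self.conv(y)
            return self.relu(y + self.jump(x))                # the jump connection

    block = JumpBlock(1, 32)
    print(block(torch.randn(1, 1, 64, 40)).shape)             # torch.Size([1, 32, 32, 40])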
Tables
  • Table 1: Word error rate on the SWB task
  • Table 2: Word error rate on the SMD task
Related work
  • Various CNN structures [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24] have been evaluated for speech recognition. The ones most similar to our work are [20, 24, 26], which employ the famous very deep VGG network structure. These works, however, do not include the attention mechanism, the jump connections, or the weighted sum at the top layer.

    The layer-wise context expansion has been studied in [27, 21, 22] under different names and setups. However, as pointed out by Amodei et al. [21], the basic block of the computation can be implemented as a convolution operation and is in fact called "row convolution" in their work (sketched below). Note that while in [21] ...
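
For reference, here is a minimal sketch of the "row convolution" idea described in [21]: each feature dimension is filtered independently over a small window of current and future frames with its own learned weights. The lookahead size and tensor shapes below are illustrative assumptions, not the exact formulation in [21].

    # Sketch of row convolution (per-feature filtering over future frames).
    import torch

    def row_convolution(x, weights):
        """x: (time, feat_dim); weights: (lookahead, feat_dim).
        Output frame t mixes frames t .. t+lookahead-1, per feature dimension."""
        time, feat_dim = x.shape
        lookahead = weights.shape[0]
        # pad the end so every frame has a full window of future context
        padded = torch.cat([x, x.new_zeros(lookahead - 1, feat_dim)], dim=0)
        out = torch.zeros_like(x)
        for j in range(lookahead):
            out += weights[j] * padded[j:j + time]
        return out

    x = torch.randn(100, 40)                        # 100 frames, 40 features
    w = torch.randn(3, 40)                          # current frame + 2 future frames
    print(row_convolution(x, w).shape)              # torch.Size([100, 40])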
Reference
  • [1] D. Yu, L. Deng, and G. E. Dahl, “Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition.”
  • [2] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
  • [3] F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc. Annual Conference of International Speech Communication Association (INTERSPEECH), 2011, pp. 437–440.
  • [4] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, p. 1995, 1995.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
  • [6] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 4277–4280.
  • [7] O. Abdel-Hamid, L. Deng, and D. Yu, “Exploring convolutional neural network structures and optimization techniques for speech recognition,” in Proc. Annual Conference of International Speech Communication Association (INTERSPEECH), 2013, pp. 3366–3370.
  • [8] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural networks for LVCSR,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 8614–8618.
  • [9] P. Swietojanski, A. Ghoshal, and S. Renals, “Convolutional neural networks for distant speech recognition,” IEEE Signal Processing Letters, vol. 21, no. 9, pp. 1120–1124, 2014.
  • [10] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
  • [11] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, M. Bacchiani, and A. Senior, “Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms,” in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015.
  • [12] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4580–4584.
  • [13] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, “Learning the speech front-end with raw waveform CLDNNs,” in Proc. Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
  • [14] T. Yoshioka, S. Karita, and T. Nakatani, “Far-field speech recognition using CNN-DNN-HMM with convolution in time,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4360–4364.
  • [15] J.-T. Huang, J. Li, and Y. Gong, “An analysis of convolutional neural networks for speech recognition,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4989–4993.
  • [16] W. Chan and I. Lane, “Deep convolutional neural networks for acoustic modeling in low resource languages,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 2056–2060.
  • [17] L. Toth, “Modeling long temporal contexts in convolutional neural network-based phone recognition,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4575–4579.
  • [18] P. Golik, Z. Tuske, R. Schluter, and H. Ney, “Convolutional neural networks for acoustic modeling of raw time signal in LVCSR,” in Proc. Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
  • [19] D. Palaz, R. Collobert et al., “Analysis of CNN-based speech recognition system using raw speech as input,” in Proc. Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
  • [20] M. Bi, Y. Qian, and K. Yu, “Very deep convolutional neural networks for LVCSR,” in Proc. Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
  • [21] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos et al., “Deep Speech 2: End-to-end speech recognition in English and Mandarin,” arXiv preprint arXiv:1512.02595, 2015.
  • [22] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Proc. Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
  • [23] T. Zhao, Y. Zhao, and X. Chen, “Time-frequency kernel-based CNN for speech recognition,” in Proc. Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
  • [24] T. Sercu, C. Puhrsch, B. Kingsbury, and Y. LeCun, “Very deep multilingual convolutional neural networks for LVCSR,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
  • [25] P. Ghahremani, J. Droppo, and M. L. Seltzer, “Linearly augmented deep neural network,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
  • [26] V. Mitra and H. Franco, “Time-frequency convolutional networks for robust speech recognition,” in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 317–323.
  • [27] S. Zhang, C. Liu, H. Jiang, S. Wei, L. Dai, and Y. Hu, “Feedforward sequential memory networks: A new structure to learn long-term dependency,” arXiv preprint arXiv:1512.08301, 2015.
  • [28] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass, “Highway long short-term memory RNNs for distant speech recognition,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
  • [29] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [30] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Proc. Neural Information Processing Systems (NIPS), 2015, pp. 577–585.
  • [31] A. Agarwal, E. Akchurin, C. Basoglu, G. Chen, S. Cyphers, J. Droppo, A. Eversole, B. Guenter, M. Hillebrand, R. Hoens, X. Huang, Z. Huang, V. Ivanov, A. Kamenev, P. Kranen, O. Kuchaiev, W. Manousek, A. May, B. Mitra, O. Nano, G. Navarro, A. Orlov, M. Padmilac, H. Parthasarathi, B. Peng, A. Reznichenko, F. Seide, M. L. Seltzer, M. Slaney, A. Stolcke, H. Wang, Y. Wang, K. Yao, D. Yu, Y. Zhang, and G. Zweig, “An introduction to computational networks and the Computational Network Toolkit,” Microsoft Technical Report MSR-TR-2014-112, 2014.
  • [32] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Proc. Annual Conference of International Speech Communication Association (INTERSPEECH), 2014, pp. 1058–1062.
  • [33] D. Povey, V. Peddinti, D. Galvez, P. Ghahrmani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” submitted to INTERSPEECH, 2016.