Improving deep neural networks for LVCSR using rectified linear units and dropout.
ICASSP, pp. 8609–8613, 2013.
Recently, pre-trained deep neural networks (DNNs) have outperformed traditional acoustic models based on Gaussian mixture models (GMMs) on a variety of large vocabulary speech recognition benchmarks. Deep neural nets have also achieved excellent results on various computer vision tasks using a random "dropout" procedure that drastically improves generalization.
- Up until a few years ago, most state of the art speech recognition systems were based on hidden Markov models (HMMs) that used mixtures of Gaussians to model the HMM emission distributions.
- Prior work showed that hybrid acoustic models that replaced Gaussian mixture models (GMMs) with pretrained deep neural networks (DNNs) could drastically improve performance on a small-scale phone recognition task, results that were later extended to a large vocabulary voice search task.
- Dropout is a technique for avoiding overfitting in neural networks that has been highly effective on non-speech tasks and the small-scale TIMIT phone recognition task, although it can increase training time.
- Rectified linear units are a natural choice to combine with dropout for LVCSR.
- We show that the modified deep neural networks (DNNs) using rectified linear units and dropout provide a 4.2% relative error reduction over a standard pretrained DNN and a 14.4% relative improvement over a strong GMM-HMM system.
- In this paper, motivated by successes in computer vision, we have explored using rectified linear units and dropout in deep neural nets for large vocabulary continuous speech recognition for the first time.
- Given the gains from dropout and rectified linear units in frame level training, determining the best way to exploit dropout in full sequence training is an exciting direction for future work.
- Fixing the units that are dropped out during each conjugate gradient run within the Hessian-free optimizer might be sufficient to allow dropout to be combined with it, but to date all uses of dropout in the literature have been with first-order optimization algorithms.
- Temporal context is included by splicing 9 successive frames of PLP features into supervectors, projecting to 40 dimensions using linear discriminant analysis (LDA).
- The feature space is further diagonalized using a global semi-tied covariance (STC) transform.
- The GMMs are speaker-adaptively trained, with a feature-space maximum likelihood linear transform estimated per speaker in training and testing.
- Following maximum-likelihood training of the GMMs, feature-space discriminative training and model-space discriminative training are done using the boosted maximum mutual information (BMMI) criterion.
- At test time, unsupervised adaptation using regression tree MLLR is performed.
- The GMMs use 2,203 quinphone states and 150K diagonal-covariance Gaussians.
- As shown in the table, the single best model uses ReLUs, dropout, and relatively large hidden layers with weights initialized using unsupervised RBM pre-training.
- This model achieved a word error rate of 18.5% on the test set, beating the other deep neural net models and the strong discriminatively trained GMM baseline, even before any full sequence training of the neural net models.
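The ReLU-plus-dropout recipe the bullets above describe can be sketched in a few lines of NumPy. This is an illustrative sketch only: the layer size, weight initialization, and 0.5 keep probability below are assumptions for the example, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Rectified linear unit: element-wise max(0, x).
    return np.maximum(0.0, x)

def dropout_forward(h, p_keep, rng, train=True):
    # "Inverted" dropout: randomly zero units during training and
    # rescale the survivors by 1/p_keep, so the expected activation
    # is unchanged and test time needs no rescaling.
    if not train:
        return h
    mask = rng.random(h.shape) < p_keep
    return h * mask / p_keep

# Hypothetical sizes: a 40-dim input frame (e.g. after LDA) feeding
# one small hidden layer; the paper's networks are far larger.
W = rng.standard_normal((40, 128)) * 0.01
b = np.zeros(128)
x = rng.standard_normal(40)

h = relu(x @ W + b)                              # hidden activations
h_drop = dropout_forward(h, p_keep=0.5, rng=rng) # training-time output
```

At test time the same layer is run with `train=False`, which is the whole point of the inverted-dropout scaling above.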
- CONCLUSIONS AND FUTURE WORK
In this paper, motivated by successes in computer vision, the authors have explored using rectified linear units and dropout in deep neural nets for LVCSR for the first time.
- ReLUs and dropout yielded word error rate improvements relative to a state of the art baseline and, with the help of Bayesian optimization software, the authors were able to obtain these improvements without much of the hand tuning typically used to obtain the very best deep neural net results.
- It is a promising alternative that the authors hope to explore.
- Table 1: Results without full sequence training (with cross entropy, a.k.a. CE). All models used pre-training unless "no PT" is specified, and used 1k, 2k, or 3k hidden units per layer.
- Table 2: Results with full sequence training.
- Shows, on a 50-hour English Broadcast News task, that modified deep neural networks using ReLUs trained with dropout during frame level training provide a 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improvement over a strong GMM/HMM system.
- Explores the behavior of deep neural nets using ReLUs and dropout on a 50-hour broadcast news task, focusing the experiments on using dropout during the frame level training phase.
- Shows that the modified deep neural networks using ReLUs and dropout provide a 4.2% relative error reduction over a standard pretrained DNN and a 14.4% relative improvement over a strong GMM-HMM system.
- Describes the dropout method, while rectified linear units are presented in Section 3.
- Found it useful to do only 2.5 epochs of pre-training for each layer, compared to more than twice that for sigmoid units.
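As a quick sanity check on the numbers in these highlights, the reported relative improvements unpack with simple arithmetic. The implied sigmoid-DNN baseline WER below is derived from the summary's figures, not stated in it directly.

```python
def relative_reduction(baseline_wer, new_wer):
    # Relative error reduction: (baseline - new) / baseline.
    return (baseline_wer - new_wer) / baseline_wer

# The summary reports an 18.5% test-set WER for the ReLU + dropout
# DNN and a 4.2% relative reduction over the sigmoid-DNN baseline,
# so the implied baseline WER is roughly 18.5 / (1 - 0.042).
implied_baseline = 18.5 / (1 - 0.042)   # ~19.3% WER

print(round(relative_reduction(implied_baseline, 18.5), 3))  # → 0.042
```

The same helper applied to the GMM-HMM comparison (a 14.4% relative improvement) would imply a baseline in the low 21% range, consistent with a strong discriminatively trained GMM system.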
- Abdel-rahman Mohamed, George E. Dahl, and Geoffrey E. Hinton, “Acoustic modeling using deep belief networks,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 1, pp. 14–22, Jan. 2012.
- George E. Dahl, Dong Yu, Li Deng, and Alex Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 1, pp. 30–42, Jan. 2012.
- Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” Signal Processing Magazine, 2012.
- Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” The Computing Research Repository (CoRR), vol. abs/1207.0580, 2012.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Neural Information Processing Systems, 2012.
- Brian Kingsbury, Tara N. Sainath, and Hagen Soltau, “Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization,” in Proc. Interspeech, 2012.
- Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08), William W. Cohen, Andrew McCallum, and Sam T. Roweis, Eds. 2008, pp. 1096–1103, ACM.
- Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun, “What is the best multi-stage architecture for object recognition?,” in Proc. International Conference on Computer Vision (ICCV’09). 2009, pp. 2146–2153, IEEE.
- Vinod Nair and Geoffrey E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), Johannes Fürnkranz and Thorsten Joachims, Eds., Haifa, Israel, June 2010, pp. 807–814, Omnipress.
- Navdeep Jaitly and Geoffrey E. Hinton, “Learning a better representation of speech soundwaves using restricted Boltzmann machines,” in ICASSP, 2011, pp. 5884–5887.
- Hagen Soltau, George Saon, and Brian Kingsbury, “The IBM Attila Speech Recognition Toolkit,” in Proc. SLT, 2010.
- Tara N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran, Petr Fousek, Petr Novak, and Abdel-rahman Mohamed, “Making Deep Belief Networks Effective for Large Vocabulary Continuous Speech Recognition,” in Proc. ASRU, 2011.
- Jasper Snoek, Hugo Larochelle, and Ryan Prescott Adams, “Practical Bayesian optimization of machine learning algorithms,” in Neural Information Processing Systems, 2012.
- Tijmen Tieleman, “Gnumpy: an easy way to use GPU boards in Python,” Tech. Rep. UTML TR 2010-002, University of Toronto, Department of Computer Science, 2010.
- Volodymyr Mnih, “Cudamat: a CUDA-based matrix class for python,” Tech. Rep. UTML TR 2009-004, Department of Computer Science, University of Toronto, November 2009.
- Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran, “Auto-Encoder Bottleneck Features Using Deep Belief Networks,” in Proc. ICASSP, 2012.