
Improving deep neural networks for LVCSR using rectified linear units and dropout.

ICASSP, pp. 8609–8613, 2013

Abstract

Recently, pre-trained deep neural networks (DNNs) have outperformed traditional acoustic models based on Gaussian mixture models (GMMs) on a variety of large vocabulary speech recognition benchmarks. Deep neural nets have also achieved excellent results on various computer vision tasks using a random "dropout" procedure that drastically i…

Introduction
  • Up until a few years ago, most state of the art speech recognition systems were based on hidden Markov models (HMMs) that used mixtures of Gaussians to model the HMM emission distributions.
  • [1] showed that hybrid acoustic models that replaced Gaussian mixture models (GMMs) with pretrained, deep neural networks (DNNs) could drastically improve performance on a small-scale phone recognition task, results that were later extended to a large vocabulary voice search task in [2].
  • Dropout is a technique for avoiding overfitting in neural networks that has been highly effective on non-speech tasks and on the small-scale TIMIT phone recognition task [4], although it can increase training time.
  • Rectified linear units are a natural choice to combine with dropout for LVCSR.
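The combination of ReLUs and dropout can be illustrated with a minimal single-layer forward pass. This is only a sketch: the layer sizes, dropout rate, and test-time scaling convention below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def relu_dropout_forward(x, W, b, drop_prob=0.5, training=True, rng=None):
    """Forward pass through one ReLU layer with dropout.

    During training, each hidden unit is zeroed independently with
    probability drop_prob; at test time all units are kept and the
    activations are scaled by (1 - drop_prob) to match the training-time
    expected activation.
    """
    h = np.maximum(0.0, x @ W + b)               # rectified linear activation
    if training:
        rng = rng or np.random.default_rng()
        mask = rng.random(h.shape) >= drop_prob  # keep each unit with prob 1 - drop_prob
        return h * mask
    return h * (1.0 - drop_prob)                 # expected-value scaling at test time

# Illustrative shapes: a 440-dim spliced input, 1024 hidden units
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 440))
W = rng.standard_normal((440, 1024)) * 0.01
b = np.zeros(1024)
h_train = relu_dropout_forward(x, W, b, training=True, rng=rng)
h_test = relu_dropout_forward(x, W, b, training=False)
```

Because ReLU activations are non-negative and unbounded, rescaling at test time preserves the expected input to the next layer without retraining, which is one reason the two techniques combine naturally.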
Highlights
  • Up until a few years ago, most state of the art speech recognition systems were based on hidden Markov models (HMMs) that used mixtures of Gaussians to model the HMM emission distributions
  • [1] showed that hybrid acoustic models that replaced Gaussian mixture models (GMMs) with pretrained, deep neural networks (DNNs) could drastically improve performance on a small-scale phone recognition task, results that were later extended to a large vocabulary voice search task in [2]
  • We show that the modified deep neural networks (DNNs) using rectified linear units and dropout provide a 4.2% relative error reduction over a standard pretrained DNN and a 14.4% relative improvement over a strong GMM-HMM system
  • In this paper, motivated by successes in computer vision, we have explored using rectified linear units and dropout in deep neural nets for large vocabulary continuous speech recognition for the first time
  • Given the gains from dropout and rectified linear units in frame level training, determining the best way to exploit dropout in full sequence training is an exciting direction for future work
  • Fixing the units that are dropped out during each conjugate gradient run within the Hessian-free optimizer might be sufficient to allow dropout to be combined with Hessian-free optimization, but to date all uses of dropout in the literature have been with first-order optimization algorithms
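The mask-fixing idea in the last bullet can be sketched as follows: sample one dropout mask per layer up front and reuse it for every function evaluation inside an optimizer run, so the objective stays deterministic. The layer sizes and function names here are arbitrary illustrations, not the paper's implementation.

```python
import numpy as np

def sample_masks(layer_sizes, drop_prob, rng):
    """Draw one dropout mask per hidden layer, to be held fixed."""
    return [(rng.random(n) >= drop_prob).astype(float) for n in layer_sizes]

def forward_with_masks(x, weights, masks):
    """Forward pass that reuses pre-sampled masks, so repeated
    evaluations inside one conjugate gradient run see the same
    fixed, deterministic thinned network."""
    h = x
    for W, m in zip(weights, masks):
        h = np.maximum(0.0, h @ W) * m  # ReLU followed by the fixed mask
    return h

rng = np.random.default_rng(0)
weights = [rng.standard_normal((20, 32)) * 0.1,
           rng.standard_normal((32, 32)) * 0.1]
masks = sample_masks([32, 32], drop_prob=0.5, rng=rng)
x = rng.standard_normal((4, 20))

# Two evaluations within the same "CG run" give identical outputs,
# which is what a deterministic second-order optimizer requires.
out1 = forward_with_masks(x, weights, masks)
out2 = forward_with_masks(x, weights, masks)
```

Resampling the masks between runs would then restore the stochastic regularization effect of dropout across the course of training.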
Methods
  • Temporal context is included by splicing 9 successive frames of PLP features into supervectors, projecting to 40 dimensions using linear discriminant analysis (LDA).
  • The feature space is further diagonalized using a global semi-tied covariance (STC) transform.
  • The GMMs are speaker-adaptively trained, with a feature-space maximum likelihood linear transform estimated per speaker in training and testing.
  • Following maximum-likelihood training of the GMMs, feature-space discriminative training and model-space discriminative training are done using the boosted maximum mutual information (BMMI) criterion.
  • At test time, unsupervised adaptation using regression tree MLLR is performed.
  • The GMMs use 2,203 quinphone states and 150K diagonal-covariance Gaussians.
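The front-end splicing and LDA projection described above can be sketched in a few lines. The 13-dim PLP frames and the random stand-in for a trained LDA matrix are assumptions for illustration; a real system would estimate the LDA transform from labeled data.

```python
import numpy as np

def splice_frames(feats, context=4):
    """Splice each frame with its +/-context neighbors (9 frames total
    for context=4), padding at the utterance edges by repeating the
    first/last frame."""
    T, d = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], context, axis=0),
                             feats,
                             np.repeat(feats[-1:], context, axis=0)])
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])

rng = np.random.default_rng(0)
plp = rng.standard_normal((100, 13))   # 100 frames of (assumed) 13-dim PLP features
supervectors = splice_frames(plp)      # (100, 117): 9 frames x 13 dims
lda = rng.standard_normal((117, 40))   # stand-in for a trained LDA transform
projected = supervectors @ lda         # (100, 40), as in the paper's front end
```

The STC and per-speaker fMLLR steps that follow are likewise linear transforms applied to these 40-dim vectors, so the whole front end up to the classifier is a chain of matrix multiplications.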
Results
  • As shown in the table, the single best model uses ReLUs, dropout, and relatively large hidden layers with weights initialized using unsupervised RBM pre-training
  • This model achieved a word error rate of 18.5% on the test set, beating the other deep neural net models and the strong discriminatively trained GMM baseline, even before any full sequence training of the neural net models.
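The relative improvements quoted throughout (4.2%, 14.4%) follow the standard relative error reduction formula, shown here for clarity. The baseline figure in the example is a hypothetical placeholder, not a number from the paper.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative word-error-rate reduction, as a percentage of the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Hypothetical example: a baseline at 20.0% WER improved to 18.5% WER
print(round(relative_reduction(20.0, 18.5), 1))  # 7.5
```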
Conclusions
  • In this paper, motivated by successes in computer vision, the authors have explored using rectified linear units and dropout in deep neural nets for LVCSR for the first time.
  • ReLUs and dropout yielded word error rate improvements relative to a state-of-the-art baseline, and, with the help of Bayesian optimization software, the authors were able to obtain these improvements without much of the hand tuning typically used to obtain the very best deep neural net results.
  • It is a promising alternative that the authors hope to explore.
Tables
  • Table 1: Results without full sequence training (frame-level cross entropy, a.k.a. CE, training). All models used pre-training unless "no PT" is specified, and used 1k, 2k, or 3k hidden units per layer
  • Table 2: Results with full sequence training
Contributions
  • Shows on a 50-hour English Broadcast News task that modified deep neural networks using ReLUs trained with dropout during frame level training provide a 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improvement over a strong GMM/HMM system
  • Explores the behavior of deep neural nets using ReLUs and dropout on a 50-hour broadcast news task, focusing the experiments on using dropout during the frame level training phase
  • Shows that the modified deep neural networks using ReLUs and dropout provide a 4.2% relative error reduction over a standard pretrained DNN and a 14.4% relative improvement over a strong GMM-HMM system
  • Describes the dropout method, while rectified linear units are presented in Section 3
  • Found it useful to only do 2.5 epochs of pre-training for each layer, compared to more than twice that for sigmoid units
References
  • Abdel-rahman Mohamed, George E. Dahl, and Geoffrey E. Hinton, "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, Jan. 2012.
  • George E. Dahl, Dong Yu, Li Deng, and Alex Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.
  • Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury, "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, 2012.
  • Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," CoRR, vol. abs/1207.0580, 2012.
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Neural Information Processing Systems, 2012.
  • Brian Kingsbury, Tara N. Sainath, and Hagen Soltau, "Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization," in Proc. Interspeech, 2012.
  • Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. ICML, 2008, pp. 1096–1103.
  • Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun, "What is the best multi-stage architecture for object recognition?," in Proc. ICCV, 2009, pp. 2146–2153.
  • Vinod Nair and Geoffrey E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. ICML, Haifa, Israel, June 2010, pp. 807–814.
  • Navdeep Jaitly and Geoffrey E. Hinton, "Learning a better representation of speech soundwaves using restricted Boltzmann machines," in Proc. ICASSP, 2011, pp. 5884–5887.
  • Hagen Soltau, George Saon, and Brian Kingsbury, "The IBM Attila speech recognition toolkit," in Proc. SLT, 2010.
  • Tara N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran, Petr Fousek, Petr Novak, and Abdel-rahman Mohamed, "Making deep belief networks effective for large vocabulary continuous speech recognition," in Proc. ASRU, 2011.
  • Jasper Snoek, Hugo Larochelle, and Ryan Prescott Adams, "Practical Bayesian optimization of machine learning algorithms," in Neural Information Processing Systems, 2012.
  • Tijmen Tieleman, "Gnumpy: an easy way to use GPU boards in Python," Tech. Rep. UTML TR 2010-002, Department of Computer Science, University of Toronto, 2010.
  • Volodymyr Mnih, "CUDAMat: a CUDA-based matrix class for Python," Tech. Rep. UTML TR 2009-004, Department of Computer Science, University of Toronto, Nov. 2009.
  • Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran, "Auto-encoder bottleneck features using deep belief networks," in Proc. ICASSP, 2012.