Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers

ICASSP, pp. 7304–7308, 2013

Cited by: 596 | Views: 432

Summary: In this paper we propose a shared-hidden-layer multilingual deep neural network, in which the hidden layers are made common across many languages while the softmax layers are made language dependent.

Abstract

In the deep neural network (DNN), the hidden layers can be considered as increasingly complex feature transformations and the final softmax layer as a log-linear classifier making use of the most abstract features computed in the hidden layers. While the log-linear classifier should be different for different languages, the feature transformations can be shared across languages. In this paper we propose a shared-hidden-layer multilingual DNN (SHL-MDNN), in which the hidden layers are made common across many languages while the softmax layers are made language dependent. Further, we show that the learned hidden layers sharing across languages can be transferred to improve the recognition accuracy of new languages, with relative error reductions ranging from 6% to 28% against DNNs trained without exploiting the transferred hidden layers.
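In symbols, the architecture the abstract describes can be written as follows. This is a reconstruction from the description above, with notation chosen here rather than taken from the paper: shared weights W_1, …, W_K and biases b_1, …, b_K compute a language-independent representation h(x) of the acoustic input x, and each language L applies its own log-linear (softmax) layer (W_L, b_L) over its senones s:

```latex
h(x) = \sigma\!\left(W_K \cdots \, \sigma\!\left(W_1 x + b_1\right) \cdots + b_K\right),
\qquad
P(s \mid x, L) = \operatorname{softmax}\!\left(W_L\, h(x) + b_L\right)_s
```

Only (W_L, b_L) differ across languages; everything inside h(·) is shared.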

Introduction
  • The context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) have outperformed the discriminatively trained conventional Gaussian mixture model (GMM) HMMs in many large vocabulary speech recognition (LVSR) tasks [1]-[11].
  • The DNN can be considered as a model that learns a complicated feature transformation and a log-linear classifier jointly [4].
  • In this paper the authors propose a shared-hidden-layer multilingual DNN (SHL-MDNN), in which the hidden layers are shared across many languages while the softmax layers are language dependent.
  • The shared hidden layers (SHLs) and the separate softmax layers are jointly optimized using a multilingual training set.
  • The authors can consider the SHLs as a universal feature transformation that works well for many languages (a minimal sketch of this architecture follows the list below).
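A minimal sketch of the SHL-MDNN idea, assuming a PyTorch-style implementation; the class name, layer sizes, sigmoid activations, and per-language senone counts are illustrative assumptions, not details from the paper:

```python
import torch
import torch.nn as nn


class SHLMDNN(nn.Module):
    """Sketch of a shared-hidden-layer multilingual DNN: a stack of hidden
    layers shared by all languages, plus one language-dependent softmax
    (output) layer per language. All sizes here are illustrative."""

    def __init__(self, input_dim, hidden_dim, num_hidden, senones_per_lang):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(num_hidden):
            layers += [nn.Linear(dim, hidden_dim), nn.Sigmoid()]
            dim = hidden_dim
        # shared across all languages: the "universal feature transformation"
        self.shared_hidden = nn.Sequential(*layers)
        # one output layer per language, mapping to that language's senones
        self.softmax_heads = nn.ModuleDict({
            lang: nn.Linear(hidden_dim, n) for lang, n in senones_per_lang.items()
        })

    def forward(self, x, lang):
        return self.softmax_heads[lang](self.shared_hidden(x))


# Joint multilingual training: every minibatch comes from one language and
# updates the shared hidden layers plus only that language's softmax head.
model = SHLMDNN(input_dim=429, hidden_dim=2048, num_hidden=5,
                senones_per_lang={"FRA": 1800, "DEU": 1800, "ESP": 1800, "ITA": 1800})
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for feats, senones, lang in []:  # stand-in for a multilingual minibatch iterator
    loss = nn.functional.cross_entropy(model(feats, lang), senones)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because each minibatch updates the shared stack and exactly one head, the shared layers are optimized jointly on all languages, which is the joint training strategy described above.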
Highlights
  • The context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) have outperformed the discriminatively trained conventional Gaussian mixture model (GMM) HMMs in many large vocabulary speech recognition (LVSR) tasks [1]-[11]
  • In this paper we propose a shared-hidden-layer multilingual DNN (SHL-MDNN), in which the hidden layers are shared across many languages while the softmax layers are language dependent
  • We proposed a shared-hidden-layer multilingual DNN architecture in which the hidden layers are shared across multiple languages and serve as a universal feature transformation
  • We demonstrated that the hidden layers of the SHL-MDNN can be effectively transferred to, and benefit, other languages, even when large volumes of training data are available for the target language or the target language is phonetically far from the source languages used to train the SHL-MDNN
  • This suggests that a high-performance CD-DNN-HMM system for a new language can be built quickly from an existing multilingual DNN: doing so requires only a small amount of training data from the target language (more data further improves performance), completely eliminates the unsupervised pre-training stage, and trains the DNN in far fewer epochs (the transfer recipe is sketched after this list)
  • The baseline DNN is trained solely using the 9-hr ENU training set. With this approach we only achieved a word error rate (WER) of 30.9% on the ENU test set
  • Our work indicates the possibility to build a universal ASR system efficiently under the CD-DNN-HMM framework
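The transfer recipe from the highlights above can be sketched as follows, reusing the hypothetical SHLMDNN class from the earlier snippet: keep the trained shared hidden layers, attach a fresh softmax layer for the target language, and either freeze the shared layers or fine-tune everything on the target-language data.

```python
import torch.nn as nn


def transfer_to_new_language(shl_mdnn, lang, num_senones,
                             hidden_dim=2048, freeze_shared=True):
    """Attach a fresh softmax layer for a new target language to a trained
    SHL-MDNN (hypothetical class from the earlier sketch). With
    freeze_shared=True only the new softmax layer is trained, mirroring the
    'fix the hidden layers, train the softmax layer' setting; with False,
    all layers are subsequently fine-tuned on target-language data."""
    shl_mdnn.softmax_heads[lang] = nn.Linear(hidden_dim, num_senones)
    for p in shl_mdnn.shared_hidden.parameters():
        p.requires_grad = not freeze_shared
    return shl_mdnn


# e.g., build an ENU recognizer from the multilingual model above using only
# the 9-hour ENU set, training just the new softmax layer first:
# model = transfer_to_new_language(model, "ENU", num_senones=1800)
```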
Results
  • The authors show that the learned hidden layers sharing across languages can be transferred to improve recognition accuracy of new languages, with relative error reductions ranging from 6% to 28% against DNNs trained without exploiting the transferred hidden layers.
  • When 36 hours of ENU speech data are available, the authors obtained an additional absolute 0.8% WER reduction (22.4% → 21.6%) by adapting all layers.
  • Using only 36 hours of CHN data the authors can achieve 28.4% CER on the test set by transferring the SHLs from the SHL-MDNN
  • This is better than the 29.0% CER obtained with the baseline DNN trained on the 139 hours of CHN training data, saving over 100 hours of CHN transcription effort (the relative-reduction arithmetic used throughout is shown below)
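For clarity, the results above mix absolute WER/CER differences with relative error reductions; the relative figures (e.g., the 6% to 28% range) follow the usual definition, as in this small helper:

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent (applies to WER or CER alike)."""
    return 100.0 * (baseline - improved) / baseline

# Transferring SHLs and fine-tuning on 9 hours of ENU: 30.9% -> 25.3% WER,
# i.e. about an 18.1% relative reduction, within the reported 6-28% range.
print(f"{relative_reduction(30.9, 25.3):.1f}%")
```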
Conclusion
  • The authors proposed a shared-hidden-layer multilingual DNN architecture in which the hidden layers are shared across multiple languages and serve as a universal feature transformation.
  • This suggests the possibility of quickly building a high-performance CD-DNN-HMM system for a new language from an existing multilingual DNN.
  • The authors' work indicates the possibility of building a universal ASR system efficiently under the CD-DNN-HMM framework.
  • Such a system can recognize many languages, improve the accuracy for each individual language, and be expanded to support new languages simply by stacking additional softmax layers for them.
Tables
  • Table 1: Comparison of monolingual DNNs and the shared-hidden-layer multilingual DNN in WER (%)
  • Table 2: ENU WER with and without hidden layers (HLs) transferred from the FRA DNN
  • Table 3: Effect of the target-language training set size on WER (%) when SHLs are transferred from the SHL-MDNN
  • Table 4: Effectiveness of cross-lingual model transfer on CHN
  • Table 5: Features learned from multilingual data with and without using label information, evaluated on ENU data
Detailed Results
  • As shown in Table 2, the baseline DNN trained solely on the 9-hr ENU training set achieves a WER of 30.9% on the ENU test set.
  • Fixing the transferred FRA hidden layers and training only the ENU-specific softmax layer on the 9-hr ENU data yields an absolute 2.6% WER reduction (30.9% → 27.3%) over the baseline DNN.
  • Retraining the whole FRA DNN on the 9-hr ENU data gives a WER of 30.6%, only slightly better than the 30.9% baseline.
  • Further tuning all layers after training the ENU-specific softmax layer yields an additional absolute 2.0% WER reduction (27.3% → 25.3%).
  • By sharing the hidden layers in the SHL-MDNN and using the joint training strategy, recognition accuracy improves for all the languages decodable by the SHL-MDNN over monolingual DNNs trained using data from individual languages only.
References
  • [1] G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large vocabulary speech recognition,” IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
  • [2] D. Yu, L. Deng, and G. Dahl, “Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition,” in Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
  • [3] F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Proc. Interspeech, pp. 437–440, 2011.
  • [4] F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcription,” in Proc. ASRU, pp. 24–29, 2011.
  • [5] A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
  • [6] N. Jaitly, P. Nguyen, and V. Vanhoucke, “Application of pretrained deep neural networks to large vocabulary speech recognition,” in Proc. Interspeech, 2012.
  • [7] T. N. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novak, and A. Mohamed, “Making deep belief networks effective for large vocabulary continuous speech recognition,” in Proc. ASRU, pp. 30–35, 2011.
  • [8] B. Kingsbury, T. N. Sainath, and H. Soltau, “Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization,” in Proc. Interspeech, 2012.
  • [9] H. Su, G. Li, D. Yu, and F. Seide, “Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription,” in Proc. ICASSP, 2013.
  • [10] M. Seltzer, D. Yu, and Y. Wang, “An investigation of deep neural networks for noise robust speech recognition,” in Proc. ICASSP, 2013.
  • [11] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, 2012.
  • [12] R. Caruana, “Multitask learning,” Machine Learning, vol. 28, pp. 41–75, 1997.
  • [13] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng, “Building high-level features using large scale unsupervised learning,” in Proc. ICML, 2012.
  • [14] P. Swietojanski, A. Ghoshal, and S. Renals, “Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR,” in Proc. SLT, 2012.
  • [15] T. Schultz and A. Waibel, “Language-independent and language-adaptive acoustic modeling for speech recognition,” Speech Communication, vol. 35, no. 1–2, pp. 31–51, 2001.
  • [16] H. Lin, L. Deng, D. Yu, Y. Gong, A. Acero, and C.-H. Lee, “A study on multilingual acoustic modeling for large vocabulary ASR,” in Proc. ICASSP, pp. 4333–4336, 2009.
  • [17] T. Niesler, “Language-dependent state clustering for multilingual acoustic modeling,” Speech Communication, vol. 49, 2007.
  • [18] D. Yu, L. Deng, P. Liu, J. Wu, Y. Gong, and A. Acero, “Cross-lingual speech recognition under runtime resource constraints,” in Proc. ICASSP, pp. 4193–4196, 2009.
  • [19] L. Burget et al., “Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models,” in Proc. ICASSP, 2010.
  • [20] A. Stolcke, F. Grézl, M.-Y. Hwang, X. Lei, N. Morgan, and D. Vergyri, “Cross-domain and cross-lingual portability of acoustic features estimated by multilayer perceptrons,” in Proc. ICASSP, 2006.
  • [21] S. Thomas, S. Ganapathy, and H. Hermansky, “Cross-lingual and multi-stream posterior features for low resource LVCSR systems,” in Proc. Interspeech, 2010.
  • [22] C. Plahl, R. Schlüter, and H. Ney, “Cross-lingual portability of Chinese and English neural network features for French and German LVCSR,” in Proc. ASRU, 2011.
  • [23] N. Vu, W. Breiter, F. Metze, and T. Schultz, “An investigation on initialization schemes for multilayer perceptron training using multilingual data and their effect on ASR performance,” in Proc. Interspeech, 2012.
  • [24] S. Thomas, S. Ganapathy, and H. Hermansky, “Multilingual MLP features for low-resource LVCSR systems,” in Proc. ICASSP, 2012.
  • [25] R. Collobert and J. Weston, “A unified architecture for natural language processing: deep neural networks with multitask learning,” in Proc. ICML, 2008.