LibriSpeech: An ASR Corpus Based on Public Domain Audio Books
2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5206-5210, 2015
This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models.
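Since the abstract notes that the corpus is freely available for download, here is a minimal fetch-and-unpack sketch. The openslr.org resource number, the "dev-clean" subset name, and the .tar.gz archive naming are assumptions about the public release and are not stated in this summary.

```python
# Minimal sketch of fetching one LibriSpeech subset. The hosting URL, resource
# number (12), and archive name are assumptions, not details from this summary.
import tarfile
import urllib.request
from pathlib import Path

BASE_URL = "https://www.openslr.org/resources/12"  # assumed hosting location
SUBSET = "dev-clean"                                # assumed smallest subset, for a quick check
DEST = Path("librispeech_download")

def fetch_subset(subset: str = SUBSET, dest: Path = DEST) -> Path:
    """Download and unpack one LibriSpeech subset archive."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"{subset}.tar.gz"
    if not archive.exists():
        urllib.request.urlretrieve(f"{BASE_URL}/{subset}.tar.gz", archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)  # unpacks into LibriSpeech/<subset>/...
    return dest / "LibriSpeech" / subset

if __name__ == "__main__":
    print("Extracted to:", fetch_subset())
```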
- The rapid increase in the amount of multimedia content on the Internet in recent years makes it feasible to automatically collect data for the purpose of training statistical models.
- This is especially true when the source data is already organized into well-curated, machine-readable collections.
- The volunteer-supported speech-gathering effort VoxForge, on which the acoustic models we used for alignment were trained, contains a certain amount of LibriVox audio, but that dataset is much smaller than the one we present here, with around 100 hours of English speech, and it suffers from major gender and per-speaker duration imbalances.
- While the use of audiobooks for building synthetic voices [1, 2] has previously been investigated, we are not aware of any freely available read speech corpus in English that is suitable for training and testing speech recognition systems, and which is as large-scale as the one we present here.
- Finally, in Section 5 we present experimental results on models trained on this dataset, using both the LibriSpeech dev and test sets and the Wall Street Journal (WSJ) test sets.
- We have automatically aligned and segmented English read speech from audiobooks with the corresponding book text, and filtered out segments with noisy transcripts, in order to produce a corpus of English read speech suitable for training speech recognition systems (a toy sketch of this filtering step appears after this list).
- We have demonstrated that models trained on our corpus do better on the standard Wall Street Journal (WSJ) test sets than models built on WSJ itself: the larger size of our corpus (1000 hours, versus the 82 hours of WSJ's si-284 data) outweighs the audio mismatch.
- The authors present decoding results for models trained on various amounts of LibriSpeech data, and on WSJ data, on both the LibriSpeech and WSJ test sets.
- The authors employ language models, trained on the text material the WSJ corpus provides, in conjunction with acoustic models trained on the LibriSpeech data to decode WSJ's test sets, and compare the results with those for state-of-the-art models trained on WSJ's own si-284 set.
- The WSJ results the authors present in Table 2 are for the "open-vocabulary" (60K) test condition, using not the standard 60K-word dictionary supplied with WSJ but an extended version that the authors built to cover more of the words that appear in the WSJ language models.
- The models marked with 460h are trained on the union of the "train-clean-100" and "train-clean-360" subsets, and those marked with 960h are trained on all of LibriSpeech's training sets.
- The authors are releasing this corpus online and have introduced scripts into the Kaldi speech recognition toolkit so that others can replicate these results.
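The alignment-and-filtering step summarized above (segments whose transcripts look noisy are discarded) can be illustrated roughly as follows: decode each candidate segment, compare the hypothesis with the corresponding book text, and keep only close matches. This is a toy Python sketch under assumed details; the word-level edit-distance criterion and the 10% threshold are illustrative choices, not the exact rules used to build LibriSpeech.

```python
# Sketch of filtering candidate segments by word error rate (WER) against the
# aligned book text. The 0.1 threshold and the plain Levenshtein criterion are
# illustrative assumptions.
from typing import List

def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution
        prev = cur
    return prev[-1]

def keep_segment(book_text: str, decoded: str, max_wer: float = 0.1) -> bool:
    """Keep a segment only if the decoder output roughly matches the book text."""
    ref, hyp = book_text.lower().split(), decoded.lower().split()
    if not ref:
        return False
    return edit_distance(ref, hyp) / len(ref) <= max_wer

# Illustrative segments: (aligned book text, first-pass decoder output).
segments = [
    ("he was in his shirt sleeves", "he was in his shirt sleeves"),  # clean
    ("he was in his shirt sleeves", "he was in the short sleeve"),   # noisy transcript
]
kept = [pair for pair in segments if keep_segment(*pair)]
print(f"kept {len(kept)} of {len(segments)} segments")
```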
- Table 1: Data subsets in LibriSpeech
- Table 2: WERs on WSJ's test sets under the "open vocabulary" (60K) test condition
- Table 3: WERs on LibriSpeech's test sets; all results are obtained by rescoring with a 4-gram language model
- Table 4: LM rescoring results for the 960-hour DNN model
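Tables 3 and 4 report results obtained by rescoring with a stronger language model. As a rough illustration of the rescoring idea, the sketch below re-ranks an n-best list by adding a new LM score to each hypothesis's first-pass score. The stand-in unigram "LM", the interpolation weight, and the example hypotheses are illustrative assumptions; the actual recipe rescores Kaldi lattices with a 4-gram model rather than n-best lists.

```python
# Toy n-best rescoring: re-rank hypotheses by combining the first-pass score
# with a score from a stronger language model. The stand-in "language model"
# and the weight are illustrative assumptions only.
import math
from typing import Dict, List, Tuple

def lm_logprob(sentence: str, unigrams: Dict[str, float]) -> float:
    """Stand-in LM score: sum of unigram log-probabilities (OOVs get a floor)."""
    floor = math.log(1e-6)
    return sum(math.log(unigrams[w]) if w in unigrams else floor
               for w in sentence.split())

def rescore(nbest: List[Tuple[str, float]],
            unigrams: Dict[str, float],
            lm_weight: float = 1.0) -> List[Tuple[str, float]]:
    """Re-rank (hypothesis, first-pass score) pairs with the new LM score added."""
    rescored = [(hyp, score + lm_weight * lm_logprob(hyp, unigrams))
                for hyp, score in nbest]
    return sorted(rescored, key=lambda x: x[1], reverse=True)

# Illustrative data only: the better-LM hypothesis overtakes the first-pass best.
unigrams = {"the": 0.07, "cat": 0.01, "sat": 0.008, "mat": 0.005, "on": 0.03, "cad": 0.0001}
nbest = [("the cad sat on the mat", -11.2),   # best first-pass score
         ("the cat sat on the mat", -11.5)]
for hyp, score in rescore(nbest, unigrams):
    print(f"{score:8.2f}  {hyp}")
```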
- [1] K. Prahallad, Automatic Building of Synthetic Voices from Audio Books, Ph.D. thesis, CMU, Pittsburgh, 2010.
- [2] S. King and V. Karaiskos, "The Blizzard Challenge 2012," in Proceedings of the Blizzard Workshop, 2012.
- [3] "Creative Commons Attribution 4.0 International Public License," https://creativecommons.org/licenses/by/4.0/, November 2013.
- [4] D. Povey, A. Ghoshal, et al., "The Kaldi Speech Recognition Toolkit," in Proc. ASRU, 2011.
- [5] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357-362.
- [6] T. J. Hazen, "Automatic alignment and error correction of human generated transcripts for long speech recordings," in Proc. Interspeech, 2006.
- [7] X. Anguera, J. Luque, and C. Gracia, "Audio-to-text alignment for speech recognition with very limited resources," in Interspeech, 2014.
- [8] R. Sproat et al., "Normalization of non-standard words," Computer Speech & Language, vol. 15, no. 3, pp. 287-333, 2001.
- [9] A. Stolcke, "SRILM - An Extensible Language Modeling Toolkit," in ICSLP, 2002.
- [10] I. H. Witten and T. C. Bell, "The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression," IEEE Transactions on Information Theory, vol. 37, no. 4, 1991.
- [11] M. Bisani and H. Ney, "Joint-sequence models for grapheme-to-phoneme conversion," Speech Communication, vol. 50, no. 5, pp. 434-451, 2008.
- [12] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, "Boosted MMI for feature and model space discriminative training," in ICASSP, 2008.
- [13] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.
- [14] M. J. F. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 272-281, 1999.
- [15] T. Smith and M. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, vol. 147, no. 1, pp. 195-197, 1981.
- [16] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics Doklady, vol. 10, p. 707, 1966.
- [17] N. Braunschweiler, M. J. F. Gales, and S. Buchholz, "Lightly supervised recognition for automatic alignment of large coherent speech recordings," in INTERSPEECH, 2010, pp. 2222-2225, ISCA.
- [18] M. J. F. Gales and P. C. Woodland, "Mean and variance adaptation within the MLLR framework," Computer Speech and Language, vol. 10, pp. 249-264, 1996.
- [19] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, "A compact model for speaker-adaptive training," in ICSLP, 1996.
- [20] S. Meignier and T. Merlin, "LIUM SpkDiarization: an open source toolkit for diarization," in CMU SPUD Workshop, Dallas (Texas, USA), March 2010.
- [21] R. Kneser and H. Ney, "Improved backing-off for m-gram language modeling," in ICASSP, 1995, vol. 1, pp. 181-184.
- [22] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 1996, pp. 310-318.
- [23] X. Zhang, J. Trmal, D. Povey, and S. Khudanpur, "Improving deep neural network acoustic models using generalized maxout networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pp. 215-219.