LibriSpeech: An ASR Corpus Based on Public Domain Audio Books

2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5206-5210, 2015

Abstract

This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separate…

Introduction
  • The rapid increase in the amount of multimedia content on the Internet in recent years makes it feasible to automatically collect data for the purpose of training statistical models.
  • This is especially true when the source data is already organized into well-curated, machine-readable collections.
  • The volunteer-supported speech-gathering effort VoxForge, on which the acoustic models the authors used for alignment were trained, contains a certain amount of LibriVox audio, but that dataset is much smaller than the one the authors present here, with around 100 hours of English speech, and suffers from major gender and per-speaker duration imbalances.
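The gender and per-speaker duration imbalances noted above are straightforward to quantify from a corpus manifest. The sketch below is illustrative only; the manifest tuples and the function name are assumptions, not anything shipped with LibriSpeech or VoxForge:

```python
from collections import defaultdict

def balance_report(utterances):
    """Summarize per-speaker duration and per-gender hours for a
    manifest given as (speaker_id, gender, duration_seconds) tuples."""
    per_speaker = defaultdict(float)
    per_gender = defaultdict(float)
    for spk, gender, secs in utterances:
        per_speaker[spk] += secs
        per_gender[gender] += secs
    minutes = sorted(s / 60 for s in per_speaker.values())
    return {
        "speakers": len(per_speaker),
        "hours_per_gender": {g: s / 3600 for g, s in per_gender.items()},
        "min_minutes_per_speaker": minutes[0],
        "max_minutes_per_speaker": minutes[-1],
    }

# Toy manifest: two male speakers dominate a single female speaker.
report = balance_report([
    ("spk1", "M", 1200.0),
    ("spk2", "F", 300.0),
    ("spk3", "M", 2400.0),
])
```

A balanced corpus design in this spirit would cap the per-speaker minutes and keep the per-gender hours roughly equal, which is the kind of balance LibriSpeech's subsets were designed around.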
Highlights
  • The rapid increase in the amount of multimedia content on the Internet in recent years makes it feasible to automatically collect data for the purpose of training statistical models
  • While the use of audio books for building synthetic voices [1, 2] has previously been investigated, we are not aware of any freely available read speech corpus in English that is suitable for training and testing speech recognition systems, and which is as large scale as the one we present here
  • The volunteer-supported speech-gathering effort VoxForge, on which the acoustic models we used for alignment were trained, contains a certain amount of LibriVox audio, but that dataset is much smaller than the one we present here, with around 100 hours of English speech, and suffers from major gender and per-speaker duration imbalances
  • Finally, in Section 5 we present experimental results on models trained on this data set, using both the LibriSpeech dev and test sets and Wall Street Journal (WSJ) [5] test sets
  • We have automatically aligned and segmented English read speech from audiobooks with the corresponding book text, and filtered out segments with noisy transcripts, in order to produce a corpus of English read speech suitable for training speech recognition systems
  • We have demonstrated that models trained with our corpus do better on the standard Wall Street Journal (WSJ) test sets than models built on WSJ itself - the larger size of our corpus (1000 hours, versus the 82 hours of WSJ's si-284 data) outweighs the audio mismatch
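The align-and-filter step in the highlights can be illustrated with a word-level edit distance: segments whose first-pass decode disagrees too much with the aligned book text are dropped. This is a minimal sketch of the idea only; the function names and the 10% threshold are assumptions, and the paper's actual pipeline (which uses Smith-Waterman alignment against the book text) is considerably more elaborate:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def keep_segment(decoded, book_text, max_wer=0.1):
    """Keep a segment only if the first-pass decode closely matches
    the aligned book text (a hypothetical stand-in for the filtering)."""
    ref = book_text.split()
    return edit_distance(decoded.split(), ref) / max(len(ref), 1) <= max_wer
```

With this sketch, a decode identical to the book text keeps the segment, while two word errors out of six words push the mismatch rate past the threshold and reject it.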
Methods
  • The authors present decoding results using models trained on various amounts of LibriSpeech data, and on WSJ data, on both LibriSpeech and WSJ test sets.
  • The authors employ language models trained on the text material the WSJ corpus provides, in conjunction with acoustic models trained on the LibriSpeech data, to decode WSJ's test sets, and compare the results with those for state-of-the-art models trained on WSJ's own si-284 set.
  • The WSJ results the authors present in Table 2 are for the "open-vocabulary" (60K) test condition, using not the standard 60K-word dictionary supplied with WSJ but an extended version that the authors built to cover more of the words that appear in the WSJ language models.
  • The models marked with 460h are trained on the union of the "train-clean-100" and "train-clean-360" subsets, and those marked with 960h are trained on all of LibriSpeech's training sets
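The LM rescoring used in the results (first-pass decoding with a smaller LM, then re-ranking hypotheses with a stronger model such as a 4-gram) can be sketched as re-scoring an n-best list. Everything below — `toy_lm`, the vocabulary, the scores, and the weight — is a hypothetical stand-in, not the paper's actual models:

```python
def rescore(nbest, lm_score, lm_weight=1.0):
    """Re-rank an n-best list of (text, acoustic_logprob) pairs by the
    combined acoustic score and a stronger language model's score."""
    return sorted(nbest,
                  key=lambda h: h[1] + lm_weight * lm_score(h[0]),
                  reverse=True)

# Toy "strong LM": a real rescoring pass would use 4-gram
# log-probabilities; here we just penalize out-of-vocabulary words.
VOCAB = {"the", "cat", "sat"}
def toy_lm(text):
    return sum(0.0 if w in VOCAB else -5.0 for w in text.split())

# The first pass slightly prefers the misrecognition "cap";
# the stronger LM flips the ranking.
nbest = [("the cap sat", -10.0), ("the cat sat", -10.5)]
best = rescore(nbest, toy_lm)[0][0]
```

Setting `lm_weight=0.0` recovers the first-pass ranking, which makes the contribution of the rescoring LM easy to isolate.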
Conclusion
  • The authors have automatically aligned and segmented English read speech from audiobooks with the corresponding book text, and filtered out segments with noisy transcripts, in order to produce a corpus of English read speech suitable for training speech recognition systems.
  • The authors have demonstrated that models trained with the corpus do better on the standard Wall Street Journal (WSJ) test sets than models built on WSJ itself - the larger size of the corpus (1000 hours, versus the 82 hours of WSJ's si-284 data) outweighs the audio mismatch
  • The authors are releasing this corpus online and have introduced scripts into the Kaldi speech recognition toolkit so that others can replicate these results
Tables
  • Table 1: Data subsets in LibriSpeech
  • Table 2: WERs on WSJ's test sets under the "open vocabulary" (60K) test condition
  • Table 3: WERs on LibriSpeech's test sets; all results are obtained by rescoring with a 4-gram language model
  • Table 4: LM rescoring results for the 960-hour DNN model
References
  • [1] K. Prahallad, "Automatic Building of Synthetic Voices from Audio Books," Ph.D. thesis, CMU, Pittsburgh, 2010.
  • [2] S. King and V. Karaiskos, "The Blizzard Challenge 2012," in Proceedings of the Blizzard Workshop, 2012.
  • [3] "Creative Commons Attribution 4.0 International Public License," https://creativecommons.org/licenses/by/4.0/, November 2013.
  • [4] D. Povey, A. Ghoshal, et al., "The Kaldi Speech Recognition Toolkit," in Proc. ASRU, 2011.
  • [5] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language, Association for Computational Linguistics, 1992, pp. 357-362.
  • [6] T. J. Hazen, "Automatic alignment and error correction of human generated transcripts for long speech recordings," in Proc. Interspeech, 2006.
  • [7] X. Anguera, J. Luque, and C. Gracia, "Audio-to-text alignment for speech recognition with very limited resources," in Interspeech, 2014.
  • [8] R. Sproat et al., "Normalization of non-standard words," Computer Speech & Language, vol. 15, no. 3, pp. 287-333, 2001.
  • [9] A. Stolcke, "SRILM - An Extensible Language Modeling Toolkit," in ICSLP, 2002.
  • [10] I. H. Witten and T. C. Bell, "The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression," IEEE Transactions on Information Theory, vol. 37, no. 4, 1991.
  • [11] M. Bisani and H. Ney, "Joint-sequence models for grapheme-to-phoneme conversion," Speech Communication, vol. 50, no. 5, pp. 434-451, 2008.
  • [12] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, "Boosted MMI for Feature and Model Space Discriminative Training," in ICASSP, 2008.
  • [13] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.
  • [14] M. J. F. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 272-281, 1999.
  • [15] T. Smith and M. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, vol. 147, no. 1, pp. 195-197, 1981.
  • [16] V. I. Levenshtein, "Binary Codes Capable of Correcting Deletions, Insertions and Reversals," Soviet Physics Doklady, vol. 10, p. 707, 1966.
  • [17] N. Braunschweiler, M. J. F. Gales, and S. Buchholz, "Lightly supervised recognition for automatic alignment of large coherent speech recordings," in Interspeech, ISCA, 2010, pp. 2222-2225.
  • [18] M. J. F. Gales and P. C. Woodland, "Mean and Variance Adaptation Within the MLLR Framework," Computer Speech and Language, vol. 10, pp. 249-264, 1996.
  • [19] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, "A Compact Model for Speaker-Adaptive Training," in ICSLP, 1996.
  • [20] S. Meignier and T. Merlin, "LIUM SpkDiarization: an open source toolkit for diarization," in CMU SPUD Workshop, Dallas, Texas, USA, March 2010.
  • [21] R. Kneser and H. Ney, "Improved backing-off for m-gram language modeling," in ICASSP, 1995, vol. 1, pp. 181-184.
  • [22] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 1996, pp. 310-318.
  • [23] X. Zhang, J. Trmal, D. Povey, and S. Khudanpur, "Improving deep neural network acoustic models using generalized maxout networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pp. 215-219.