Toward Human Parity in Conversational Speech Recognition
IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 2410-2423, 2017.
In the area of speech recognition, much of the pioneering early work was driven by a series of carefully designed tasks with DARPA-funded data sets publicly released by the Linguistic Data Consortium and the National Institute of Standards and Technology.
Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure a human error rate on the widely used NIST 2000 test set for commercial bulk transcription. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data […]
- Recent years have seen human performance levels reached or surpassed in tasks ranging from the games of chess and Go, to simple speech recognition tasks like carefully read newspaper speech and rigidly constrained small-vocabulary tasks in noise.
- The current state of the art now achieves error rates below the human error level present in the training data for the underlying models.
- We found that using networks with more than six layers did not improve the word error rate on the development set, and chose 512 hidden units, per direction, per layer, as that provided a reasonable trade-off between training time and final model accuracy.
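The shape of such a bidirectional LSTM stack can be sketched as follows. This is a toy numpy forward pass, not the authors' implementation; dimensions are shrunk for illustration (the paper's model uses 512 units per direction across six layers), and all parameter values here are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(x, Wx, Wh, b):
    """One unidirectional LSTM over x of shape (T, d_in); returns (T, H)."""
    T, H = x.shape[0], Wh.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    out = np.zeros((T, H))
    for t in range(T):
        z = x[t] @ Wx + h @ Wh + b            # all four gate pre-activations
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
        g = np.tanh(z[3*H:])
        c = f * c + i * g
        h = o * np.tanh(c)
        out[t] = h
    return out

def make_layers(d_in, hidden, n_layers, rng):
    """Random placeholder parameters for a stack of bidirectional layers."""
    layers = []
    for i in range(n_layers):
        d = d_in if i == 0 else 2 * hidden    # deeper layers see both directions
        layers.append(tuple(
            p for _ in range(2)               # forward and backward parameters
            for p in (0.1 * rng.standard_normal((d, 4 * hidden)),
                      0.1 * rng.standard_normal((hidden, 4 * hidden)),
                      np.zeros(4 * hidden))))
    return layers

def blstm_stack(x, layers):
    """Run each bidirectional layer, concatenating the two directions."""
    for Wxf, Whf, bf, Wxb, Whb, bb in layers:
        fwd = lstm_pass(x, Wxf, Whf, bf)
        bwd = lstm_pass(x[::-1], Wxb, Whb, bb)[::-1]
        x = np.concatenate([fwd, bwd], axis=1)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 40))              # 5 frames of 40-dim features
y = blstm_stack(x, make_layers(40, 16, 6, rng))
print(y.shape)                                # (5, 32)
```

Each layer's output is twice the hidden size because the forward and backward passes are concatenated per frame; doubling layers past six mainly increases training time.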
- The overall effect of this process is to make the training algorithm prefer models that have correlated neurons, and to improve the word error rate of the acoustic model.
- After obtaining good results with RNN-LMs, we explored the LSTM recurrent network architecture for language modeling, inspired by recent work showing gains over RNN-LMs for conversational speech recognition.
- The 4-gram language model for decoding was trained on the available CTS transcripts from the DARPA EARS program: Switchboard (3M words), BBN Switchboard-2 transcripts (850k), Fisher (21M), English CallHome (200k), and the University of Washington conversational Web corpus (191M).
- The RNN and LSTM LMs were trained on Switchboard and Fisher transcripts as in-domain data (20M words for gradient computation, 3M for validation).
- The total gain relative to a purely N-gram-based system is a 20% relative error reduction with RNN-LMs, and 23% with LSTM-LMs. As shown later, the gains with different acoustic models are similar.
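The mechanics of such rescoring can be sketched as below: the neural LM scores the decoder's candidate hypotheses, its probability is linearly interpolated with the N-gram LM's, and the hypothesis with the best combined score wins. The `ngram_w` weight and all numbers here are hypothetical; in practice the weight would be tuned on held-out data.

```python
import math

def rescore(nbest, ngram_w=0.5):
    """Pick the best hypothesis after interpolating N-gram and LSTM LM scores.

    nbest: list of (words, acoustic_logprob, ngram_logprob, lstm_logprob).
    The two LM probabilities are linearly interpolated in probability space;
    `ngram_w` is a hypothetical interpolation weight.
    """
    def total(hyp):
        _, ac, ng, ls = hyp
        lm = math.log(ngram_w * math.exp(ng) + (1 - ngram_w) * math.exp(ls))
        return ac + lm                        # combined acoustic + LM score
    return max(nbest, key=total)[0]

# Illustrative numbers only: the second hypothesis scores slightly better
# acoustically, but both LMs strongly prefer the first.
nbest = [
    ("i think so", -10.2, math.log(0.020), math.log(0.050)),
    ("i thing so", -10.0, math.log(0.001), math.log(0.002)),
]
print(rescore(nbest))  # i think so
```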
- The performance of all our component models is shown in Table VIII, along with the BLSTM combination, the full system combination results, and the measured human transcriber error rate.
- TABLE VIII WORD ERROR RATES (%) ON THE NIST 2000 CTS TEST SET WITH DIFFERENT ACOUSTIC MODELS, AND HUMAN ERROR RATE FOR COMPARISON.
- On the language modeling side, we achieve a performance boost by combining multiple LSTM-LMs in both forward and backward directions, and by using a two-phase training regimen to get best results from out-of-domain data.
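The forward/backward combination above amounts to scoring each sentence twice, once left-to-right and once right-to-left, and averaging the log-probabilities. The sketch below uses toy add-one-smoothed bigram models as stand-ins for the LSTM-LMs (all counts and the `alpha` weight are hypothetical), but the combination step itself is the same.

```python
import math

def make_bigram_lm(counts):
    """Toy add-one-smoothed bigram LM; returns a sentence log-prob function."""
    vocab = {w for pair in counts for w in pair}
    totals = {}
    for (a, _b), c in counts.items():
        totals[a] = totals.get(a, 0) + c
    V = len(vocab)
    def logprob(words):
        return sum(math.log((counts.get((a, b), 0) + 1) / (totals.get(a, 0) + V))
                   for a, b in zip(words, words[1:]))
    return logprob

def combined_score(words, fwd_lm, bwd_lm, alpha=0.5):
    """Weighted sum of forward and backward sentence log-probabilities."""
    return alpha * fwd_lm(words) + (1 - alpha) * bwd_lm(list(reversed(words)))

# Tiny stand-in corpus: the backward model is trained on reversed word order.
fwd_counts = {("we", "can"): 3, ("can", "go"): 2, ("can", "see"): 1}
bwd_counts = {(b, a): c for (a, b), c in fwd_counts.items()}
fwd_lm, bwd_lm = make_bigram_lm(fwd_counts), make_bigram_lm(bwd_counts)

good = combined_score(["we", "can", "go"], fwd_lm, bwd_lm)
bad = combined_score(["go", "can", "we"], fwd_lm, bwd_lm)
```

Because the two directions make partly independent errors, the combined score separates fluent from disfluent word orders more sharply than either direction alone.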
- For our best CNN system, LSTM-LM rescoring yields a relative word error reduction of 23%, and a 20% relative gain for the combined recognition system, considerably larger than previously reported for conversational speech recognition.
- The same speakers tend to be relatively easy or hard to recognize for both humans and machines, and the same kinds of short function words tend to be substituted, deleted, or inserted in errors.
- For the Switchboard genre, our results support the conclusion that state-of-the-art speech recognition technology can reach a level comparable to humans in both quantitative and qualitative terms, when given sufficient and matched training data.
- Inspired by the human auditory cortex, where neighboring neurons tend to simultaneously activate, we employ a spatial smoothing technique to improve the accuracy of our LSTM models.
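One simple way to realize such smoothing is an extra penalty term in the training loss on differences between adjacent hidden units, so that gradient descent prefers locally correlated neuron layouts. The sketch below is a simplified 1-D version, not the paper's exact formulation, and `weight` is a hypothetical regularization strength.

```python
import numpy as np

def spatial_smoothing_penalty(h, weight=0.001):
    """Regularizer favoring correlated neighboring neurons.

    h: (T, H) hidden activations for one utterance. Penalizes squared
    differences between adjacent units along the hidden dimension.
    `weight` is a hypothetical regularization strength.
    """
    diffs = h[:, 1:] - h[:, :-1]
    return weight * float(np.sum(diffs ** 2))

rng = np.random.default_rng(0)
rough = rng.standard_normal((10, 64))          # uncorrelated neighboring units
smooth = 0.1 * np.cumsum(rough, axis=1)        # neighbors differ only slightly
print(spatial_smoothing_penalty(smooth) < spatial_smoothing_penalty(rough))  # True
```

Adding this term to the acoustic-model loss pushes the optimizer toward solutions with correlated neurons, which is the effect described in the bullet above.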