Multiframe deep neural networks for acoustic modeling

ICASSP, pp. 7582-7585, 2013.

Cited by: 49 | DOI: https://doi.org/10.1109/ICASSP.2013.6639137
This paper describes a method of tying the neural network parameters over time which achieves comparable performance to the typical frame-synchronous model, while achieving up to a 4X reduction in the computational cost of the neural network activations.

Abstract:

Deep neural networks have been shown to perform very well as acoustic models for automatic speech recognition. Compared to Gaussian mixtures however, they tend to be very expensive computationally, making them challenging to use in real-time applications. One key advantage of such neural networks is their ability to learn from very long observation windows. This paper describes a method of tying the neural network parameters over time which achieves comparable performance to the typical frame-synchronous model, while achieving up to a 4X reduction in the computational cost of the neural network activations.

Introduction
  • Deep neural networks have been shown to perform very well as acoustic models for automatic speech recognition.
  • This paper introduces another approach which takes advantage of the stationarity of the speech signal and ties neural network parameters across frames, enabling the acoustic model to run at a reduced frame rate (a minimal sketch of this idea follows this list).
  • Rather than separating the model description from the experiments, the authors will use the experiments to guide the rationale behind the approach: Section 2 describes the baseline system and shows the performance/complexity tradeoff of a typical frame-synchronous acoustic model.
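As a minimal sketch of the reduced-frame-rate idea, assuming a generic `dnn` scoring callable (the function name and shapes below are my own illustration, not the authors' code), one can evaluate the model every k-th frame and copy its scores to the skipped frames:

```python
import numpy as np

def score_at_reduced_rate(dnn, stacked_frames, k=4):
    """Evaluate the acoustic model only every k-th frame and copy its
    scores to the k - 1 skipped frames. `dnn` is any callable mapping
    a stacked input window to a vector of state scores;
    `stacked_frames` has shape (num_frames, window_dim)."""
    num_frames = stacked_frames.shape[0]
    scores = [None] * num_frames
    for t in range(0, num_frames, k):
        s = dnn(stacked_frames[t])   # one expensive DNN evaluation
        for dt in range(k):          # reuse it for up to k frames
            if t + dt < num_frames:
                scores[t + dt] = s
    return np.stack(scores)
```

With k = 4, the hidden activations are computed for only one frame in four, which is where the up-to-4X cost reduction comes from.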
Highlights
  • Deep neural networks have been shown to perform very well as acoustic models for automatic speech recognition
  • To keep computation tractable, one might have to consider limiting the size of the networks or exploring alternative architectures. This paper introduces another approach which takes advantage of the stationarity of the speech signal, and ties neural network parameters across frames, enabling the acoustic model to be run at a reduced frame rate
  • Deep neural networks (DNNs) have become increasingly popular for acoustic modeling [1]. They make it possible to effectively use many more parameters than typical Gaussian mixture models (GMMs) in several ways:
    1. a large number of shared parameters across states: while GMM parameters are only exercised when their associated state is active, DNN parameters up to the last hidden layer are shared across all states [2];
    2. wider windows of context: while GMM systems rarely benefit from using more than 10 frames (100 ms) of context around the central frame, DNNs benefit from 20 (200 ms) and up to 40 (400 ms); a minimal frame-stacking sketch follows this list;
    3. a larger number of output states: DNN systems can typically take advantage of a much larger number of output states than comparable GMM systems.
  • Since the alignments used to train the networks are inherently noisy, one can expect the neural networks to be very robust to labels being off by several frames. Running the acoustic model at a reduced frame rate and copying its predictions to neighboring frames works surprisingly well, as illustrated in Figure 4, which compares systems running the acoustic model at 1/2 and 1/4 the frame rate against frame-synchronous models of the same complexity.
  • The resulting architecture is depicted in Figure 5: the DNN has the same topology as our baseline system, but in addition to a softmax regression layer that predicts the frame label at time t, it has output layers trained jointly to predict the labels at times t − 1 through t − K.
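To make the context-window point (item 2 above) concrete, here is a minimal frame-stacking sketch; the edge-padding scheme and function name are my own assumptions rather than the paper's:

```python
import numpy as np

def stack_context(features, left=20, right=20):
    """Stack `left` and `right` context frames around each central
    frame; with 10 ms frames, left = right = 20 spans roughly 400 ms.
    `features` has shape (num_frames, feat_dim); the result has shape
    (num_frames, (left + 1 + right) * feat_dim). Edges are padded by
    repeating the first/last frame."""
    padded = np.concatenate([
        np.repeat(features[:1], left, axis=0),
        features,
        np.repeat(features[-1:], right, axis=0),
    ])
    return np.stack([padded[t:t + left + 1 + right].reshape(-1)
                     for t in range(len(features))])
```

Each stacked window is then a single input vector to the DNN; in a frame-synchronous system, one such window is scored per 10 ms frame.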
Results
  • The authors trained a collection of systems of various complexities on two datasets, US English (Voice Search [5] and voice typing) and Iberian Portuguese Voice Search, by varying the width of the hidden layers of the DNN acoustic model.
  • The traditional approach is depicted in Figure 2: overlapping stacked frames are passed to the neural network to issue a prediction synchronously at every frame.
  • This works surprisingly well, as illustrated in Figure 4: the graph depicts the performance of systems running the acoustic model at 1/2 and 1/4 the frame rate compared to frame-synchronous models of the same complexity.
  • It is interesting to note how well this frame-rate reduction performs in the context of a DNN on a very large task, and increasingly so as the acoustic model and training data get larger.
  • Computing acoustic scores every 4 frames did yield a better operating point on English, but not on Portuguese.
  • Since the last layer of a DNN can be computed on-demand at decoding time and scores can be batched [3], there are fewer efficiency gains to be obtained from running the final layer of the DNN at a lower frame rate.
  • This suggests that training a DNN which shares all its hidden parameters, but uses frame-synchronous output layers might be a good tradeoff.
  • The resulting architecture is depicted in Figure 5: the DNN has the same topology as the baseline system, but in addition to a softmax regression layer that predicts the frame label at time t, it has output layers trained jointly to predict the labels at times t − 1 through t − K (a minimal code sketch follows this list).
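A minimal sketch of such a multiframe network, using the 7-layer, 2000-node topology of the benchmark system described under Findings below; the state count (8000 here), the ReLU nonlinearity, and all names are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    """A randomly initialized weight matrix and bias vector."""
    return rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MultiframeDNN:
    """A shared hidden stack evaluated once per step, with K + 1
    softmax heads predicting the labels at times t, t - 1, ..., t - K."""
    def __init__(self, in_dim, hidden=2000, n_hidden=7,
                 n_states=8000, k=3):
        dims = [in_dim] + [hidden] * n_hidden
        self.hidden = [layer(a, b) for a, b in zip(dims, dims[1:])]
        self.heads = [layer(hidden, n_states) for _ in range(k + 1)]

    def forward(self, x):
        for w, b in self.hidden:  # the expensive trunk, computed once
            x = relu(x @ w + b)   # and shared by all K + 1 predictions
        return [softmax(x @ w + b) for w, b in self.heads]
```

Because the trunk is evaluated once per K + 1 frames while only the cheap heads differ per frame, the hidden-layer cost drops by roughly a factor of K + 1.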
Conclusion
  • Training such DNNs can be performed by backpropagating the errors from all output layers jointly through the network, bearing in mind that the increased gradient magnitudes may require reducing the overall learning rate (see the sketch after this list).
  • This demonstrates that the multiframe prediction architecture can compete with frame-synchronous systems with far fewer parameters.
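A sketch of that joint training objective, reusing the hypothetical MultiframeDNN above; summing per-head cross-entropies is my reading of the bullet, and the learning-rate scaling shown is a plausible heuristic rather than a value from the paper:

```python
import numpy as np

def joint_loss(head_probs, labels):
    """Sum of per-head cross-entropies: the errors of every output
    layer are backpropagated jointly through the shared hidden stack."""
    return sum(-np.log(p[y] + 1e-12)
               for p, y in zip(head_probs, labels))

# The summed gradients are larger in magnitude than a single head's,
# so one might scale down: lr = base_lr / num_heads, for example.
```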
Tables
  • Table 1: Word error rates (%) for neural networks trained as multiframe predictors. The multiframe acoustic model's hidden activations are computed every 2 or 4 frames, giving complexities approximately equivalent to 1/2 and 1/4 of the frame-synchronous model's complexity.
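As a back-of-the-envelope check on those complexity ratios (my own arithmetic, counting only the hidden-to-hidden multiply-adds of a 7-layer, 2000-node network and ignoring the input and output layers):

```python
# 6 hidden-to-hidden weight matrices of size 2000 x 2000.
per_eval = 6 * 2000 * 2000
for stride in (1, 2, 4):  # evaluate the hidden stack every `stride` frames
    per_frame = per_eval // stride
    print(f"every {stride} frame(s): {per_frame:,} MACs/frame "
          f"({per_frame / per_eval:.2f}x)")
```

Running the hidden stack every 2 or 4 frames halves or quarters the dominant per-frame cost, matching the 1/2 and 1/4 complexity columns of the table.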
Reference
  • G. Hinton et al., “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, 2012.
Findings
  • We evaluated the performance of the approach on a server-based recognizer running a 7-layer, 2000-nodes-per-layer US English system and a large-vocabulary language model. The system implements the multiframe architecture, but for the purpose of benchmarking, the same output layer was used for each time step. For a system of that size trained on a large amount of data, the performance gain from training distinct output layers is negligible.