Conformer-Based on-Device Streaming Speech Recognition with KD Compression and Two-Pass Architecture

2022 IEEE Spoken Language Technology Workshop (SLT), 2023

Abstract
This paper introduces a two-pass on-device automatic speech recognition (ASR) system developed for commercialized devices. The first pass of the system is based on a causal Conformer-transducer model that generates partial results from the input audio stream. After the first pass processes an entire input utterance, the candidates for the final result are rescored with a full-context attention model in the second pass. To minimize the computational overhead of rescoring, we compress the full-context model by applying knowledge distillation (KD). The total model size is reduced by 35% after KD, with a 0.02% absolute loss in word error rate (WER). We also introduce decoding techniques to boost accuracy on test cases that are mismatched with the distribution of the training set. These techniques include on-device personal adaptation, spelling correction, and handling of incorrectly segmented speech, which address critical issues for production-grade systems. The whole system, including the two-pass end-to-end (E2E) model and a language model (LM), occupies 72 MB of storage after 8-bit quantization. We demonstrate the entire system on mobile devices and report results on test sets collected from the production environment. The developed system achieves 5.65% WER, a 39% relative improvement over the baseline system.
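The abstract does not specify the exact KD objective or the rule for combining first-pass and second-pass scores. The sketch below illustrates one common formulation under those assumptions: temperature-scaled KL distillation from the full-context teacher to a compressed student rescorer, and a linear interpolation of first-pass transducer scores with second-pass rescoring scores. All names, weights, and the temperature are hypothetical.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-label KL term (teacher -> student) with the usual CE loss.

    student_logits, teacher_logits: (batch, vocab) unnormalized scores.
    targets: (batch,) ground-truth token indices.
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1 - alpha) * ce


def rescore(nbest, first_pass_scores, rescorer_scores, weight=0.5):
    """Pick the hypothesis with the best interpolated first/second-pass score."""
    combined = [(1 - weight) * s1 + weight * s2
                for s1, s2 in zip(first_pass_scores, rescorer_scores)]
    best = max(range(len(nbest)), key=lambda i: combined[i])
    return nbest[best]


if __name__ == "__main__":
    # Toy rescoring example over a 3-hypothesis n-best list.
    hyps = ["call mom", "cal mom", "call tom"]
    print(rescore(hyps, [-1.2, -1.1, -1.5], [-0.4, -0.9, -0.8]))
```

The interpolation weight and the distillation temperature are typically tuned on a development set; the abstract only reports the end result (35% size reduction at a 0.02% absolute WER cost), not the specific hyperparameters.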
Keywords
on-device speech recognition, Conformer, knowledge distillation, streaming speech recognition