ASTRA: Aligning Speech and Text Representations for ASR Without Sampling
Interspeech 2024 (2024)
Abstract
This paper introduces ASTRA, a novel method for improving Automatic Speech Recognition (ASR) through text injection. Unlike prevailing techniques, ASTRA eliminates the need for sampling to match sequence lengths between the speech and text modalities. Instead, it leverages the inherent alignments learned within CTC/RNNT models. This approach offers two advantages: it avoids the potential misalignment between speech and text features that could arise from upsampling, and it eliminates the need for models to accurately predict the duration of sub-word tokens. This novel formulation of modality (length) matching as a weighted RNNT objective matches the performance of state-of-the-art duration-based methods on the FLEURS benchmark, while opening up other avenues of research in speech processing.
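The misalignment concern raised in the abstract can be illustrated with a toy example. The sketch below is not ASTRA's method and is not taken from the paper; it only shows, under assumed values, how repeating every text token a fixed number of times can disagree frame by frame with the durations the tokens actually have in speech. The token list, the per-token frame counts, and all function names are hypothetical.

```python
def upsample_fixed(tokens, factor):
    """Repeat every token the same number of times (fixed-rate upsampling)."""
    return [tok for tok in tokens for _ in range(factor)]


def expand_by_duration(tokens, durations):
    """Repeat each token for its own number of frames."""
    return [tok for tok, d in zip(tokens, durations) for _ in range(d)]


def frame_mismatch(a, b):
    """Count positions where two equal-length frame sequences disagree."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical sub-word tokens and hypothetical per-token frame counts.
tokens = ["h", "e", "llo"]
true_durations = [1, 2, 6]

reference = expand_by_duration(tokens, true_durations)  # 9 frames
upsampled = upsample_fixed(tokens, 3)                   # also 9 frames

# Same total length, but several frames carry the wrong token.
mismatch = frame_mismatch(reference, upsampled)
print(mismatch, "of", len(reference), "frames disagree")
```

Both sequences have the same length, so length matching alone does not guarantee that text features line up with the corresponding speech frames. This is the gap that duration prediction, or the learned CTC/RNNT alignments the abstract refers to, is meant to close.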