A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

Workshop of Automatic Speech Recognition and Understanding

Organized by IEEE

Neural transducers are widely used for streaming automatic speech recognition (ASR) and speech translation (ST). In this paper, we present a Streaming Multilingual Speech Model (\(SM^2\)), a single neural transducer model that transcribes or translates speech in multiple source languages into text in the target languages. \(SM^2\) is trained on weakly supervised data created by converting speech recognition transcriptions with a machine translation model. Trained on 351 thousand hours of speech from 25 languages, \(SM^2\) achieves strong ST performance. Furthermore, we demonstrate the truly zero-shot capability of \(SM^2\) when expanding to new target languages: it generates high-quality ST output for \{source-speech, target-text\} pairs that were not seen during training.
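The weakly supervised data creation described above can be illustrated with a minimal sketch. It assumes each ASR example carries a transcript and a source-language tag, and that some machine translation function is available. The names `AsrExample`, `StExample`, `build_weak_st_pairs`, and `toy_translate` are hypothetical and do not come from the paper, and keeping the transcript unchanged when source and target language match is likewise an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class AsrExample:
    """One speech recognition training example (hypothetical schema)."""
    audio_id: str
    source_lang: str
    transcript: str


@dataclass(frozen=True)
class StExample:
    """One weakly supervised speech translation example (hypothetical schema)."""
    audio_id: str
    source_lang: str
    target_lang: str
    target_text: str  # machine-generated, hence "weak" supervision


def build_weak_st_pairs(
    asr_data: Iterable[AsrExample],
    target_lang: str,
    translate: Callable[[str, str, str], str],
) -> List[StExample]:
    """Turn ASR examples into {source-speech, target-text} pairs."""
    pairs = []
    for ex in asr_data:
        if ex.source_lang == target_lang:
            # Assumption: same-language pairs keep the transcript (plain ASR).
            target_text = ex.transcript
        else:
            # The transcript, not the audio, is passed through the MT model.
            target_text = translate(ex.transcript, ex.source_lang, target_lang)
        pairs.append(
            StExample(ex.audio_id, ex.source_lang, target_lang, target_text)
        )
    return pairs


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in for a real MT model: a tiny lookup table, for illustration only."""
    lexicon = {("de", "en"): {"hallo welt": "hello world"}}
    return lexicon.get((src, tgt), {}).get(text.lower(), f"<{tgt}> {text}")


if __name__ == "__main__":
    data = [
        AsrExample("utt1", "de", "Hallo Welt"),
        AsrExample("utt2", "en", "good morning"),
    ]
    for pair in build_weak_st_pairs(data, "en", toy_translate):
        print(pair)
```

In this sketch no human-labeled translations are needed: the audio is paired with machine-translated text, so any ASR corpus can in principle be reused for ST training, at the cost of inheriting the translation model's errors.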