A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

Jian Xue; Peidong Wang; Jinyu Li; Eric Sun

Workshop of Automatic Speech Recognition and Understanding | December 2023

Organized by IEEE

Neural transducers are widely used for streaming automatic speech recognition (ASR) and speech translation (ST). In this paper, we present a Streaming Multilingual Speech Model (\(SM^2\)), a single neural transducer model that transcribes or translates speech in multiple languages into text in the target languages. \(SM^2\) is trained on weakly supervised data created by translating speech recognition transcriptions with a machine translation model. Leveraging 351 thousand hours of speech training data from 25 languages, \(SM^2\) achieves impressive ST performance. Furthermore, we demonstrate the truly zero-shot capability of \(SM^2\) when expanding to new target languages: it generates high-quality zero-shot speech translations for \{source-speech, target-text\} pairs that were not seen during training.
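
The weak-supervision step described above can be illustrated with a minimal sketch. It is not the authors' pipeline. The names `AsrExample`, `StExample`, `make_weak_st_data`, and `toy_translate` are hypothetical, and the lookup-table translator only stands in for a real machine translation model. The idea shown is that each human-transcribed recording is paired with a machine-translated version of its transcript, so that no human-labeled translation is required.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class AsrExample:
    audio_id: str      # identifier of the speech recording
    source_lang: str   # language spoken in the recording
    transcript: str    # ASR transcription in the source language


@dataclass(frozen=True)
class StExample:
    audio_id: str
    source_lang: str
    target_lang: str
    target_text: str   # machine-translated text, used as a weak label


# Hypothetical translator signature: (text, source_lang, target_lang) -> text
Translator = Callable[[str, str, str], str]


def make_weak_st_data(
    asr_data: Iterable[AsrExample],
    translate: Translator,
    target_lang: str,
) -> List[StExample]:
    """Pair each recording with a machine translation of its transcript."""
    weak_pairs = []
    for ex in asr_data:
        target_text = translate(ex.transcript, ex.source_lang, target_lang)
        weak_pairs.append(
            StExample(ex.audio_id, ex.source_lang, target_lang, target_text)
        )
    return weak_pairs


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in for an MT model (assumption): a tiny lookup table."""
    table = {
        ("de", "en", "guten morgen"): "good morning",
        ("fr", "en", "merci beaucoup"): "thank you very much",
    }
    # Unknown inputs are returned unchanged in this toy version.
    return table.get((src, tgt, text.lower()), text)


asr_corpus = [
    AsrExample("utt-001", "de", "Guten Morgen"),
    AsrExample("utt-002", "fr", "Merci beaucoup"),
]
weak_st_corpus = make_weak_st_data(asr_corpus, toy_translate, "en")
```

In this sketch the audio itself is never touched: only the text label changes, which is why existing ASR corpora can be reused as ST training data.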