
OWSM: Open Whisper-style Speech Model

OWSM is an Open Whisper-style Speech Model from CMU WAVLab. It reproduces Whisper-style training using publicly available data and the open-source toolkit ESPnet.

Our demo is available here. The project page contains various resources.

OWSM v3 has 889M parameters and is trained on 180k hours of public speech data. It supports various speech-to-text tasks:

  • Speech recognition
  • Any-to-any-language speech translation
  • Utterance-level alignment
  • Long-form transcription
  • Language identification
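As a rough illustration of how one model can cover several of the tasks above, the sketch below selects a task through Whisper-style prompt symbols. The symbol format (`<eng>`, `<asr>`, `<st_deu>`), the helper names `owsm_symbols` and `transcribe`, and the keyword arguments passed to `Speech2Text.from_pretrained` are assumptions based on ESPnet's speech-to-text inference interface, not details stated in this card; check them against the ESPnet version you have installed. The sketch also assumes a short 16 kHz mono recording; longer or shorter inputs may need padding or the long-form decoding path.

```python
from typing import Optional, Tuple


def owsm_symbols(src_lang: str, tgt_lang: Optional[str] = None) -> Tuple[str, str]:
    """Return (lang_sym, task_sym) for a Whisper-style prompt.

    Assumed convention: "<eng>" marks the spoken language, "<asr>" requests
    transcription, and "<st_deu>" requests translation into German.
    """
    lang_sym = f"<{src_lang}>"
    if tgt_lang is None or tgt_lang == src_lang:
        task_sym = "<asr>"  # same language in and out: plain recognition
    else:
        task_sym = f"<st_{tgt_lang}>"  # speech translation into tgt_lang
    return lang_sym, task_sym


def transcribe(wav_path: str, src_lang: str = "eng",
               tgt_lang: Optional[str] = None) -> str:
    """Decode one short utterance (hypothetical helper).

    Requires espnet, espnet_model_zoo, and soundfile to be installed.
    """
    # Imported lazily so owsm_symbols() works without ESPnet installed.
    import soundfile as sf
    from espnet2.bin.s2t_inference import Speech2Text

    lang_sym, task_sym = owsm_symbols(src_lang, tgt_lang)
    s2t = Speech2Text.from_pretrained(
        model_tag="espnet/owsm_v3",
        lang_sym=lang_sym,  # assumed keyword argument
        task_sym=task_sym,  # assumed keyword argument
        beam_size=5,
    )
    speech, _rate = sf.read(wav_path)  # assumed: 16 kHz mono audio
    text, *_ = s2t(speech)[0]  # first entry of the n-best list
    return text
```

For example, `owsm_symbols("eng")` selects recognition of English speech, while `owsm_symbols("eng", "deu")` selects English-to-German translation; `transcribe` then passes the pair to the model.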

For more details and results, please check out our paper (Peng et al., ASRU 2023).

Citing OWSM and ESPnet

@article{peng2023owsm,
  title={Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data},
  author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
  journal={arXiv preprint arXiv:2309.13876},
  year={2023}
}

@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  booktitle={Proceedings of Interspeech},
  year={2018}
}