---
tags:
- espnet
- audio
- automatic-speech-recognition
- speech-translation
language: multilingual
datasets:
- mixed_v3
license: cc-by-4.0
---

## OWSM: Open Whisper-style Speech Model

[OWSM](https://arxiv.org/abs/2309.13876) is an Open Whisper-style Speech Model from [CMU WAVLab](https://www.wavlab.org/). It reproduces Whisper-style training using publicly available data and an open-source toolkit [ESPnet](https://github.com/espnet/espnet).

Our demo is available [here](https://huggingface.co/spaces/pyf98/OWSM_v3_demo). The [project page](https://www.wavlab.org/activities/2024/owsm/) contains various resources.

OWSM v3 has 889M parameters and is trained on 180k hours of public speech data. It supports various speech-to-text tasks:

- Speech recognition
- Any-to-any-language speech translation
- Utterance-level alignment
- Long-form transcription
- Language identification

For more details and results, please check out our [paper](https://arxiv.org/abs/2309.13876) (Peng et al., ASRU 2023).

### Citing OWSM and ESPnet

```BibTex
@article{peng2023owsm,
  title={Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data},
  author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
  journal={arXiv preprint arXiv:2309.13876},
  year={2023}
}
@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
```
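
### Example: selecting a task at inference time

The tasks listed earlier are selected with Whisper-style special symbols. The sketch below is not taken from this card and is a minimal illustration only. It assumes that ESPnet and `espnet_model_zoo` are installed, that the model is published on the Hub as `espnet/owsm_v3`, that languages are written as ISO 639-3 codes in angle brackets (such as `<eng>`), and that translation uses a `<st_xxx>` task symbol with the target language code. The helper names `lang_symbol`, `task_symbol`, and `load_owsm` are hypothetical.

```python
from typing import Optional


def lang_symbol(iso_code: str) -> str:
    """Wrap an ISO 639-3 code as a language symbol, e.g. 'eng' -> '<eng>'."""
    return f"<{iso_code}>"


def task_symbol(target_lang: Optional[str] = None) -> str:
    """Return '<asr>' for recognition, or '<st_xxx>' for translation into xxx.

    The '<st_xxx>' naming is an assumption; check the model's token list.
    """
    return "<asr>" if target_lang is None else f"<st_{target_lang}>"


def load_owsm(source_lang: str = "eng",
              target_lang: Optional[str] = None,
              device: str = "cpu"):
    """Build an ESPnet Speech2Text object configured for one task."""
    # Imported lazily so the helpers above work without ESPnet installed.
    from espnet2.bin.s2t_inference import Speech2Text

    return Speech2Text.from_pretrained(
        model_tag="espnet/owsm_v3",  # assumed Hub ID for this model
        device=device,
        beam_size=5,
        lang_sym=lang_symbol(source_lang),
        task_sym=task_symbol(target_lang),
    )
```

Under these assumptions, `load_owsm("eng")` would configure English speech recognition and `load_owsm("eng", target_lang="deu")` English-to-German translation. The returned object is then called on a mono waveform array (16 kHz is assumed here) and returns an n-best list of hypotheses.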