OWSM: Open Whisper-style Speech Model

OWSM (Open Whisper-style Speech Model) is developed by CMU WAVLab. It reproduces Whisper-style training using publicly available data and the open-source ESPnet toolkit.

Our demo is available here. The project page contains various resources.

OWSM v3 has 889M parameters and is trained on 180k hours of public speech data. It supports various speech-to-text tasks:

  • Speech recognition
  • Any-to-any-language speech translation
  • Utterance-level alignment
  • Long-form transcription
  • Language identification
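
In Whisper-style models, a single network handles several tasks, and the task is typically selected by special tokens placed at the start of the decoder input. The sketch below illustrates that idea for recognition versus translation. It is not the ESPnet API: `TaskPrompt` and `make_prompt` are hypothetical names, and the token spellings (`<eng>`, `<asr>`, `<st_zho>`) are assumptions made for illustration, so check the released model's vocabulary before relying on them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskPrompt:
    """Hypothetical container for the special tokens that select a task."""

    lang_sym: str  # language of the input speech, e.g. "<eng>" (assumed spelling)
    task_sym: str  # task to perform, e.g. "<asr>" or "<st_zho>" (assumed spelling)


def make_prompt(src_lang: str, tgt_lang: Optional[str] = None) -> TaskPrompt:
    """Build the token pair for recognition or translation.

    With no target language (or the same one as the source) the task is
    speech recognition; otherwise it is translation into `tgt_lang`.
    """
    lang_sym = f"<{src_lang}>"
    if tgt_lang is None or tgt_lang == src_lang:
        return TaskPrompt(lang_sym, "<asr>")
    return TaskPrompt(lang_sym, f"<st_{tgt_lang}>")


# Recognition of English speech, then English-to-Chinese translation.
asr = make_prompt("eng")
st = make_prompt("eng", "zho")
print(asr.lang_sym + asr.task_sym)
print(st.lang_sym + st.task_sym)
```

Keeping the language and the task as two separate symbols is what makes any-to-any translation a matter of changing the prompt rather than swapping models.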

For more details and results, please check out our paper (Peng et al., ASRU 2023).

Citing OWSM and ESPnet

