---
tags:
- espnet
- audio
- automatic-speech-recognition
- speech-translation
- language-identification
language: multilingual
datasets:
- owsm_v3.2_ctc
license: cc-by-4.0
---

[OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.

It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the [Open Whisper-style Speech Model (OWSM)](https://arxiv.org/abs/2401.16658) project.

This model is initialized with [OWSM-CTC v3.1](https://huggingface.co/pyf98/owsm_ctc_v3.1_1B) and then fine-tuned on [v3.2 data](https://arxiv.org/abs/2406.09282) for 225k steps.

To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:

```
librosa
torch
espnet
espnet_model_zoo
```

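For example, these dependencies can be installed with `pip` (a minimal sketch; no version pins are given above, so adjust them to your environment):

```bash
pip install librosa torch espnet espnet_model_zoo
```
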
We use FlashAttention during training, but it is not required for inference. If you want to install it, run:

```bash
pip install flash-attn --no-build-isolation
```

**Example usage can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
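
Below is a minimal inference sketch using the `Speech2TextGreedySearch` interface from ESPnet's `s2t_inference_ctc` module. The Hugging Face model ID, special tokens, and decoding parameters shown here are illustrative assumptions; consult the recipe above for the exact values:

```python
# A hedged sketch of long-form ASR with OWSM-CTC; the model ID below
# ("espnet/owsm_ctc_v3.2_ft_1B") and parameter values are assumptions,
# not confirmed by this card. See the ESPnet recipe for exact usage.
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",   # assumed Hugging Face model ID
    device="cuda",                   # use "cpu" if no GPU is available
    generate_interctc_outputs=False,
    lang_sym="<eng>",                # source language token
    task_sym="<asr>",                # <asr> for recognition; ST uses a translation task token
)

# batch_decode chunks a long recording, decodes the chunks in batches,
# and concatenates the greedy CTC hypotheses into a single transcript.
text = s2t.batch_decode(
    "audio.wav",              # path to an audio file or a 1-D array
    batch_size=16,
    context_len_in_secs=4,    # context around each chunk, in seconds
)
print(text)
```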