File size: 1,096 Bytes
b0add0d c2aaf6e b0add0d c637f39 b0add0d c637f39 b0add0d c637f39 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
---
tags:
- espnet
- audio
- automatic-speech-recognition
- speech-translation
- language-identification
language: multilingual
datasets:
- owsm_v3.2_ctc
base_model:
- espnet/owsm_ctc_v3.1_1B
license: cc-by-4.0
---
[OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, which follows the design of the project, [Open Whisper-style Speech Model (OWSM)](https://arxiv.org/abs/2401.16658).
This model is initialized with [OWSM-CTC v3.1](https://huggingface.co/pyf98/owsm_ctc_v3.1_1B) and then fine-tuned on [v3.2 data](https://arxiv.org/abs/2406.09282) for 225k steps.
To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
```
librosa
torch
espnet
espnet_model_zoo
```
**Example usage can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1 |