---
tags:
- espnet
- audio
- automatic-speech-recognition
- speech-translation
- language-identification
language: multilingual
datasets:
- owsm_v3.2_ctc
license: cc-by-4.0
---

[OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.

It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the [Open Whisper-style Speech Model (OWSM)](https://arxiv.org/abs/2401.16658) project.

This model is initialized with [OWSM-CTC v3.1](https://huggingface.co/pyf98/owsm_ctc_v3.1_1B) and then fine-tuned on [v3.2 data](https://arxiv.org/abs/2406.09282) for 225k steps.

To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:

```
librosa
torch
espnet
espnet_model_zoo
```

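For example, these dependencies can be installed with `pip` (a minimal sketch; no version pins are given above, so adjust them to your environment):

```bash
pip install librosa torch espnet espnet_model_zoo
```
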
We use FlashAttention during training, but it is not required for inference. If you want to install it, run:

```bash
pip install flash-attn --no-build-isolation
```

**Example usage can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
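
Below is a minimal inference sketch using the `Speech2TextGreedySearch` interface from ESPnet's `s2t_inference_ctc` module. The Hugging Face model ID, special tokens, and decoding parameters shown here are illustrative assumptions; consult the recipe above for the exact values:

```python
# A hedged sketch of long-form ASR with OWSM-CTC; the model ID below
# ("espnet/owsm_ctc_v3.2_ft_1B") and parameter values are assumptions,
# not confirmed by this card. See the ESPnet recipe for exact usage.
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",   # assumed Hugging Face model ID
    device="cuda",                   # use "cpu" if no GPU is available
    generate_interctc_outputs=False,
    lang_sym="<eng>",                # source language token
    task_sym="<asr>",                # <asr> for recognition; ST uses a translation task token
)

# batch_decode chunks a long recording, decodes the chunks in batches,
# and concatenates the greedy CTC hypotheses into a single transcript.
text = s2t.batch_decode(
    "audio.wav",              # path to an audio file or a 1-D array
    batch_size=16,
    context_len_in_secs=4,    # context around each chunk, in seconds
)
print(text)
```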