---
library_name: transformers
datasets:
- classla/Mici_Princ
language:
- hr
license: cc-by-sa-4.0
pipeline_tag: automatic-speech-recognition
base_model: openai/whisper-large-v3
widget:
- example_title: example 1
src: >-
https://huggingface.co/classla/whisper-large-v3-mici-princ/blob/main/MP_13_65.37-74.67.mp3
- example_title: example 2
src: >-
https://huggingface.co/classla/whisper-large-v3-mici-princ/blob/main/MP_15_201.53-210.02.mp3
- example_title: example 3
src: >-
https://huggingface.co/classla/whisper-large-v3-mici-princ/blob/main/MP_15_60.527-67.71.mp3
- example_title: example 4
src: >-
https://huggingface.co/classla/whisper-large-v3-mici-princ/blob/main/MP_15_68.5-72.45.mp3
metrics:
- wer
- cer
---
# Model Card for whisper-large-v3-mici-princ
This model was finetuned on the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ),
the audiobook of the translation of _Le Petit Prince_ into the Chakavian dialect of Croatian.
## Model Details
### Model Description
Whisper large-v3, finetuned for automatic speech recognition of Chakavian Croatian on the audiobook recording of _Mići Princ_, the Chakavian translation of _Le Petit Prince_.
- **Developed by:** Nikola Ljubešić, Peter Rupnik, Tea Perinčić
- **Model type:** Sequence-to-sequence speech recognition (Whisper)
- **Language(s) (NLP):** Croatian (hrv) - Chakavian dialect (ckm)
- **License:** Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
- **Finetuned from model:** openai/whisper-large-v3
### Model Sources
- **Repository:** [GitHub](https://github.com/5roop/mici_princ_whisper)
- **Paper:** Coming soon
- **Dataset:** [Mići Princ](https://huggingface.co/datasets/classla/Mici_Princ)
## Example Use
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.pipelines.pt_utils import KeyDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "classla/whisper-large-v3-mici-princ"

# Load the finetuned model and its processor from the Hub
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Load the test split of the Mići Princ dataset
ds = load_dataset("classla/Mici_Princ", split="test")

# Build an ASR pipeline that chunks long audio into 30 s windows
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    device=device,
)

# Transcribe the whole split; Whisper has no Chakavian language tag,
# so the standard Croatian tag is used
result = pipe(
    KeyDataset(ds, "audio"),
    generate_kwargs={"language": "croatian"},
)
for i in result:
    print(i)
# Output:
# {'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.', 'chunks': [{'timestamp': (0.0, 7.18), 'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.'}]}
# ...
```
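The same pipeline also accepts a path to a local audio file (decoded via ffmpeg). A minimal sketch, assuming a hypothetical file `my_recording.mp3` next to the script:
```python
# "my_recording.mp3" is a placeholder; any ffmpeg-readable format works
out = pipe("my_recording.mp3", generate_kwargs={"language": "croatian"})
print(out["text"])
```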
## Training Details
### Preprocessing
The model was trained on the `normalized_text` attribute of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ). The training data
therefore retained capitalization and punctuation, except for bullet points, newlines, and quotation marks, which were removed. Special characters
that occur in the dialect but not in standard Croatian were substituted.
Only the `train` split was used in training.
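For illustration, a minimal sketch of this kind of normalization. The exact character substitutions are defined by the dataset and not documented here, so the mapping below is a hypothetical example, not the one used for training:
```python
import re

# Hypothetical substitution table; the actual dialect-to-standard mapping
# used for `normalized_text` lives in the dataset, not here.
SUBSTITUTIONS = {"ś": "š", "ź": "ž"}

def normalize(text: str) -> str:
    # Drop bullet points, newlines, and quotation marks
    text = re.sub(r'[•\n"»«„“]', " ", text)
    # Replace dialect-specific characters absent from standard Croatian
    for src, tgt in SUBSTITUTIONS.items():
        text = text.replace(src, tgt)
    # Collapse whitespace introduced by the removals
    return re.sub(r"\s+", " ", text).strip()

print(normalize("• Šesti planet\n„vele knjige“"))
```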
### Training Hyperparameters
```
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=1e-5,
warmup_steps=100,
max_steps=309 * 10,
gradient_checkpointing=True,
predict_with_generate=True,
generation_max_length=225,
save_steps=309,
```
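These names match the standard 🤗 `Seq2SeqTrainingArguments` fields, so a sketch of how they would be passed to the trainer follows; the output directory name and any arguments not listed above are assumptions:
```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical output_dir; hyperparameters as listed above.
# save_steps=309 and max_steps=309 * 10 suggest ~10 epochs with one
# checkpoint per epoch (an inference, not documented in the card).
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-mici-princ",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    warmup_steps=100,
    max_steps=309 * 10,
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=309,
)
```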
## Evaluation
For evaluation, the `test` split of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ) was used.
### Metrics
* WER: 0.16248
* CER: 0.04422
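A minimal sketch of how such scores can be reproduced with the 🤗 `evaluate` library, reusing `pipe` and `ds` from the example above; whether the original evaluation applied any additional text normalization is not specified here:
```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Transcribe the test split and score against the normalized reference text
predictions = [
    out["text"].strip()
    for out in pipe(KeyDataset(ds, "audio"), generate_kwargs={"language": "croatian"})
]
references = ds["normalized_text"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```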
## Citation
Coming soon.
## Model Card Authors
Peter Rupnik
## Model Card Contact
[https://huggingface.co/5roop](https://huggingface.co/5roop) |