---
language:
- hr
license: cc-by-sa-4.0
library_name: transformers
base_model: openai/whisper-large-v3
datasets:
- classla/Mici_Princ
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
widget:
- example_title: example 1
src: >-
https://huggingface.co/classla/whisper-large-v3-mici-princ/raw/main/MP_13_65.37-74.67.mp3.wav
- example_title: example 2
src: >-
https://huggingface.co/classla/whisper-large-v3-mici-princ/raw/main/MP_15_201.53-210.02.mp3.wav
- example_title: example 3
src: >-
https://huggingface.co/classla/whisper-large-v3-mici-princ/raw/main/MP_15_60.527-67.71.mp3.wav
- example_title: example 4
src: >-
https://huggingface.co/classla/whisper-large-v3-mici-princ/raw/main/MP_15_68.5-72.45.mp3.wav
---

# Model Card for classla/whisper-large-v3-mici-princ
This model was finetuned on the Mići Princ dataset, the audiobook of the translation of Le Petit Prince into the Chakavian dialect of Croatian.
## Model Details

### Model Description
The model was finetuned for 80 epochs with an effective batch size of 16. Performance was inspected every 4 epochs, and the latest checkpoint is uploaded here.
- Developed by: Nikola Ljubešić, Peter Rupnik, Tea Perinčić
- Language(s) (NLP): Croatian (hrv) - Chakavian dialect (ckm)
- License: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
- Finetuned from model: openai/whisper-large-v3
### Model Sources

- Repository: GitHub
- Paper: Coming soon
- Dataset: [Mići Princ](https://huggingface.co/datasets/classla/Mici_Princ)
## Example use

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.pipelines.pt_utils import KeyDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "classla/whisper-large-v3-mici-princ"

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Load the test split of the Mići Princ dataset
ds = load_dataset("classla/Mici_Princ", split="test")

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    device=device,
)

# Transcribe the whole split; Whisper is told to decode in Croatian
result = pipe(
    KeyDataset(ds, "audio"),
    generate_kwargs={"language": "croatian"},
)

for i in result:
    print(i)

# Output:
# {'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.', 'chunks': [{'timestamp': (0.0, 7.18), 'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.'}]}
# ...
```
## Training Details

### Preprocessing

The model was trained on the `normalized_text` attribute of the Mići Princ dataset. This means the data kept capital letters and punctuation, except for bullet points, newlines, and quotation marks, which were removed. Special characters that are present in the dialect but not in standard Croatian were substituted. Only the `train` split was used in training.
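For illustration, here is a minimal sketch of this kind of normalization. The substitution table below is hypothetical; the actual character mappings are specific to the Mići Princ dataset:

```python
# Hypothetical substitution table: the real dialect-to-standard mappings
# used for the dataset are not reproduced here.
SUBSTITUTIONS = {
    "ȁ": "a",
    "ȅ": "e",
}

def normalize(text: str) -> str:
    # Drop bullet points, newlines, and quotation marks
    for ch in ["•", "\n", '"', "“", "”", "'", "‘", "’"]:
        text = text.replace(ch, " ")
    # Substitute dialect-specific characters not found in standard Croatian
    for src, tgt in SUBSTITUTIONS.items():
        text = text.replace(src, tgt)
    # Collapse the whitespace left over from the removals
    return " ".join(text.split())
```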
### Training Hyperparameters

```python
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=1e-5,
warmup_steps=100,
max_steps=277 * 80,
gradient_checkpointing=True,
predict_with_generate=True,
generation_max_length=225,
save_steps=277,
```
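These keyword arguments correspond to the Hugging Face `Seq2SeqTrainingArguments` API. A minimal sketch of how they would be wired up, with `output_dir` as a placeholder and all unlisted arguments left at their defaults:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-mici-princ",  # placeholder path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 4 * 4 = 16
    learning_rate=1e-5,
    warmup_steps=100,
    max_steps=277 * 80,             # 277 steps per epoch for 80 epochs
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=277,                 # save a checkpoint after every epoch
)
```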
## Evaluation

For evaluation, the `test` split of the Mići Princ dataset was used. It consists of two speakers that also appear in the training data, Autor and Mići Princ, and two speakers unseen during training, Geograf and Dilavac. Each speaker uses a different micro-dialect, which makes the test set challenging: it includes two micro-dialects the model never saw in training.
### Metrics

- WER: 0.168341
- CER: 0.039493
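A minimal sketch of how WER and CER could be recomputed with the Hugging Face `evaluate` library, reusing `pipe` and `ds` from the example above and assuming the reference transcripts are read from the dataset's `normalized_text` column:

```python
import evaluate
from transformers.pipelines.pt_utils import KeyDataset

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Transcribe the test split with the pipeline defined in the example above
predictions = [
    out["text"].strip()
    for out in pipe(KeyDataset(ds, "audio"), generate_kwargs={"language": "croatian"})
]
references = ds["normalized_text"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```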
## Citation

Coming soon.

## Model Card Authors

Peter Rupnik