whisper-large-v3-sk

Slovak fine-tune of openai/whisper-large-v3 by the KInIT team, trained with full-parameter fine-tuning on a curated Slovak speech corpus.

Model Details

Property             Value
Base model           openai/whisper-large-v3
Parameters           ~1.55B
Architecture         Whisper encoder-decoder
Fine-tuning method   Full fine-tuning
Language             Slovak (sk)
Task                 Automatic Speech Recognition
License              MIT

Intended Use

This model is intended for Slovak automatic speech recognition across a range of domains and recording conditions.

Out-of-scope: Non-Slovak audio, real-time streaming without appropriate chunking, safety-critical transcription without human review.

Training Data

Fine-tuned on an internal curated Slovak speech corpus compiled at KInIT. The corpus combines public datasets with internal KInIT recordings. Recordings containing personal data were anonymised prior to use. Samples were quality-filtered using a CER-based threshold validated against multiple ASR models.
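The CER-based quality filter described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used at KInIT: the helper names, the median aggregation across ASR models, and the 0.2 threshold are all assumptions, since the card does not state how the threshold was chosen or how hypotheses from several models were combined.

```python
from statistics import median


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def keep_sample(reference, hypotheses, max_cer=0.2):
    """Keep a sample when the median CER over several ASR hypotheses is low.

    ASSUMPTION: both the median aggregation and the 0.2 threshold are
    hypothetical values chosen for illustration only.
    """
    scores = [cer(reference, hyp) for hyp in hypotheses]
    return median(scores) <= max_cer


# Hypothetical transcripts of one recording from two different ASR models.
print(keep_sample("dobrý deň", ["dobrý deň", "dobry den"]))
```

A sample whose reference transcript disagrees strongly with every model is more likely to be mislabelled or badly aligned, which is the intuition behind filtering on CER rather than on audio properties alone.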

Data sources
SloPalSpeech
Municipal council session recordings
Read literature
Mozilla Common Voice
TEDxSK and JumpSK Lecture Speech Corpus
FLEURS read speech
Internal KInIT recordings

Training Procedure

Hyperparameter         Value
Epochs                 2
Learning rate          5e-5
LR scheduler           Linear with warmup
Optimizer              AdamW
Effective batch size   64
Precision              fp16
Framework              Hugging Face Transformers (Seq2SeqTrainer)

Training was performed on the Devana HPC cluster.
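To make the scheduler and batch-size entries concrete, the sketch below computes a linear-warmup, linear-decay learning rate and an effective batch size in plain Python. Only the peak learning rate (5e-5) and the effective batch size (64) come from the table; the number of warmup steps, the total step count, and the per-device / accumulation / device split are hypothetical, as the card does not report them.

```python
PEAK_LR = 5e-5          # from the hyperparameter table
EFFECTIVE_BATCH = 64    # from the hyperparameter table


def linear_warmup_lr(step, total_steps, warmup_steps, peak_lr=PEAK_LR):
    """Learning rate at `step`: linear warmup to the peak, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


def effective_batch_size(per_device, grad_accum_steps, num_devices):
    """Number of samples contributing to a single optimizer update."""
    return per_device * grad_accum_steps * num_devices


# ASSUMPTION: one hypothetical way to reach 64 (8 per GPU, 2 accumulation steps, 4 GPUs).
print(effective_batch_size(per_device=8, grad_accum_steps=2, num_devices=4))

# ASSUMPTION: a hypothetical run of 10,000 optimizer steps with 500 warmup steps.
for step in (0, 250, 500, 5250, 10_000):
    print(step, linear_warmup_lr(step, total_steps=10_000, warmup_steps=500))
```

Any split whose product is 64 gives the same effective batch size, so the choice is typically driven by GPU memory rather than by the optimisation itself.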

Evaluation

Model                             Common Voice 24 SK (test)   Internal eval
                                  WER ↓        CER ↓          WER ↓     CER ↓
openai/whisper-large-v3           TBD          TBD            TBD       TBD
kinit/whisper-large-v3-sk         TBD          TBD            TBD       TBD
openai/whisper-large-v3-turbo     TBD          TBD            TBD       TBD
kinit/whisper-large-v3-turbo-sk   TBD          TBD            TBD       TBD
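For readers who want to reproduce figures of this kind on their own data, here is a small sketch of corpus-level WER and CER. It assumes plain whitespace tokenisation and no text normalisation; the card does not say which normalisation (casing, punctuation) the reported numbers will use, and that choice can shift the results noticeably. The function names are illustrative, not part of any library.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens or characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level error rate: total edits divided by total reference units.

    unit="word" gives WER (whitespace tokens); unit="char" gives CER.
    """
    if unit not in ("word", "char"):
        raise ValueError("unit must be 'word' or 'char'")
    split = str.split if unit == "word" else list
    edits = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units, hyp_units = split(ref), split(hyp)
        edits += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return edits / total if total else 0.0


# Hypothetical reference transcripts and model outputs.
refs = ["dobrý deň všetkým", "ako sa máte"]
hyps = ["dobrý den všetkým", "ako sa mate"]
print("WER:", error_rate(refs, hyps, unit="word"))
print("CER:", error_rate(refs, hyps, unit="char"))
```

Aggregating edits over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the score.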

Usage

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "kinit/whisper-large-v3-sk"

# Use half precision on GPU; fall back to float32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# `dtype` is the argument name in recent Transformers releases;
# older releases expect `torch_dtype` instead.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    dtype=dtype,
    use_safetensors=True,
).to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    dtype=dtype,
    device=device,
)

# Force Slovak decoding instead of relying on language auto-detection.
result = pipe("audio.wav", generate_kwargs={"language": "slovak"})
print(result["text"])

Limitations

  • Catastrophic forgetting: Fine-tuning exclusively on Slovak data significantly degrades performance on other languages. Use the base openai/whisper-large-v3 if multilingual transcription is required.
  • Performance may degrade on strongly accented, dialectal, or domain-specific speech not represented in the training data.
  • Whisper processes audio in 30-second windows, so recordings longer than 30 seconds must be chunked (or decoded in long-form mode) to be transcribed reliably.
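As an illustration of the 30-second limit, the sketch below splits a long recording into overlapping windows that each fit the model's input length. The 16 kHz sample rate matches Whisper's expected input; the 5-second overlap and the function name are assumptions made for this example, not settings recommended by the authors.

```python
def chunk_bounds(num_samples, sample_rate=16_000, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows covering the audio.

    ASSUMPTION: the 5-second default overlap is a hypothetical value.
    """
    if overlap_s >= chunk_s:
        raise ValueError("overlap_s must be smaller than chunk_s")
    size = int(chunk_s * sample_rate)
    step = size - int(overlap_s * sample_rate)
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + size, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the recording
        start += step
    return bounds


# A 70-second recording sampled at 16 kHz.
for start, end in chunk_bounds(70 * 16_000):
    print(f"{start / 16_000:.0f}s - {end / 16_000:.0f}s")
```

In practice the Transformers ASR pipeline can do this internally: passing `chunk_length_s=30` when calling the pipeline enables chunked inference, so a manual splitter like this is mainly useful for custom pre-processing or for understanding what the option does.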

Acknowledgements

Public datasets used in training: SloPalSpeech, Mozilla Common Voice, FLEURS, and the TEDxSK and JumpSK Lecture Speech Corpus (KEMT NLP).

Part of the research results was obtained using the computational resources procured in the national project National competence centre for high performance computing (project code: 311070AKF2), funded by the European Regional Development Fund, EU Structural Funds Informatization of Society, Operational Program Integrated Infrastructure.
