whisper-large-v3-sk

A Slovak fine-tune of openai/whisper-large-v3 by the KInIT team, trained with full-parameter fine-tuning on a curated Slovak speech corpus.

Model Details

Property	Value
Base model	openai/whisper-large-v3
Parameters	~1.55B
Architecture	Whisper encoder-decoder
Fine-tuning method	Full fine-tuning
Language	Slovak (`sk`)
Task	Automatic Speech Recognition
License	MIT

Intended Use

This model is intended for Slovak automatic speech recognition across a range of domains and recording conditions.

Out-of-scope: Non-Slovak audio, real-time streaming without appropriate chunking, safety-critical transcription without human review.

Training Data

Fine-tuned on an internal curated Slovak speech corpus compiled at KInIT. The corpus combines public datasets with internal KInIT recordings. Recordings containing personal data were anonymised prior to use. Samples were quality-filtered using a CER-based threshold validated against multiple ASR models.

Data sources
SloPalSpeech
Municipal council session recordings
Read literature
Mozilla Common Voice
TEDxSK and JumpSK Lecture Speech Corpus
FLEURS read speech
Internal KInIT recordings
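
The CER-based quality filter described above can be sketched roughly as follows. This is a minimal illustration, not the actual KInIT pipeline: the function names, the 0.2 threshold, and the rule "keep a sample if at least one ASR model reproduces its transcript closely" are all assumptions made for the example.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed per character."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def keep_sample(reference: str, hypotheses: list[str], max_cer: float = 0.2) -> bool:
    """Keep a sample if the best-matching ASR hypothesis is within max_cer.

    Both the threshold and the 'best of several models' rule are assumptions.
    """
    if not hypotheses:
        return False
    return min(cer(reference, h) for h in hypotheses) <= max_cer


# Hypothetical samples: a transcript plus outputs from two ASR models.
samples = [
    {"text": "dobrý deň", "hyps": ["dobrý deň", "dobry den"]},
    {"text": "dobrý deň", "hyps": ["úplne iný text", "niečo iné"]},
]
kept = [s for s in samples if keep_sample(s["text"], s["hyps"])]
```

Comparing against several models, rather than one, reduces the chance that a correct transcript is discarded only because a single recognizer handles that recording poorly.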

Training Procedure

Hyperparameter	Value
Epochs	2
Learning rate	5e-5
LR scheduler	Linear with warmup
Optimizer	AdamW
Effective batch size	64
Precision	fp16
Framework	HuggingFace Transformers `Seq2SeqTrainer`

Training was performed on the Devana HPC cluster.
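
As a rough sketch, the hyperparameters in the table above map onto keyword arguments of `Seq2SeqTrainingArguments` as shown below. The per-device batch size, gradient accumulation steps, device count, and warmup length are not stated in this card; the values here are assumptions chosen only so that the effective batch size comes out at 64. The helper `linear_warmup_lr` is a hypothetical name that illustrates the shape of a linear schedule with warmup.

```python
# Values marked "assumed" are NOT taken from the model card.
PER_DEVICE_BATCH = 8   # assumed
GRAD_ACCUM_STEPS = 2   # assumed
NUM_DEVICES = 4        # assumed
WARMUP_STEPS = 500     # assumed

# Keyword arguments that could be passed to transformers.Seq2SeqTrainingArguments.
training_kwargs = {
    "num_train_epochs": 2,
    "learning_rate": 5e-5,
    "lr_scheduler_type": "linear",
    "optim": "adamw_torch",
    "fp16": True,
    "per_device_train_batch_size": PER_DEVICE_BATCH,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    "warmup_steps": WARMUP_STEPS,
}

# Effective batch size = per-device batch x accumulation steps x devices.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES


def linear_warmup_lr(step, total_steps, warmup_steps=WARMUP_STEPS, peak_lr=5e-5):
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

Any other split of the three factors that multiplies to 64 would give the same effective batch size.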

Evaluation

Model	Common Voice 24 SK (test) WER ↓	Common Voice 24 SK (test) CER ↓	Internal eval WER ↓	Internal eval CER ↓
openai/whisper-large-v3	TBD	TBD	TBD	TBD
kinit/whisper-large-v3-sk	TBD	TBD	TBD	TBD
openai/whisper-large-v3-turbo	TBD	TBD	TBD	TBD
kinit/whisper-large-v3-turbo-sk	TBD	TBD	TBD	TBD

Usage

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "kinit/whisper-large-v3-sk"

# Use the GPU with half precision when available, otherwise CPU with float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the fine-tuned weights and move them to the selected device.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    dtype=dtype,
    use_safetensors=True,
).to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    dtype=dtype,
    device=device,
)

# Set the language explicitly so the model does not have to detect it.
result = pipe("audio.wav", generate_kwargs={"language": "slovak"})
print(result["text"])

Limitations

Catastrophic forgetting: Fine-tuning exclusively on Slovak data significantly degrades performance on other languages. Use the base openai/whisper-large-v3 if multilingual transcription is required.
Performance may degrade on strongly accented, dialectal, or domain-specific speech not represented in the training data.
Maximum reliable single-segment length is 30 seconds without chunking.
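
For recordings longer than the 30-second limit noted above, one common approach is to split the audio into overlapping windows and transcribe each window separately. The sketch below only computes the window boundaries. The function name `chunk_spans` and the 5-second overlap are assumptions for illustration; the 16 kHz sample rate is the rate Whisper models expect.

```python
def chunk_spans(num_samples, sample_rate=16_000, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows over the audio."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap_s must be smaller than chunk_s")
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)  # advance per window
    spans = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        spans.append((start, end))
        if end == num_samples:  # last window reached the end of the audio
            break
        start += step
    return spans


# A hypothetical 70-second recording at 16 kHz.
spans = chunk_spans(70 * 16_000)
```

Each span can be sliced out of the waveform and passed to the model, with the overlapping regions reconciled when the transcripts are merged. The Transformers ASR pipeline also accepts a `chunk_length_s` argument that handles this internally, which may be simpler than chunking by hand.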

Acknowledgements

Public datasets used in training: SloPalSpeech, Mozilla Common Voice, FLEURS, and the TEDxSK and JumpSK Lecture Speech Corpus (KEMT NLP).

Part of the research results was obtained using the computational resources procured in the national project National Competence Centre for High Performance Computing (project code: 311070AKF2), funded by the European Regional Development Fund, EU Structural Funds Informatization of Society, Operational Program Integrated Infrastructure.
