whisper-large-v3-tulpar

A fine-tuned Whisper Large V3 model optimized for Kazakh (қазақ тілі) speech recognition.

"Жігіт - ісімен, ат - тұлпарымен." (A man is known by his deeds, a horse — by its spirit.)

We got tired of Whisper confusing Kazakh with Turkish and hallucinating random text, so we fixed it. 900+ hours of Kazakh speech, 4x H200 GPUs, and a lot of tea later — here we are.

What's this?

Full fine-tune of all 1.55B parameters of openai/whisper-large-v3 on ~900 hours of Kazakh speech data. Not LoRA, not adapters — the whole thing, because we had the GPU memory and the ambition.

Current WER: 10.62% on ISSAI KSC test set (down from ~20%+ with vanilla Whisper).

This is v2. We're not done yet.

Quick Start

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load
processor = WhisperProcessor.from_pretrained("olzhasAl/whisper-large-v3-turbo-kk")
model = WhisperForConditionalGeneration.from_pretrained("olzhasAl/whisper-large-v3-turbo-kk")
model.eval()

# Transcribe
audio, sr = librosa.load("kazakh_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
forced_ids = processor.get_decoder_prompt_ids(language="kazakh", task="transcribe")

generated = model.generate(
    inputs.input_features,
    forced_decoder_ids=forced_ids,
    max_new_tokens=256,
    num_beams=5,
    no_repeat_ngram_size=4,
)

text = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(text)

Training Details

Parameter	Value
Base model	`openai/whisper-large-v3`
Parameters	1.55B (full fine-tune)
Training data	~900 hours Kazakh speech
GPUs	4x NVIDIA H200 (143GB HBM3e each)
Precision	BF16 mixed precision
Optimizer	AdamW (lr=1e-5, cosine schedule)
Batch size	256 effective (32/GPU × 4 GPUs × 2 accum)
Epochs	1

Data Sources

Dataset	Hours	Type
ISSAI KSC	~335h	Studio recordings
farabi-lab	~554h	Diverse speech
FLEURS kk_kz	~12h	Standard phrases
YouTube (curated)	~57h	News, interviews, lectures

Total: ~957 hours of verified Kazakh speech.

Preprocessing

Audio: 16kHz mono WAV
VAD: Silero VAD (float32, threshold=0.5)
Segments: 5-30 seconds
Language filtering: Kazakh-specific character detection (ә, і, ң, ғ, ү, ұ, қ, ө, һ)

Evaluation

Benchmark	WER
ISSAI KSC test	10.62%
FLEURS kk_kz test	TBD

Target: < 8% (v3), < 5% (v4+)

Tips & Tricks

For best results:

Use num_beams=5 and no_repeat_ngram_size=4 — prevents repetition loops
Set language="kazakh" explicitly — auto-detect may confuse KK with Turkish
For mixed KK/RU audio: use this model for Kazakh segments + original Whisper for Russian segments

Known quirks:

Language detection head is biased toward Kazakh (trained only on KK data)
May hallucinate on pure Russian input — don't force language="kazakh" on Russian speech
Code-switching (KZ↔RU within one sentence) is partially supported but not perfect yet

What's Next

v3: Training on ~3000+ hours (more YouTube data, podcasts)
Sub-8% WER target
Better code-switching support
KenLM integration for post-processing

Citation

@misc{whisper-large-v3-turbo-kk,
  author = {Olzhas Alseitov},
  title = {whisper-large-v3-turbo-kk: Fine-tuned Whisper for Kazakh Speech Recognition},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/olzhasAl/whisper-large-v3-turbo-kk}
}

License

Apache 2.0

Downloads last month: 36

Safetensors

Model size

2B params

Tensor type

F32

Model tree for olzhasAl/whisper-large-v3-tulpar

Base model

openai/whisper-large-v3

Finetuned

(861)

this model

Dataset used to train olzhasAl/whisper-large-v3-tulpar

Evaluation results

WER on ISSAI KSC (test)
test set self-reported

10.620