whisper-large-v3-tulpar

A fine-tuned Whisper Large V3 model optimized for Kazakh (қазақ тілі) speech recognition.

"Жігіт - ісімен, ат - тұлпарымен." (A man is known by his deeds, a horse — by its spirit.)

We got tired of Whisper confusing Kazakh with Turkish and hallucinating random text, so we fixed it. 900+ hours of Kazakh speech, 4x H200 GPUs, and a lot of tea later — here we are.

What's this?

Full fine-tune of all 1.55B parameters of openai/whisper-large-v3 on ~900 hours of Kazakh speech data. Not LoRA, not adapters — the whole thing, because we had the GPU memory and the ambition.

Current WER: 10.62% on ISSAI KSC test set (down from ~20%+ with vanilla Whisper).

This is v2. We're not done yet.

Quick Start

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load
processor = WhisperProcessor.from_pretrained("olzhasAl/whisper-large-v3-turbo-kk")
model = WhisperForConditionalGeneration.from_pretrained("olzhasAl/whisper-large-v3-turbo-kk")
model.eval()

# Transcribe
audio, sr = librosa.load("kazakh_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
forced_ids = processor.get_decoder_prompt_ids(language="kazakh", task="transcribe")

generated = model.generate(
    inputs.input_features,
    forced_decoder_ids=forced_ids,
    max_new_tokens=256,
    num_beams=5,
    no_repeat_ngram_size=4,
)

text = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(text)

Training Details

Parameter Value
Base model openai/whisper-large-v3
Parameters 1.55B (full fine-tune)
Training data ~900 hours Kazakh speech
GPUs 4x NVIDIA H200 (143GB HBM3e each)
Precision BF16 mixed precision
Optimizer AdamW (lr=1e-5, cosine schedule)
Batch size 256 effective (32/GPU × 4 GPUs × 2 accum)
Epochs 1

Data Sources

Dataset Hours Type
ISSAI KSC ~335h Studio recordings
farabi-lab ~554h Diverse speech
FLEURS kk_kz ~12h Standard phrases
YouTube (curated) ~57h News, interviews, lectures

Total: ~957 hours of verified Kazakh speech.

Preprocessing

  • Audio: 16kHz mono WAV
  • VAD: Silero VAD (float32, threshold=0.5)
  • Segments: 5-30 seconds
  • Language filtering: Kazakh-specific character detection (ә, і, ң, ғ, ү, ұ, қ, ө, һ)

Evaluation

Benchmark WER
ISSAI KSC test 10.62%
FLEURS kk_kz test TBD

Target: < 8% (v3), < 5% (v4+)

Tips & Tricks

For best results:

  • Use num_beams=5 and no_repeat_ngram_size=4 — prevents repetition loops
  • Set language="kazakh" explicitly — auto-detect may confuse KK with Turkish
  • For mixed KK/RU audio: use this model for Kazakh segments + original Whisper for Russian segments

Known quirks:

  • Language detection head is biased toward Kazakh (trained only on KK data)
  • May hallucinate on pure Russian input — don't force language="kazakh" on Russian speech
  • Code-switching (KZ↔RU within one sentence) is partially supported but not perfect yet

What's Next

  • v3: Training on ~3000+ hours (more YouTube data, podcasts)
  • Sub-8% WER target
  • Better code-switching support
  • KenLM integration for post-processing

Citation

@misc{whisper-large-v3-turbo-kk,
  author = {Olzhas Alseitov},
  title = {whisper-large-v3-turbo-kk: Fine-tuned Whisper for Kazakh Speech Recognition},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/olzhasAl/whisper-large-v3-turbo-kk}
}

License

Apache 2.0


Downloads last month
36
Safetensors
Model size
2B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for olzhasAl/whisper-large-v3-tulpar

Finetuned
(861)
this model

Dataset used to train olzhasAl/whisper-large-v3-tulpar

Evaluation results