
Surt v3 — Whisper-small fine-tune for Gurbani (Sehaj Path + Kirtan)

Surt v3 is a fine-tuned openai/whisper-small for automatic speech recognition of Gurbani in Gurmukhi script — covering both sehaj-path (calm recitation) and kirtan (sung/musical) traditions. Trained on ~660h of clean, canonicalized Gurbani audio.

This is the final-step (step 12000) model. For the step-11000 best-sehaj-WER checkpoint, see surindersinghssj/surt-small-v3-training.

Headline results

Evaluated on the matched canonical eval sets:

| Domain | Dataset | WER (%) | CER (%) |
|---|---|---|---|
| Sehaj | gurbani-sehajpath-yt-captions-eval-canonical | 16.31 | 5.25 |
| Kirtan | gurbani-kirtan-yt-captions-eval-canonical | 54.80 | 28.00 |

The best checkpoint (step 11000) reaches sehaj WER 15.84 / CER 5.15 and is available in the training repo, surindersinghssj/surt-small-v3-training.

Highlights

  • Cold start from base openai/whisper-small — no warm-start, no inherited v1/v2 bias
  • Sehaj WER 16.31% — major improvement over Surt v2's ~24% (data-leak-adjusted)
  • Kirtan WER 54.80%: comparable to v2's ~55%, but v3 was trained on 15× more kirtan and reaches a lower CER (28.00; v2's CER was higher)
  • 660h of canonicalized Gurbani (sehaj + kirtan), all labels aligned to Sri Guru Granth Sahib Ji
  • Gurmukhi output (ਗੁਰਮੁਖੀ) — ॥ verse markers stripped from labels during training
  • Language tag: Punjabi (pa), task=transcribe

Quick start

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="surindersinghssj/surt-small-v3",
    chunk_length_s=30,
)
result = pipe("path/to/audio.wav", generate_kwargs={"language": "punjabi", "task": "transcribe"})
print(result["text"])

Or with direct model / processor access:

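import librosa  # assumption: any loader that returns 16 kHz mono float audio works
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("surindersinghssj/surt-small-v3", language="punjabi", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("surindersinghssj/surt-small-v3")
model.generation_config.language = "punjabi"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None  # use the language/task set above, not forced prompt ids

# Transcribe a single clip of at most 30 seconds.
audio, _ = librosa.load("path/to/audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])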

Live demo: Gradio Space.

Training data

| Source | Repo | Hours | Role |
|---|---|---|---|
| New sehaj (publicly available recordings with aligned transcripts) | gurbani-sehajpath-yt-captions-canonical | ~160h | Primary sehaj stream |
| Old sehaj (studio) | gurbani-sehajpath | ~66h | Extra sehaj stream |
| Kirtan (publicly available recordings with aligned transcripts) | gurbani-kirtan-yt-captions-300h-canonical | ~420h | Aux kirtan stream |

Training mix: ~220h sehaj (36%) + ~420h kirtan (64%), obtained by oversampling the aux (kirtan) stream at AUX_TRAIN_PROBABILITY=0.64. The text column is normalized to final_text on all canonical sources; the old sehaj set's gurmukhi_text column is renamed to match at load.
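The sampling mix can be pictured as one weighted coin flip per training example. The sketch below is an illustration only, assuming AUX_TRAIN_PROBABILITY is the chance that the next example comes from the kirtan (aux) stream; `mix_streams` and the toy datasets are hypothetical names, not taken from the training code.

```python
import itertools
import random

AUX_TRAIN_PROBABILITY = 0.64  # value quoted in the training mix


def mix_streams(primary, aux, aux_probability, seed=0):
    """Yield examples forever, drawing from `aux` with probability `aux_probability`.

    Hypothetical helper: both inputs are cycled so neither stream runs dry.
    """
    rng = random.Random(seed)
    primary_iter = itertools.cycle(primary)
    aux_iter = itertools.cycle(aux)
    while True:
        if rng.random() < aux_probability:
            yield next(aux_iter)
        else:
            yield next(primary_iter)


# Toy stand-ins for the sehaj and kirtan datasets.
sehaj = [("sehaj", i) for i in range(5)]
kirtan = [("kirtan", i) for i in range(5)]

mixed = mix_streams(sehaj, kirtan, AUX_TRAIN_PROBABILITY)
sample = list(itertools.islice(mixed, 10_000))
kirtan_share = sum(1 for domain, _ in sample if domain == "kirtan") / len(sample)
print(f"kirtan share: {kirtan_share:.2f}")
```

With 🤗 Datasets, `interleave_datasets` with a `probabilities` argument offers a comparable mechanism; whether the training script used it is not stated here.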

Evaluation

Sehaj trajectory

| Step | WER (%) | CER (%) |
|---|---|---|
| 500 | 27.57 | 8.30 |
| 2000 | 26.17 | 8.20 |
| 5000 | 24.13 | 7.57 |
| 7500 | 16.97 | 5.54 |
| 9000 | 16.17 | 5.17 |
| 11000 | 15.84 | 5.15 |
| 12000 (final) | 16.31 | 5.25 |

Kirtan (on the matched gurbani-kirtan-yt-captions-eval-canonical set, train split, 573 rows)

| Step | WER (%) | CER (%) |
|---|---|---|
| 7500 | 57.36 | 29.66 |
| 9000 | 54.92 | 28.27 |
| 12000 (final) | 54.80 | 28.00 |
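For readers unfamiliar with the metrics, WER and CER are both edit distances normalized by reference length, counted over words and characters respectively. The following is a minimal pure-Python sketch of that definition, not the evaluation code behind the numbers above; the function names are illustrative, and characters are counted as Unicode code points, which is an assumption about how CER is tokenized.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra token in hypothesis)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference word count."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "ਸਤਿ ਨਾਮੁ ਕਰਤਾ ਪੁਰਖੁ"
hypothesis = "ਸਤਿ ਨਾਮ ਕਰਤਾ ਪੁਰਖੁ"  # one word differs by a single vowel sign
print(f"WER {wer(reference, hypothesis):.2f}  CER {cer(reference, hypothesis):.2f}")
```

Because insertions count as errors, WER is not capped at 100%: a hypothesis with many more words than the reference can score above it.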

Note on the "pure" kirtan eval: an earlier eval used during training (gurbani-kirtan-eval-pure-canonical, eval split) showed inflated kirtan WER in the 119–135% range. That was a label-format mismatch — the "pure" eval references include extra markup the model was trained to strip. The numbers above use the matched canonical eval set.
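One way to guard against this kind of mismatch is to pass references and hypotheses through the same normalizer before scoring. The sketch below is hypothetical: this card only names the ॥ verse marker, so the single danda and the Gurmukhi digits (commonly used as verse numbers) are assumptions added for illustration, and `strip_verse_markup` is not a function from the training code.

```python
import re

# Assumed markup set: danda (U+0964), double danda (U+0965) and
# Gurmukhi digits (U+0A66 to U+0A6F). Only the double danda is named in the card.
_MARKUP = re.compile("[\u0964\u0965\u0A66-\u0A6F]")
_SPACES = re.compile(r"\s+")


def strip_verse_markup(text):
    """Remove the assumed verse markup and collapse runs of whitespace."""
    return _SPACES.sub(" ", _MARKUP.sub(" ", text)).strip()


print(strip_verse_markup("ਸਤਿ ਨਾਮੁ ॥੧॥ ਕਰਤਾ ਪੁਰਖੁ ॥"))
```

Applying the same function to both sides keeps the comparison symmetric, whatever the exact markup turns out to be.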

Training procedure

  • Framework: 🤗 Transformers Seq2SeqTrainer (custom SurtTrainer with discriminative LR)
  • Precision: bf16
  • Attention: SDPA, or Flash Attention 2 when installed
  • Hardware: 1× NVIDIA A40 (48 GB VRAM)
  • Wall-clock: ~6h 29m
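Discriminative LR usually means giving different parts of the network their own optimizer parameter group. The sketch below shows one plausible way to do this by parameter name, using the encoder and decoder learning rates from the hyperparameters; it is not the actual SurtTrainer implementation, and `build_param_groups` is a hypothetical helper.

```python
ENCODER_LR = 5e-5  # encoder learning rate from the hyperparameters
DECODER_LR = 3e-5  # decoder learning rate from the hyperparameters


def build_param_groups(named_parameters, encoder_lr=ENCODER_LR, decoder_lr=DECODER_LR):
    """Split (name, parameter) pairs into encoder and non-encoder optimizer groups.

    Assumption: encoder parameters have ".encoder." in their name (or start with
    "encoder."). Decoder cross-attention names such as "...encoder_attn..." do not
    match, so they stay in the decoder group.
    """
    encoder_params, other_params = [], []
    for name, param in named_parameters:
        if ".encoder." in name or name.startswith("encoder."):
            encoder_params.append(param)
        else:
            other_params.append(param)
    return [
        {"params": encoder_params, "lr": encoder_lr},
        {"params": other_params, "lr": decoder_lr},
    ]


# Toy example with strings standing in for tensors.
groups = build_param_groups([
    ("model.encoder.conv1.weight", "enc_w"),
    ("model.decoder.layers.0.encoder_attn.k_proj.weight", "dec_w"),
])
print([(len(g["params"]), g["lr"]) for g in groups])
```

The returned list has the shape PyTorch optimizers accept as parameter groups, so it could be passed to `torch.optim.AdamW` together with the weight decay value.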

Hyperparameters

| Knob | Value |
|---|---|
| max_steps | 12000 |
| Per-device batch | 32 |
| Grad accumulation | 2 |
| Effective batch | 64 |
| Encoder LR | 5e-5 |
| Decoder LR | 3e-5 |
| LR scheduler | cosine |
| Warmup steps | 900 (~7.5%) |
| Weight decay | 0.01 |
| Generation max length | 448 tokens |
| Label-length filter | drop rows with >448 tokens |
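To make the schedule concrete, the sketch below computes the learning rate at a given step for a linear warmup followed by cosine decay, and checks the derived values in the table (effective batch and warmup fraction). It assumes the decay reaches zero at max_steps, which this card does not state; `lr_at` is an illustrative name.

```python
import math

MAX_STEPS = 12_000
WARMUP_STEPS = 900
PER_DEVICE_BATCH = 32
GRAD_ACCUMULATION = 2

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUMULATION  # 32 * 2 = 64
warmup_fraction = WARMUP_STEPS / MAX_STEPS              # 900 / 12000 = 0.075


def lr_at(step, peak_lr, warmup_steps=WARMUP_STEPS, max_steps=MAX_STEPS):
    """Linear warmup to `peak_lr`, then cosine decay to zero at `max_steps`."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 450, 900, 6_450, 12_000):
    print(step, f"encoder={lr_at(step, 5e-5):.2e}", f"decoder={lr_at(step, 3e-5):.2e}")
```

Both learning rates follow the same curve and differ only in their peak, so the encoder-to-decoder ratio stays constant throughout training under this assumption.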

Data augmentation (raw waveform, pre-feature-extraction)

  • Gaussian noise @ p=0.4
  • Room reverb @ p=0.3
  • Time stretch 0.9–1.1 @ p=0.1
  • No pitch shift (kirtan is tonal — pitch distortion corrupts raga tonal center)
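The probabilities above can be read as independent per-clip decisions. The sketch below illustrates that reading only: it assumes each augmentation is drawn independently, implements just the Gaussian-noise step, and uses a made-up noise level (`noise_std`), since this card does not give noise amplitudes, reverb settings, or the library used.

```python
import random

AUGMENT_PROBS = {"gaussian_noise": 0.4, "room_reverb": 0.3, "time_stretch": 0.1}
STRETCH_RANGE = (0.9, 1.1)


def sample_augmentation_plan(rng):
    """Decide independently, per clip, which augmentations to apply."""
    plan = {name: rng.random() < p for name, p in AUGMENT_PROBS.items()}
    # A stretch rate is only drawn when time stretch was selected.
    plan["stretch_rate"] = rng.uniform(*STRETCH_RANGE) if plan["time_stretch"] else 1.0
    return plan


def add_gaussian_noise(samples, rng, noise_std=0.005):
    """Add zero-mean Gaussian noise; `noise_std` is an assumed value."""
    return [s + rng.gauss(0.0, noise_std) for s in samples]


rng = random.Random(0)
plans = [sample_augmentation_plan(rng) for _ in range(10_000)]
rates = {name: sum(p[name] for p in plans) / len(plans) for name in AUGMENT_PROBS}
for name, rate in rates.items():
    print(f"{name}: applied to {rate:.1%} of clips")

noisy = add_gaussian_noise([0.0] * 16, rng)
```

Under the independence assumption, roughly 0.6 × 0.7 × 0.9 ≈ 38% of clips would pass through with no augmentation at all.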

Intended use

  • Transcription of sehaj-path recitation recordings (best-performing domain, WER 16.31%)
  • Transcription of Gurbani kirtan audio (WER 54.80%, CER 28.00% — sung/musical context)
  • Input audio: 16 kHz mono, ≤30 seconds per chunk (Whisper's native window; auto-chunked in the Gradio demo)
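For recordings longer than 30 seconds, the audio has to be split before it reaches the model. The sketch below shows the arithmetic with a plain non-overlapping split; it is a simplification, not how the Gradio demo or the Transformers pipeline chunk audio, and `split_into_chunks` is a hypothetical helper.

```python
SAMPLE_RATE = 16_000                          # Hz, from the input spec
CHUNK_SECONDS = 30                            # Whisper's native window
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per chunk


def split_into_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Split a mono sample sequence into consecutive chunks of at most `chunk_samples`."""
    return [samples[i:i + chunk_samples] for i in range(0, len(samples), chunk_samples)]


# 70 seconds of silence as a stand-in for real audio.
audio = [0.0] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(audio)
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 10.0]
```

Hard cuts like these can land in the middle of a word. The pipeline in Quick start, called with chunk_length_s=30, handles long audio with overlapping windows instead, which is generally the safer choice.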

Out-of-scope / limitations

  • Not trained on non-Gurbani Punjabi speech — expect degradation on everyday Punjabi conversation
  • Not trained on English / Hindi / other languages — base Whisper multilingual ability is mostly lost after full fine-tune
  • Speaker / mic / recording bias: training data skews toward publicly available web-sourced recordings with aligned transcripts, so live Darbar Sahib PA reverb or noisy field recordings may yield higher WER
  • Katha (spoken commentary) is not in the training distribution — quality will vary

How this compares to prior Surt versions

| Version | Base | Training data | Sehaj WER | Kirtan WER |
|---|---|---|---|---|
| Surt v1 | whisper-small | 66h sehaj only | ~24% (leak-inflated; true ~40–50%) | n/a (hallucinates) |
| Surt v2 | Surt v1 | v1 + 28h noisy kirtan v2 | ~regressed | ~55% (28h kirtan) |
| Surt v3 | whisper-small (cold start) | 660h canonical mixed | 16.31% (final) / 15.84% (best) | 54.80% (final) |

Surt v3 matches or slightly improves on v2's kirtan WER (54.80% vs ~55%) while training on 15× more kirtan data, and its sehaj WER is substantially lower. Because it starts cold from openai/whisper-small, it does not inherit the data-leak concerns of the earlier versions.

License

Apache 2.0, inheriting from openai/whisper-small.

Citation

@misc{surt-v3-2026,
  title        = {Surt v3: Whisper-small fine-tune for Gurbani ASR},
  author       = {Singh, Surinder},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/surindersinghssj/surt-small-v3}}
}