Whisper Medium - Hindi Fine-tuned

Fine-tuned version of openai/whisper-medium on the AI4Bharat Kathbath Hindi speech dataset for automatic speech recognition (ASR) in Hindi.

Model Details

  • Developed by: ShaikhAnis007
  • Model type: Automatic Speech Recognition (ASR)
  • Language: Hindi (hi), Devanagari script
  • License: Apache 2.0
  • Base model: openai/whisper-medium
  • Fine-tuned on: AI4Bharat Kathbath (Hindi subset)
  • Training compute: Google Colab T4 GPU (free tier)

Results

Metric   Base whisper-medium   After fine-tuning   Relative reduction
WER      0.4133 (41.3%)        0.2318 (23.2%)      43.9% ↓
CER      0.2292 (22.9%)        0.0704 (7.0%)       69.3% ↓

Evaluated on 50 examples from the Kathbath valid split (never seen during training).
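
The card does not say which tool or text normalization produced these scores, so the following is only a minimal sketch of the standard definitions: WER and CER as edit distance divided by reference length, plus the relative-reduction arithmetic behind the last column. The function names are illustrative, and CER here counts Unicode code points (spaces included), which for Devanagari is not the same as counting visible characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level word error rate: total word edits / total reference words."""
    edits = sum(edit_distance(r.split(), h.split())
                for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return edits / words


def cer(references, hypotheses):
    """Corpus-level character error rate over Unicode code points."""
    edits = sum(edit_distance(list(r), list(h))
                for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return edits / chars


def relative_reduction(before, after):
    """Error reduction expressed as a fraction of the baseline error."""
    return (before - after) / before


# Figures from the results table above
print(f"WER reduction: {relative_reduction(0.4133, 0.2318):.1%}")  # 43.9%
print(f"CER reduction: {relative_reduction(0.2292, 0.0704):.1%}")  # 69.3%
```

With only 50 evaluation utterances, a handful of long or difficult sentences can move these numbers noticeably, so they are best read as indicative.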

How to Use

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

processor = WhisperProcessor.from_pretrained("ShaikhAnis007/whisper-medium-hindi")
model = WhisperForConditionalGeneration.from_pretrained("ShaikhAnis007/whisper-medium-hindi")

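# Load your audio as a 1-D float array sampled at 16 kHz, mono.
# One common option (assumes librosa is installed; the path is a placeholder):
# import librosa
# audio, _ = librosa.load("path/to/audio.wav", sr=16000, mono=True)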

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        language="hindi",
        task="transcribe"
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
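
The snippet above expects 16 kHz mono input. If no audio library is available, a rough standard-library sketch for preparing a 16-bit PCM WAV file could look like the following. The helper name load_wav_mono_16k is hypothetical, and the linear-interpolation resampling has no anti-aliasing filter, so a dedicated resampler (for example in librosa or torchaudio) is the better choice for real use.

```python
import array
import sys
import wave


def load_wav_mono_16k(path, target_rate=16000):
    """Read a 16-bit PCM WAV file; return (samples, rate) as mono floats in [-1, 1)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wf.getnchannels()
        rate = wf.getframerate()
        pcm = array.array("h", wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian

    # Downmix: average the interleaved channels of each frame and scale.
    frames = len(pcm) // channels
    mono = [sum(pcm[i * channels:(i + 1) * channels]) / (channels * 32768.0)
            for i in range(frames)]

    if rate == target_rate or frames == 0:
        return mono, target_rate

    # Naive linear-interpolation resampling (no anti-aliasing filter).
    out_len = int(frames * target_rate / rate)
    step = rate / target_rate
    out = []
    for n in range(out_len):
        pos = n * step
        i = int(pos)
        frac = pos - i
        nxt = mono[min(i + 1, frames - 1)]
        out.append(mono[i] * (1.0 - frac) + nxt * frac)
    return out, target_rate
```

The returned list can be passed to the processor as `audio`, with the returned rate as `sampling_rate`; converting it to a float32 NumPy array first is the safer option.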

Training Details

Dataset

  • Dataset: AI4Bharat Kathbath (Hindi subset)
  • Training samples: 1,000 examples (~1 hour of audio)
  • Validation samples: 50 examples
  • Speaker diversity: the full Kathbath corpus covers 1,218 unique speakers across 203 districts of India (corpus-wide figures, not specific to this 1,000-example subset)
  • License: CC0 (fully open)

Hyperparameters

Parameter                Value
Learning rate            1e-5
LR scheduler             Linear with warmup
Warmup steps             50
Epochs                   4
Batch size (physical)    1
Gradient accumulation    8 (effective batch = 8)
fp16                     True
gradient_checkpointing   True
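
To make the schedule concrete, here is a small sketch of what these settings imply. It assumes a single GPU, that all 1,000 training examples are seen each epoch, and that training used the Hugging Face Trainer; the dictionary keys mirror transformers.Seq2SeqTrainingArguments field names, but the card does not confirm the exact training script. The linear_warmup_lr helper is illustrative and follows the usual linear warmup, linear decay shape.

```python
import math

# Values from the hyperparameter table; key names mirror
# transformers.Seq2SeqTrainingArguments fields (an assumption about the setup).
training_kwargs = {
    "learning_rate": 1e-5,
    "lr_scheduler_type": "linear",
    "warmup_steps": 50,
    "num_train_epochs": 4,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "fp16": True,
    "gradient_checkpointing": True,
}

NUM_TRAIN_SAMPLES = 1000  # from the Dataset section

effective_batch = (training_kwargs["per_device_train_batch_size"]
                   * training_kwargs["gradient_accumulation_steps"])
steps_per_epoch = math.ceil(NUM_TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate after `step` optimizer updates: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(effective_batch, steps_per_epoch, total_steps)  # 8 125 500
for step in (0, 25, 50, 275, 500):
    lr = linear_warmup_lr(step, training_kwargs["learning_rate"],
                          training_kwargs["warmup_steps"], total_steps)
    print(step, f"{lr:.2e}")
```

Under these assumptions the 50 warmup steps cover the first 10% of roughly 500 optimizer updates.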

Training Hardware

  • Hardware: Google Colab T4 GPU (16GB VRAM)
  • Training time: ~60 minutes
  • Cloud Provider: Google Colab (free tier)

Intended Use

  • Hindi speech transcription from audio files
  • Voice-to-text applications in Hindi
  • Hindi ASR research and benchmarking

Limitations

  • Trained on only 1,000 samples; production use would benefit from the full Kathbath dataset (~140 hours)
  • Optimised for read speech; may perform differently on conversational Hindi
  • No data augmentation applied, so robustness to noisy environments is limited
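
For anyone extending this model, one simple form of the augmentation mentioned above is mixing white Gaussian noise into the training audio at a chosen signal-to-noise ratio. The sketch below was not part of this model's training; the function name add_noise_at_snr is hypothetical, and real pipelines often use recorded background noise rather than synthetic white noise.

```python
import math
import random


def add_noise_at_snr(samples, snr_db, rng=None):
    """Return a copy of `samples` with white Gaussian noise mixed in at the requested SNR (dB)."""
    rng = rng or random.Random()
    signal_power = sum(s * s for s in samples) / len(samples)
    if signal_power == 0.0:
        return list(samples)  # silent clip: nothing to scale the noise against
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


# One second of a 440 Hz tone at 16 kHz, degraded to roughly 10 dB SNR.
clean = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=10.0, rng=random.Random(0))
```

Sampling the SNR per utterance from a range (for example 5 to 20 dB, an arbitrary choice here) is a common way to expose the model to varied conditions.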

Citation

If you use this model, please cite the base model and dataset:

@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and others},
  year={2022}
}

@inproceedings{kathbath2022,
  title={Kathbath: A Robust Dataset for Hindi ASR},
  author={AI4Bharat},
  year={2022}
}