Whisper Medium — Hindi Fine-tuned

Fine-tuned version of openai/whisper-medium on the AI4Bharat Kathbath Hindi speech dataset for automatic speech recognition (ASR) in Hindi.

Model Details

Developed by: ShaikhAnis007
Model type: Automatic Speech Recognition (ASR)
Language: Hindi (hi) — Devanagari script
License: Apache 2.0
Base model: openai/whisper-medium
Fine-tuned on: AI4Bharat Kathbath — Hindi subset
Training compute: Google Colab T4 GPU (free tier)

Results

Metric	Base whisper-medium	After fine-tuning	Relative reduction
WER	0.4133 (41.3%)	0.2318 (23.2%)	43.9% ↓
CER	0.2292 (22.9%)	0.0704 (7.0%)	69.3% ↓

Evaluated on 50 examples from the Kathbath valid split (never seen during training).
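
The card does not say which tool produced these scores, so the following is only a minimal sketch of how WER, CER, and the relative-reduction column are conventionally defined. It uses a plain Levenshtein distance over words (WER) or Unicode code points (CER). The function names are illustrative, and libraries such as `jiwer` or `evaluate` may normalise text differently, so their numbers need not match this sketch exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty, whitespace-tokenised reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over code points (Devanagari marks count separately)."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(before, after):
    """Fraction by which an error rate dropped, relative to the starting value."""
    return (before - after) / before


# Reproduce the last column of the results table from the reported scores.
print(f"WER reduction: {relative_reduction(0.4133, 0.2318):.1%}")  # 43.9%
print(f"CER reduction: {relative_reduction(0.2292, 0.0704):.1%}")  # 69.3%
```

The last column of the table is therefore a relative reduction: the absolute WER drop is about 18 percentage points, which is 43.9% of the base model's 41.3%.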

How to Use

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("ShaikhAnis007/whisper-medium-hindi")
model = WhisperForConditionalGeneration.from_pretrained("ShaikhAnis007/whisper-medium-hindi")

# Load the audio as a 16 kHz mono float array, which is what the processor expects.
# "sample.wav" is a placeholder path; librosa is one of several loaders that would work.
audio, _ = librosa.load("sample.wav", sr=16000, mono=True)

# Convert the waveform to log-mel input features.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Force Hindi transcription instead of relying on language detection.
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        language="hindi",
        task="transcribe",
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
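
The snippet above needs a 16 kHz mono array. When audio is already in memory at another rate or with several channels, one rough way to convert it is sketched below with NumPy. `to_mono_16k` is an illustrative helper, not part of the model's API, and the `(samples, channels)` layout is an assumption. Linear interpolation does no anti-alias filtering, so a dedicated resampler (for example in librosa or torchaudio) is likely a better choice for real recordings.

```python
import numpy as np

TARGET_SR = 16_000  # sampling rate the processor call above is given


def to_mono_16k(audio, orig_sr):
    """Downmix to mono and linearly resample to 16 kHz.

    `audio` is assumed to have shape (samples,) or (samples, channels).
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # average the channels
    if orig_sr == TARGET_SR:
        return audio
    n_out = int(round(len(audio) * TARGET_SR / orig_sr))
    old_t = np.arange(len(audio)) / orig_sr  # timestamps of the input samples
    new_t = np.arange(n_out) / TARGET_SR     # timestamps of the output samples
    return np.interp(new_t, old_t, audio).astype(np.float32)


# One second of stereo silence at 44.1 kHz stands in for real audio here.
stereo_44k = np.zeros((44_100, 2), dtype=np.float32)
audio = to_mono_16k(stereo_44k, orig_sr=44_100)
print(audio.shape)  # (16000,)
```

The returned array can be passed to `processor(audio, sampling_rate=16000, return_tensors="pt")` in place of a freshly loaded file.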

Training Details

Dataset

Dataset: AI4Bharat Kathbath — Hindi subset
Training samples: 1,000 examples (~1 hour of audio)
Validation samples: 50 examples
Speaker diversity: 1,218 unique speakers across 203 districts of India (figure reported for the full multilingual Kathbath corpus, not the Hindi subset alone)
License: CC0 (fully open)

Hyperparameters

Parameter	Value
Learning rate	1e-5
LR scheduler	Linear with warmup
Warmup steps	50
Epochs	4
Batch size (physical)	1
Gradient accumulation	8 (effective batch = 8)
fp16	True
gradient_checkpointing	True
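
To make the table concrete, the sketch below works out the step counts and learning-rate schedule these values imply. It assumes a single GPU, that all 1,000 training examples are used in every epoch, and that the scheduler follows the usual linear-warmup, linear-decay shape; none of this is confirmed by the training script, which is not shown here. The `training_kwargs` dict uses parameter names from `transformers.Seq2SeqTrainingArguments` purely as an illustration of how the table might map onto that class.

```python
# Values taken from the hyperparameter table and the Dataset section.
LEARNING_RATE = 1e-5
WARMUP_STEPS = 50
EPOCHS = 4
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 8
TRAIN_EXAMPLES = 1_000

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM        # examples per optimizer update
steps_per_epoch = TRAIN_EXAMPLES // effective_batch    # optimizer updates per epoch
total_steps = steps_per_epoch * EPOCHS


def linear_warmup_lr(step):
    """Learning rate after `step` updates: linear ramp-up, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * remaining / (total_steps - WARMUP_STEPS)


# Hypothetical mapping of the table onto Seq2SeqTrainingArguments keyword names.
training_kwargs = dict(
    learning_rate=LEARNING_RATE,
    lr_scheduler_type="linear",
    warmup_steps=WARMUP_STEPS,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=PER_DEVICE_BATCH,
    gradient_accumulation_steps=GRAD_ACCUM,
    fp16=True,
    gradient_checkpointing=True,
)

print(effective_batch, steps_per_epoch, total_steps)  # 8 125 500
```

Under these assumptions the 50 warmup steps cover the first 10% of training. A physical batch of 1 with 8 accumulation steps, fp16, and gradient checkpointing all trade speed for lower memory use, which is consistent with fitting the model on a single T4.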

Training Hardware

Hardware: Google Colab T4 GPU (16GB VRAM)
Training time: ~60 minutes
Cloud Provider: Google Colab (free tier)

Intended Use

Hindi speech transcription from audio files
Voice-to-text applications in Hindi
Hindi ASR research and benchmarking

Limitations

Trained on 1,000 samples — production use would benefit from the full Kathbath dataset (~140 hours)
Optimised for read speech; may perform differently on conversational Hindi
No data augmentation applied — robustness to noisy environments is limited

Citation

If you use this model, please cite the base model and dataset:

@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and others},
  year={2022}
}

@misc{kathbath2022,
  title={IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian Languages},
  author={Javed, Tahir and others},
  year={2022}
}
