CT2 Whisper Medium — Malayalam R-MFT

CTranslate2 conversion of adalat-ai/whisper-medium-ml-rmft, optimised for fast CPU/GPU inference via faster-whisper.

This model was introduced in the paper Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition.


Model Description

The source model is a fine-tuned Malayalam ASR model based on openai/whisper-medium, trained using the Reverse Multi-Stage Fine-Tuning (R-MFT) recipe introduced in Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper for Indic Languages.

R-MFT trains in three stages with a decreasing learning rate schedule, presenting the hardest acoustic data first during the highest-plasticity phase:

| Stage | Data | Hours | LR |
| --- | --- | --- | --- |
| 1 | Tier C (Spontaneous) | ~512.5 | 2e-4 |
| 2 | Tier B (Broadcast) | ~200 | 1e-4 |
| 3 | Tier A (Studio) + Tier C mix | ~182.2 | 1e-5 |
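
The three-stage schedule can be written down as plain data, which makes the stage ordering and the decreasing learning rate explicit. This is a minimal sketch, not the authors' training code: the `Stage` class, its field names, and the `is_decreasing` helper are hypothetical, and only the tier order, hours, and learning rates are taken from the stage table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One R-MFT stage (hypothetical container, not from the original training code)."""
    index: int
    data: str
    hours: float
    learning_rate: float


# Values copied from the stage table; the names are illustrative only.
RMFT_STAGES = [
    Stage(1, "Tier C (Spontaneous)", 512.5, 2e-4),
    Stage(2, "Tier B (Broadcast)", 200.0, 1e-4),
    Stage(3, "Tier A (Studio) + Tier C mix", 182.2, 1e-5),
]


def is_decreasing(stages):
    """Return True if every stage uses a strictly lower learning rate than the one before it."""
    rates = [s.learning_rate for s in stages]
    return all(a > b for a, b in zip(rates, rates[1:]))


for stage in RMFT_STAGES:
    print(f"stage {stage.index}: {stage.data:<30} {stage.hours:>6.1f} h  lr={stage.learning_rate:.0e}")
```

A real training loop would run one fine-tuning pass per entry, carrying the checkpoint from each stage into the next; how the source model implements that hand-off is not described here.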

Benchmark Results (Vividh-ASR)

Benchmark WER is measured using faster-whisper with 7s VAD segmentation for long-form audio. See the blog post for full evaluation details.

| Model | Tier A (Studio) | Tier B (Broadcast) | Tier C (Spontaneous) | Tier D (Noise) | Global |
| --- | --- | --- | --- | --- | --- |
| whisper-medium-ml-high-lr | 35.04 | 30.48 | 50.30 | 50.78 | 40.85 |
| whisper-medium-ml-rmft (source model) | 37.56 | 31.66 | 46.10 | 45.73 | 39.64 |
| whisper-small-ml-high-lr | 39.05 | 32.50 | 54.39 | 51.08 | 43.93 |
| whisper-small-ml-rmft | 40.26 | 35.05 | 53.77 | 48.04 | 44.53 |
| IndicWhisper | 38.07 | 32.43 | 65.74 | 46.92 | 47.96 |
| Vegam Whisper | 38.74 | 55.10 | 58.53 | 54.46 | 53.39 |

WER %. Lower is better.
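
For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition. It is not the benchmark's scoring script: the function name is hypothetical, whitespace tokenisation is an assumption, and any text normalisation applied in the actual evaluation is not reproduced.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes whitespace tokenisation and no text normalisation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(f"{100 * word_error_rate('a b c d', 'a x c d'):.2f}")
```

Corpus-level scores are usually computed by summing errors and reference words over all utterances before dividing, rather than averaging per-utterance rates; which convention the table uses is not stated here.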


Usage

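from faster_whisper import WhisperModel

# Load the CTranslate2 model on GPU in half precision.
model = WhisperModel(
    "adalat-ai/ct2-whisper-medium-ml-rmft",
    device="cuda",
    compute_type="float16",
)

# Split long-form audio with VAD into speech chunks of at most 7 s.
segments, info = model.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters={"max_speech_duration_s": 7},
)

# segments is a generator; transcription runs as it is iterated.
for segment in segments:
    print(f"{segment.start:.2f} - {segment.end:.2f}: {segment.text}")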

Note: Benchmark results use 7s VAD segmentation (vad_filter=True, max_speech_duration_s=7). For short clips, VAD is not required.
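
The same call can be adapted for CPU inference and for short clips. The following is a sketch under stated assumptions: the helper names (`runtime_options`, `transcribe_options`, `transcribe_file`) are hypothetical, and `compute_type="int8"` on CPU is a general CTranslate2 option that was not necessarily used for the benchmark numbers, so accuracy may differ from the reported WER.

```python
def runtime_options(use_gpu: bool) -> dict:
    """Pick device and precision. The int8 CPU setting is an assumption, not a benchmarked config."""
    if use_gpu:
        return {"device": "cuda", "compute_type": "float16"}
    return {"device": "cpu", "compute_type": "int8"}


def transcribe_options(long_form: bool) -> dict:
    """Use the 7 s VAD segmentation for long-form audio and skip VAD for short clips."""
    if long_form:
        return {"vad_filter": True, "vad_parameters": {"max_speech_duration_s": 7}}
    return {"vad_filter": False}


def transcribe_file(path: str, use_gpu: bool = False, long_form: bool = True) -> str:
    """Transcribe one audio file and return the joined segment text."""
    # Imported lazily so the option helpers can be used without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel("adalat-ai/ct2-whisper-medium-ml-rmft", **runtime_options(use_gpu))
    segments, _info = model.transcribe(path, **transcribe_options(long_form))
    return " ".join(segment.text.strip() for segment in segments)
```

For example, `transcribe_file("clip.wav", use_gpu=False, long_form=False)` would run a short clip on CPU without VAD; `clip.wav` is a placeholder file name.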


Training Data

Training data is a superset of the Vividh-ASR benchmark evaluation splits.

| Tier | Hours | Sources |
| --- | --- | --- |
| A (Studio) | 182.2 | Fleurs, IndicTTS, OpenSLR, IMASC |
| B (Broadcast) | 200.0 | Shrutilipi |
| C (Spontaneous) | 512.5 | IndicVoices, Common Voice |
| Total | 894.7 | |

Citation

If you use this model or the Vividh-ASR benchmark, please cite:

@misc{vividhasr2025,
  title  = {Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper for Indic Languages},
  author = {Kush Juvekar and Kavya Manohar and Kumaramanas Nethil},
  year   = {2026},
  url    = {https://huggingface.co/blog/adalat-ai/vividh-benchmark}
}
@misc{vividh2026,
  title         = {Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition},
  author        = {Kush Juvekar and Kavya Manohar and Aditya Srinivas Menon and Arghya Bhattacharya and Kumarmanas Nethil},
  year          = {2026},
  eprint        = {2605.13087},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.13087}
}


Developed by Adalat AI. Released under Apache 2.0.
