CT2 Whisper Medium — Malayalam R-MFT
CTranslate2 conversion of adalat-ai/whisper-medium-ml-rmft, optimised for fast CPU/GPU inference via faster-whisper.
This model was introduced in the paper Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition.
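A conversion like the one described above can be sketched with the CTranslate2 Python converter. This is a minimal sketch, not the exact procedure used for this repository: the output directory name and the `float16` quantization are assumptions, as is the presence of `tokenizer.json` and `preprocessor_config.json` in the source repository.

```python
SOURCE_MODEL = "adalat-ai/whisper-medium-ml-rmft"
OUTPUT_DIR = "ct2-whisper-medium-ml-rmft"  # hypothetical local output directory


def convert_to_ct2(source=SOURCE_MODEL, output_dir=OUTPUT_DIR, quantization="float16"):
    """Convert a Transformers Whisper checkpoint to CTranslate2 format.

    Returns the path of the output directory.
    """
    # Imported lazily so this module can be loaded without ctranslate2 installed.
    from ctranslate2.converters import TransformersConverter

    converter = TransformersConverter(
        source,
        # Assumption: the source repo ships these files; faster-whisper reads
        # them from the converted model directory.
        copy_files=["tokenizer.json", "preprocessor_config.json"],
    )
    # force=True overwrites output_dir if it already exists.
    return converter.convert(output_dir, quantization=quantization, force=True)
```

Calling `convert_to_ct2()` downloads the source checkpoint and requires both `ctranslate2` and `transformers` to be installed. The resulting directory can be passed to `WhisperModel` in place of the Hub model ID.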
Model Description
The source model is a fine-tuned Malayalam ASR model based on openai/whisper-medium, trained using the Reverse Multi-Stage Fine-Tuning (R-MFT) recipe introduced in Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper for Indic Languages.
R-MFT trains in three stages with a decreasing learning rate schedule, presenting the hardest acoustic data first during the highest-plasticity phase:
| Stage | Data | LR |
|---|---|---|
| 1 | Tier C — Spontaneous (~512.5 hrs) | 2e-4 |
| 2 | Tier B — Broadcast (~200 hrs) | 1e-4 |
| 3 | Tier A — Studio + Tier C mix (~182.2 hrs) | 1e-5 |
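The schedule in the table can be written down as plain data. The sketch below is illustrative only and is not the authors' training code: the `Stage` class, `check_schedule`, `run_stages`, and the `train_one_stage` callback are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    number: int
    data: str
    approx_hours: float
    learning_rate: float


# Values copied from the stage table above.
RMFT_STAGES = [
    Stage(1, "Tier C (Spontaneous)", 512.5, 2e-4),
    Stage(2, "Tier B (Broadcast)", 200.0, 1e-4),
    Stage(3, "Tier A (Studio) + Tier C mix", 182.2, 1e-5),
]


def check_schedule(stages):
    """Return True if the learning rate strictly decreases from stage to stage."""
    rates = [s.learning_rate for s in stages]
    return all(a > b for a, b in zip(rates, rates[1:]))


def run_stages(stages, train_one_stage):
    """Run the stages in order, carrying the model state between them.

    train_one_stage is a hypothetical callback: (state, stage) -> new state.
    """
    state = None
    for stage in stages:
        state = train_one_stage(state, stage)
    return state
```

Each stage starts from the weights produced by the previous one, so the spontaneous-speech data is seen at the largest learning rate and the studio mix only at the smallest.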
Benchmark Results (Vividh-ASR)
Benchmark WER is measured using faster-whisper with 7s VAD segmentation for long-form audio. See the blogpost for full evaluation details.
| Model | Tier A (Studio) | Tier B (Broadcast) | Tier C (Spontaneous) | Tier D (Noise) | Global |
|---|---|---|---|---|---|
| whisper-medium-ml-high-lr | 35.04 | 30.48 | 50.30 | 50.78 | 40.85 |
| whisper-medium-ml-rmft (source model) | 37.56 | 31.66 | 46.10 | 45.73 | 39.64 |
| whisper-small-ml-high-lr | 39.05 | 32.50 | 54.39 | 51.08 | 43.93 |
| whisper-small-ml-rmft | 40.26 | 35.05 | 53.77 | 48.04 | 44.53 |
| IndicWhisper | 38.07 | 32.43 | 65.74 | 46.92 | 47.96 |
| Vegam Whisper | 38.74 | 55.10 | 58.53 | 54.46 | 53.39 |
WER %. Lower is better.
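For readers unfamiliar with the metric, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The function below is a minimal sketch of that definition. It splits on whitespace only, and the benchmark's own text normalization and aggregation are not described in this card, so it should not be expected to reproduce the table exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` returns `50.0`: one substitution and one deletion against four reference words.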
Usage
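from faster_whisper import WhisperModel

# Load the CTranslate2 model from the Hugging Face Hub (GPU, half precision).
model = WhisperModel(
    "adalat-ai/ct2-whisper-medium-ml-rmft",
    device="cuda",
    compute_type="float16",
)

# 7 s VAD segmentation matches the benchmark setup for long-form audio.
segments, info = model.transcribe("audio.wav", vad_filter=True, vad_parameters={"max_speech_duration_s": 7})

# segments is a generator; transcription runs as it is iterated.
for segment in segments:
    print(f"{segment.start:.2f} - {segment.end:.2f}: {segment.text}")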
Note: Benchmark results use 7s VAD segmentation (`vad_filter=True`, `max_speech_duration_s=7`). For short clips, VAD is not required.
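The same model can also run on CPU, and VAD can be skipped for short clips. The sketch below shows one way to do this with faster-whisper. The `int8` compute type and the explicit `language="ml"` setting are assumptions chosen for illustration, not settings taken from the benchmark, and the helper names are hypothetical.

```python
MODEL_ID = "adalat-ai/ct2-whisper-medium-ml-rmft"


def transcribe_kwargs(long_form: bool) -> dict:
    """Build keyword arguments for WhisperModel.transcribe."""
    # Assumption: pin decoding to Malayalam rather than auto-detecting the language.
    kwargs = {"language": "ml"}
    if long_form:
        # Same 7 s VAD segmentation as in the GPU example above.
        kwargs["vad_filter"] = True
        kwargs["vad_parameters"] = {"max_speech_duration_s": 7}
    return kwargs


def transcribe_on_cpu(path: str, long_form: bool = False) -> list:
    """Transcribe an audio file on CPU and return (start, end, text) tuples."""
    # Imported lazily so the helper above works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(MODEL_ID, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, **transcribe_kwargs(long_form))
    # segments is a lazy generator; building the list runs the decoding.
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

`transcribe_on_cpu("clip.wav")` handles a short clip without VAD, while `transcribe_on_cpu("hearing.wav", long_form=True)` applies the 7 s segmentation. Both file names are placeholders.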
Training Data
Training data is a superset of the Vividh-ASR benchmark evaluation splits.
| Tier | Hours | Sources |
|---|---|---|
| A (Studio) | 182.2 | Fleurs, IndicTTS, OpenSLR, IMASC |
| B (Broadcast) | 200.0 | Shrutilipi |
| C (Spontaneous) | 512.5 | IndicVoices, Common Voice |
| Total | 894.7 | |
Citation
If you use this model or the Vividh-ASR benchmark, please cite:
@misc{vividhasr2025,
  title = {Vividh-ASR: Diagnosing and Fixing Studio-Bias in Whisper for Indic Languages},
  author = {Kush Juvekar and Kavya Manohar and Kumarmanas Nethil},
  year = {2026},
  url = {https://huggingface.co/blog/adalat-ai/vividh-benchmark}
}
@misc{vividh2026,
  title={Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition},
  author={Kush Juvekar and Kavya Manohar and Aditya Srinivas Menon and Arghya Bhattacharya and Kumarmanas Nethil},
  year={2026},
  eprint={2605.13087},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.13087},
}
Related Models and Datasets
- Source Transformers model: adalat-ai/whisper-medium-ml-rmft
- See the full Vividh collection.
Developed by Adalat AI. Released under Apache 2.0.