CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions (arXiv:2408.16589)
Verbatim Hindi ASR with word-level timestamps.
Fine-tuned from vasista22/whisper-hindi-large-v2 on IndicVoices-R Hindi using the CrisperWhisper dual-loss approach.
| Model | WER (normalised reference) | WER (verbatim reference) |
|---|---|---|
| Base (vasista22/whisper-hindi-large-v2) | 23.52% | 25.91% |
| Indic CrisperWhisper v1 | 18.86% | 18.60% |
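
For readers who want to compute comparable figures on their own data, WER is the word-level edit distance divided by the number of reference words. The sketch below is a minimal, dependency-free version. The function names and the toy sentence are illustrative, and the text normalisation behind the table above is not specified here, so numbers are only comparable when the same normalisation is applied to both reference and hypothesis. The last two lines compute the relative WER reduction implied by the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(base_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on base_wer."""
    return (base_wer - new_wer) / base_wer


# Toy example (made up): a verbatim reference with a filler and a hypothesis
# that drops both the filler and the final word, i.e. 2 deletions in 6 words.
reference = "मैं [UM] कल बाज़ार गया था"
hypothesis = "मैं कल बाज़ार गया"
print(f"toy WER: {word_error_rate(reference, hypothesis):.3f}")

# Relative reductions implied by the table above.
print(f"normalised: {relative_reduction(23.52, 18.86):.1%}")
print(f"verbatim:   {relative_reduction(25.91, 18.60):.1%}")
```

On the table's figures this works out to roughly a 19.8% relative reduction against the normalised reference and 28.2% against the verbatim one.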

Install the dependencies:

    pip install torch torchaudio "transformers>=4.37" huggingface_hub numpy

Transcribe a file from the command line:

    python transcribe_hindi.py audio.wav --model_dir user71/indic-crisperwhisper-hindi-v1

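The same script can be used from Python:

    from transcribe_hindi import load_model, transcribe

    # Load the model once and reuse the pipeline across files.
    pipe = load_model("user71/indic-crisperwhisper-hindi-v1")
    result = transcribe("audio.wav", pipe=pipe)

    # Full verbatim transcript.
    print(result["text"])

    # Word-level chunks, each with a (start, end) timestamp in seconds.
    for chunk in result["chunks"]:
        word = chunk["text"]
        start = chunk["timestamp"][0]
        end = chunk["timestamp"][1]
        print(f"{word:<20} {start:.3f}s – {end:.3f}s")
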
Output includes Hindi filler tokens: [UM], [UH], [FILLER], [PAUSE].
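
If a downstream task needs a clean transcript, the filler tokens can be separated from ordinary words after transcription. The following is a minimal sketch, assuming each filler appears as its own entry in `result["chunks"]` with the same `text` and `timestamp` keys as in the usage example. The helper names and the mock `result` are made up for illustration and are not part of `transcribe_hindi.py`.

```python
FILLER_TOKENS = {"[UM]", "[UH]", "[FILLER]", "[PAUSE]"}


def split_fillers(chunks):
    """Separate word chunks from filler chunks, preserving order."""
    words, fillers = [], []
    for chunk in chunks:
        target = fillers if chunk["text"].strip() in FILLER_TOKENS else words
        target.append(chunk)
    return words, fillers


def total_duration(chunks):
    """Sum chunk durations in seconds, skipping chunks with a missing timestamp."""
    total = 0.0
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if start is not None and end is not None:
            total += end - start
    return total


# Mock output with invented timestamps, shaped like the usage example's result.
result = {
    "text": "मैं [UM] कल [PAUSE] आऊँगा",
    "chunks": [
        {"text": "मैं", "timestamp": (0.00, 0.24)},
        {"text": "[UM]", "timestamp": (0.30, 0.62)},
        {"text": "कल", "timestamp": (0.70, 0.95)},
        {"text": "[PAUSE]", "timestamp": (0.95, 1.40)},
        {"text": "आऊँगा", "timestamp": (1.40, 1.92)},
    ],
}

words, fillers = split_fillers(result["chunks"])
clean_text = " ".join(chunk["text"].strip() for chunk in words)
print(clean_text)
print(f"time spent in fillers: {total_duration(fillers):.2f}s")
```

Keeping the filler chunks rather than discarding them preserves their timestamps, which is useful for disfluency statistics or for cutting pauses out of the audio.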
| File | Description |
|---|---|
| `model-*.safetensors` | Model weights (Whisper large-v2 based, 1.5B params) |
| `retok_config.json` | Retokenisation config (filler token IDs, vocab size = 50368) |
| `alignment_heads.json` | 15 selected cross-attention heads used for DTW timestamps |
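
To give a feel for how cross-attention and DTW produce timestamps, the sketch below runs dynamic time warping over a toy token-by-frame attention matrix and converts the resulting path into per-token time spans. It is an illustration under stated assumptions, not this model's implementation: the attention values are invented, the single matrix stands in for attention that has already been combined across the selected heads, `1 - attention` is just one possible cost, and the function names are hypothetical. The 20 ms frame length matches the Whisper encoder, which emits 1500 frames for 30 s of audio.

```python
import numpy as np

FRAME_SECONDS = 0.02  # Whisper encoder: 1500 frames per 30 s window


def dtw_path(cost: np.ndarray) -> list[tuple[int, int]]:
    """Minimum-cost monotonic path through a (tokens x frames) cost matrix."""
    n_tokens, n_frames = cost.shape
    acc = np.full((n_tokens + 1, n_frames + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    # Backtrack from the bottom-right corner to the origin.
    i, j = n_tokens, n_frames
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def token_spans(path, frame_seconds=FRAME_SECONDS):
    """Map each token index to (start, end) seconds from its first and last frame."""
    frames = {}
    for token, frame in path:
        first, last = frames.get(token, (frame, frame))
        frames[token] = (min(first, frame), max(last, frame))
    return {t: (a * frame_seconds, (b + 1) * frame_seconds) for t, (a, b) in frames.items()}


# Toy attention: 3 tokens x 6 frames, each token attending to two frames.
attention = np.array([
    [0.8, 0.7, 0.1, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.8, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.2, 0.9, 0.9],
])
path = dtw_path(1.0 - attention)
for token, (start, end) in token_spans(path).items():
    print(f"token {token}: {start:.2f}s - {end:.2f}s")
```

In this toy case the path assigns frames 0-1, 2-3, and 4-5 to the three tokens in order, so each token receives a 40 ms span.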
Built on CrisperWhisper (Wagner et al., INTERSPEECH 2024).

Training data: IndicVoices-R (Sankar et al., 2024).

Sanat Kumar Agarwal and Srikanth Raj Chetupalli, "IndicCrisperWhisper-Time-stamped transcription for IndicVoices using CrisperWhisper", May 2026.