SpecDox: Fast Urdu-to-English Speech Translation Model

This is a fine-tuned version of the OpenAI Whisper Medium model. It is trained to perform Automatic Speech Recognition (ASR) and audio translation: it takes spoken Urdu (اردو) and produces written English text.

This model serves as the core audio-processing engine for SpecDox, a real-time Urdu-to-English Speech-to-Structured-Document system.

🚀 Key Features & SEO Highlights

  • High Speed & Low VRAM: Built on the 769M parameter Whisper Medium architecture. We chose this over the Large model to maximize GPU efficiency, ensure fast inference speeds, and allow for deployment on consumer-grade hardware.
  • Massive Training Data: Trained on 127 hours of high-quality Urdu-to-English baseline speech data, expanded to 172 hours through data augmentation.
  • PEFT / LoRA Optimized: Fine-tuned using Parameter-Efficient Fine-Tuning (LoRA adapters) and merged into FP16/BF16 weights for a lightweight footprint without sacrificing domain specificity.
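To put the "Low VRAM" point in numbers, here is a back-of-the-envelope sketch of weight memory for the two architectures. It only multiplies parameter count by bytes per parameter, so activations, the decoding cache, and framework overhead are not included. The 1,550M figure for Whisper Large is the commonly published size and is an assumption here, not a number from this card; `weight_memory_gb` is a hypothetical helper.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# This ignores activations, the decoding cache, and framework overhead.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


MEDIUM_PARAMS = 769_000_000    # Whisper Medium, as stated in the feature list
LARGE_PARAMS = 1_550_000_000   # Whisper Large (assumed from the published size)

medium_fp16 = weight_memory_gb(MEDIUM_PARAMS, "fp16")
large_fp16 = weight_memory_gb(LARGE_PARAMS, "fp16")

print(f"Medium FP16 weights: {medium_fp16:.2f} GB")
print(f"Large  FP16 weights: {large_fp16:.2f} GB")
```

Under these assumptions the Medium weights take roughly half the memory of the Large weights, which is the headroom that matters on consumer GPUs.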

📊 Evaluation & Performance

The table below compares the fine-tuned SpecDox models against standard baseline architectures. Evaluation used four metrics: Word Error Rate (WER), BLEU score, METEOR, and BERTScore F1.

| Model | WER % ↓ | BLEU ↑ | METEOR ↑ | BERTScore F1 ↑ | Rank |
|---|---|---|---|---|---|
| SpecDox-Whisper-Medium | 36.25 | 53.30 | 0.7804 | 0.9405 | #1 |
| Faster Whisper (SpecDox) | 36.28 | 53.24 | 0.7811 | 0.9402 | #2 |
| Whisper Large-v3 | 42.88 | 46.86 | 0.7105 | 0.9270 | #3 |
| Whisper Medium (Baseline) | 45.33 | 44.16 | 0.6882 | 0.9226 | #4 |
| SeamlessM4T Medium | 72.04 | 18.84 | 0.3697 | 0.8429 | #5 |
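For readers unfamiliar with the WER column, the sketch below shows the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. This card does not say which tool or text normalization produced the reported numbers, so treat `word_error_rate` as an illustrative helper rather than the evaluation script that was actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.2%}")
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.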

Engineering Takeaway: Despite being a lighter architecture, the fine-tuned SpecDox Medium model beats the baseline Whisper Large-v3 by 6.63 percentage points of WER (36.25% vs. 42.88%) and scores higher on the translation quality metrics (BLEU 53.30 vs. 46.86, METEOR 0.7804 vs. 0.7105). This supports the choice of Whisper Medium for production environments that need fast inference and a small GPU footprint.
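The WER comparison can be checked directly from the table. The short sketch below separates the absolute gap (in percentage points) from the relative reduction, since the two are easy to confuse. The numbers are copied from the table; the variable names are only illustrative.

```python
# WER values (in percent) copied from the evaluation table.
specdox_medium_wer = 36.25
whisper_large_v3_wer = 42.88

# Absolute gap, measured in percentage points.
absolute_gap = whisper_large_v3_wer - specdox_medium_wer

# Relative reduction, measured against the Large-v3 baseline.
relative_reduction = absolute_gap / whisper_large_v3_wer * 100

print(f"Absolute gap: {absolute_gap:.2f} percentage points")
print(f"Relative reduction: {relative_reduction:.1f}% of the Large-v3 WER")
```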


💡 Ideal Use Cases

This model is built for the following tasks:

  • Real-Time Translation: Live transcription and translation of Urdu audio, podcasts, or lectures into English.
  • Voice-to-Text Document Generation: Converting dictated Urdu notes into structured English reports (the primary function of SpecDox).
  • Cross-Lingual ASR: Handling Pakistani accents and regional Urdu pronunciations with high accuracy.
  • Edge Deployment: Running high-accuracy audio translation on hardware with limited VRAM (Google Colab free tier, local RTX GPUs, etc.).
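Podcasts and lectures are usually longer than the 30-second window that the Whisper encoder processes in one pass. This card does not describe how SpecDox handles long recordings, so the following is only a minimal sketch of one common approach: split the waveform into fixed-size chunks (optionally overlapping) and translate each chunk in turn. `chunk_audio` and its parameters are hypothetical names chosen for illustration.

```python
from typing import Iterator, Sequence


def chunk_audio(
    samples: Sequence[float],
    sampling_rate: int = 16000,
    chunk_seconds: float = 30.0,
    overlap_seconds: float = 0.0,
) -> Iterator[Sequence[float]]:
    """Yield consecutive slices of at most chunk_seconds of audio."""
    chunk_len = int(chunk_seconds * sampling_rate)
    step = chunk_len - int(overlap_seconds * sampling_rate)
    if chunk_len <= 0 or step <= 0:
        raise ValueError("chunk_seconds must be positive and larger than overlap_seconds")

    for start in range(0, len(samples), step):
        yield samples[start:start + chunk_len]
        # Stop once a chunk has reached the end, so overlapping windows
        # do not produce a trailing chunk that is pure repetition.
        if start + chunk_len >= len(samples):
            break


# 70 seconds of silence at 16 kHz splits into 30 s + 30 s + 10 s.
seventy_seconds = [0.0] * (16000 * 70)
lengths = [len(chunk) / 16000 for chunk in chunk_audio(seventy_seconds)]
print(lengths)
```

Hard cuts can split a word across two chunks, so a small overlap or silence-based splitting may give cleaner text at the boundaries.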

💻 How to Use in Python

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the SpecDox Whisper model
processor = WhisperProcessor.from_pretrained("Shzaib/SpecDox-Whisper-Medium")
model = WhisperForConditionalGeneration.from_pretrained("Shzaib/SpecDox-Whisper-Medium").to(device)

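def translate_urdu_audio(audio_array, sampling_rate=16000):
    """Translate one Urdu audio clip into English text.

    audio_array: mono waveform as a 1-D float array sampled at sampling_rate
    (the Whisper feature extractor expects 16 kHz).
    """
    # Convert the waveform to log-Mel input features and move them to the model's device
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt").to(device)

    # Force the decoder to translate Urdu to English
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="urdu", task="translate")
    predicted_ids = model.generate(inputs["input_features"], forced_decoder_ids=forced_decoder_ids)

    # Decode the generated token IDs and return the text for the single input clip
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]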
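The function above expects a mono waveform already loaded into memory. As a minimal sketch of that loading step, the helper below reads a 16-bit PCM WAV file using only the Python standard library and returns float samples in the range [-1, 1]. `load_wav_mono` is a hypothetical helper, not part of this repository. It does not resample, so it assumes the file is already at 16 kHz, and other formats (MP3, 24-bit WAV) would need an audio library instead.

```python
import array
import sys
import wave
from typing import List, Tuple


def load_wav_mono(path: str) -> Tuple[List[float], int]:
    """Read a 16-bit PCM WAV file; return (mono samples in [-1, 1], sample rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only supports 16-bit PCM WAV files")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    pcm = array.array("h")  # signed 16-bit integers
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian

    # Average the interleaved channels of each frame and scale to [-1, 1].
    scale = channels * 32768.0
    mono = [sum(pcm[i:i + channels]) / scale for i in range(0, len(pcm), channels)]
    return mono, rate
```

A call such as `audio, rate = load_wav_mono("clip.wav")` followed by `translate_urdu_audio(audio, sampling_rate=rate)` would then tie the two pieces together, where `clip.wav` is a placeholder file name. If your version of the processor expects a NumPy array, wrap the list with `numpy.asarray` first.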