SpecDox: Fast Urdu-to-English Speech Translation Model

This is a fine-tuned version of the OpenAI Whisper Medium model. It is trained for Automatic Speech Recognition (ASR) and speech translation, taking spoken Urdu (اردو) and converting it directly into written English text.

This model serves as the core audio-processing engine for SpecDox, a real-time Urdu-to-English Speech-to-Structured-Document system.

🚀 Key Features

High Speed & Low VRAM: Built on the 769M parameter Whisper Medium architecture. We chose this over the Large model to maximize GPU efficiency, ensure fast inference speeds, and allow for deployment on consumer-grade hardware.
Training Data: Trained on 127 hours of Urdu-to-English speech data, expanded to 172 hours through data augmentation.
PEFT / LoRA Optimized: Fine-tuned using Parameter-Efficient Fine-Tuning (LoRA adapters) and merged into FP16/BF16 weights for a lightweight footprint without sacrificing domain specificity.
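
To make the LoRA point concrete: merging folds the low-rank update into the base weight, so inference needs no separate adapter layers. The sketch below shows that arithmetic on a toy 2x2 weight in plain Python. The rank, alpha, and all matrix values are illustrative assumptions, not the configuration used to train this model.

```python
# Toy illustration of merging a LoRA adapter into a base weight matrix.
# All numbers are made up; the real rank/alpha for SpecDox are not stated here.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(base, lora_b, lora_a, alpha, rank):
    """Return base + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

base = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (2x2)
lora_b = [[1.0], [2.0]]           # trained factor B, shape 2x1 (rank 1)
lora_a = [[0.5, -0.5]]            # trained factor A, shape 1x2 (rank 1)

merged = merge_lora(base, lora_b, lora_a, alpha=2, rank=1)
print(merged)
```

After merging, the result is an ordinary weight matrix of the same shape as the base, which is why the merged checkpoint loads like any other Whisper model.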

📊 Evaluation & Performance

The table below compares the fine-tuned SpecDox models against standard baseline models. Evaluation used four metrics: Word Error Rate (WER), BLEU, METEOR, and BERTScore F1. Arrows indicate whether lower (↓) or higher (↑) is better.

Model	WER% ↓	BLEU ↑	METEOR ↑	BERTScore F1 ↑	Rank
SpecDox-Whisper-Medium	36.25	53.30	0.7804	0.9405	#1
Faster Whisper (SpecDox)	36.28	53.24	0.7811	0.9402	#2
Whisper Large-v3	42.88	46.86	0.7105	0.9270	#3
Whisper Medium (Baseline)	45.33	44.16	0.6882	0.9226	#4
SeamlessM4T Medium	72.04	18.84	0.3697	0.8429	#5
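
For readers unfamiliar with the WER column: word error rate is the word-level edit distance between the model output and the reference, divided by the number of reference words. A minimal sketch in plain Python follows. It is not the evaluation script used for this table, and it skips the text normalization (casing, punctuation) that real evaluations usually apply; the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# Invented example: one substitution ("has" -> "had") and one deletion ("a")
wer = word_error_rate("the patient has a mild fever", "the patient had mild fever")
print(f"{wer:.2f}")
```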

Engineering Takeaway: Despite being a smaller architecture, the fine-tuned SpecDox Medium model beats the off-the-shelf Whisper Large-v3 by 6.63 percentage points of WER (36.25 vs. 42.88) and scores higher on BLEU (53.30 vs. 46.86) and METEOR (0.7804 vs. 0.7105). This supports the choice of Whisper Medium for production environments that need fast inference and a low GPU footprint.
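
The WER comparison can be reproduced directly from the table. The short check below uses only the two WER values listed there and also reports the relative reduction, which the text does not state.

```python
# WER values taken from the evaluation table
large_v3_wer = 42.88   # Whisper Large-v3
specdox_wer = 36.25    # SpecDox-Whisper-Medium

absolute = large_v3_wer - specdox_wer        # difference in percentage points
relative = 100 * absolute / large_v3_wer     # reduction relative to Large-v3

print(f"absolute: {absolute:.2f} points")
print(f"relative: {relative:.1f}%")
```

The absolute figure is a difference in percentage points; relative to Large-v3's error rate it works out to roughly 15.5%.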

💡 Ideal Use Cases

If you are searching for a model to handle the following tasks, this model is built for you:

Real-Time Translation: Live transcription and translation of Urdu audio, podcasts, or lectures into English.
Voice-to-Text Document Generation: Converting dictated Urdu notes into structured English reports (the primary function of SpecDox).
Cross-Lingual ASR: Handling Pakistani accents and regional Urdu pronunciations with high accuracy.
Edge Deployment: Running high-accuracy audio translation on hardware with limited VRAM (Google Colab free tier, local RTX GPUs, etc.).
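
As a rough sanity check on the limited-VRAM point, the weight footprint can be estimated from the parameter count and the tensor type. The sketch below assumes 769M parameters and covers weights only; activations, the decoder cache, and framework overhead add to this at run time, so treat the result as a lower bound.

```python
PARAMS = 769_000_000  # Whisper Medium parameter count quoted above
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

def weight_footprint_gib(num_params: int, dtype: str) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

for dtype in ("fp32", "fp16"):
    print(f"{dtype}: {weight_footprint_gib(PARAMS, dtype):.2f} GiB")
```

Storing the weights in half precision halves the footprint compared with FP32, which is what leaves headroom on smaller GPUs.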

💻 How to Use in Python

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the SpecDox Whisper model
processor = WhisperProcessor.from_pretrained("Shzaib/SpecDox-Whisper-Medium")
model = WhisperForConditionalGeneration.from_pretrained("Shzaib/SpecDox-Whisper-Medium").to(device)

def translate_urdu_audio(audio_array, sampling_rate=16000):
    """Translate a mono Urdu waveform (1-D float array) into English text."""
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    input_features = inputs.input_features.to(device)

    # Ask the decoder to translate Urdu to English. Recent transformers releases
    # accept language/task directly in generate(); older code passed
    # forced_decoder_ids from processor.get_decoder_prompt_ids() instead.
    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="urdu", task="translate")

    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
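
The function above handles a single array, and Whisper's feature extractor pads or truncates its input to a 30-second window, so longer recordings need to be split first. The helper below is a minimal sketch of fixed-length chunking; `split_into_chunks` is a hypothetical name, not part of this model or of transformers, and the silent audio is a stand-in for a real recording.

```python
def split_into_chunks(samples, sampling_rate=16000, chunk_seconds=30):
    """Split a 1-D sequence of audio samples into consecutive fixed-length windows."""
    if sampling_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sampling_rate and chunk_seconds must be positive")
    chunk_len = int(sampling_rate * chunk_seconds)
    # Slicing works for Python lists and NumPy arrays alike
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

# 70 seconds of silence at 16 kHz stands in for a real recording
audio = [0.0] * (70 * 16000)
chunks = split_into_chunks(audio)
print([len(c) / 16000 for c in chunks])
```

Each chunk could then be passed to translate_urdu_audio and the results joined. Fixed windows can cut a word in half, so production pipelines often add overlap between windows or split on silence instead.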
