SpecDox: Fast Urdu-to-English Speech Translation Model
This is a fine-tuned version of the OpenAI Whisper Medium model. It is trained for Automatic Speech Recognition (ASR) and speech translation, taking spoken Urdu (اردو) and converting it into written English text.
This model serves as the core audio-processing engine for SpecDox, a real-time Urdu-to-English Speech-to-Structured-Document system.
Key Features & SEO Highlights
- High Speed & Low VRAM: Built on the 769M-parameter Whisper Medium architecture. We chose this over the Large model to improve GPU efficiency, keep inference fast, and allow deployment on consumer-grade hardware.
- Training Data: Trained on 127 hours of Urdu-to-English speech data, expanded to 172 hours through data augmentation.
- PEFT / LoRA Optimized: Fine-tuned with Parameter-Efficient Fine-Tuning (LoRA adapters), then merged into FP16/BF16 weights to keep the footprint small while retaining domain specificity.
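To make the "merged" step concrete, here is a minimal sketch of the standard LoRA merge rule, W' = W + (alpha / r) · B · A, written in plain Python on tiny matrices. The rank, alpha, and matrix values are illustrative assumptions; the model card does not state the actual LoRA hyperparameters or which layers were adapted, and real merges are done by the training library rather than by hand.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into a frozen weight: W' = W + (alpha / rank) * B @ A.

    weight: (out, in) matrix, lora_b: (out, rank), lora_a: (rank, in).
    All values here are hypothetical and only illustrate the formula.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: 2x2 identity weight with a rank-1 adapter.
base_weight = [[1.0, 0.0], [0.0, 1.0]]
adapter_a = [[1.0, 2.0]]      # shape (rank=1, in=2)
adapter_b = [[1.0], [0.0]]    # shape (out=2, rank=1)
merged = merge_lora(base_weight, adapter_a, adapter_b, alpha=2.0, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why a merged checkpoint loads like an ordinary Whisper model.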
Evaluation & Performance
The table below compares the fine-tuned SpecDox models against standard baseline models. Evaluation used four metrics: Word Error Rate (WER), BLEU score, METEOR, and BERTScore F1.
| Model | WER% ↓ | BLEU ↑ | METEOR ↑ | BERTScore F1 ↑ | Rank |
|---|---|---|---|---|---|
| SpecDox-Whisper-Medium | 36.25 | 53.30 | 0.7804 | 0.9405 | #1 |
| Faster Whisper (SpecDox) | 36.28 | 53.24 | 0.7811 | 0.9402 | #2 |
| Whisper Large-v3 | 42.88 | 46.86 | 0.7105 | 0.9270 | #3 |
| Whisper Medium (Baseline) | 45.33 | 44.16 | 0.6882 | 0.9226 | #4 |
| SeamlessM4T Medium | 72.04 | 18.84 | 0.3697 | 0.8429 | #5 |
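For readers unfamiliar with the WER column, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It assumes plain whitespace tokenization with no text normalization; the evaluation script and normalization actually used for the table are not described in this card, so this is an illustration and not a reproduction of those numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```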
Engineering Takeaway: Despite being a lighter architecture, the fine-tuned SpecDox Medium model reduces WER by 6.63 percentage points compared with the off-the-shelf Whisper Large-v3 (42.88% to 36.25%) and scores higher on the translation quality metrics (BLEU and METEOR). This supports the choice of Whisper Medium for production environments that need fast inference and a low GPU footprint.
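The comparison can be checked directly from the table. The short sketch below recomputes the absolute WER gap (in percentage points) and also derives the relative reduction, which is not stated in the card; the helper name `wer_reduction` is hypothetical.

```python
# WER values (percent) copied from the evaluation table.
WER = {
    "SpecDox-Whisper-Medium": 36.25,
    "Whisper Large-v3": 42.88,
    "Whisper Medium (Baseline)": 45.33,
}


def wer_reduction(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = baseline - improved
    relative = 100.0 * absolute / baseline
    return absolute, relative


for name in ("Whisper Large-v3", "Whisper Medium (Baseline)"):
    absolute, relative = wer_reduction(WER[name], WER["SpecDox-Whisper-Medium"])
    print(f"vs {name}: {absolute:.2f} points absolute, {relative:.2f}% relative")
```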
Ideal Use Cases
The model is intended for the following tasks:
- Real-Time Translation: Live transcription and translation of Urdu audio, podcasts, or lectures into English.
- Voice-to-Text Document Generation: Converting dictated Urdu notes into structured English reports (the primary function of SpecDox).
- Cross-Lingual ASR: Handling Pakistani accents and regional Urdu pronunciations with high accuracy.
- Edge Deployment: Running high-accuracy audio translation on hardware with limited VRAM (Google Colab free tier, local RTX GPUs, etc.).
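As a rough guide to the limited-VRAM claim, the sketch below estimates the memory needed just to hold the weights, using the 769M parameter count quoted above. It deliberately ignores activations, the decoder cache, and framework overhead, so actual GPU usage will be higher; the function name is hypothetical.

```python
def weight_memory_gib(num_parameters: float, bytes_per_parameter: int) -> float:
    """Memory for the model weights alone, in GiB (excludes activations and caches)."""
    return num_parameters * bytes_per_parameter / 2**30


WHISPER_MEDIUM_PARAMS = 769e6  # parameter count quoted in this model card

for label, nbytes in (("FP32", 4), ("FP16/BF16", 2)):
    size = weight_memory_gib(WHISPER_MEDIUM_PARAMS, nbytes)
    print(f"{label}: {size:.2f} GiB")
```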
How to Use in Python
```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the SpecDox Whisper model and its processor
processor = WhisperProcessor.from_pretrained("Shzaib/SpecDox-Whisper-Medium")
model = WhisperForConditionalGeneration.from_pretrained("Shzaib/SpecDox-Whisper-Medium").to(device)


def translate_urdu_audio(audio_array, sampling_rate=16000):
    """Translate a mono Urdu waveform (16 kHz, as Whisper expects) into English text."""
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt").to(device)
    # Force the decoder to translate Urdu to English.
    # Note: recent transformers releases deprecate forced_decoder_ids; there,
    # passing language="urdu", task="translate" to generate() may be needed instead.
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="urdu", task="translate")
    predicted_ids = model.generate(inputs["input_features"], forced_decoder_ids=forced_decoder_ids)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```
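Whisper's feature extractor works on 30-second windows, so longer recordings need to be split before calling `translate_urdu_audio`. The following is a minimal sketch of one way to do that; `chunk_audio` is a hypothetical helper, and fixed-size splitting is an assumption made for simplicity (it can cut words at chunk boundaries, which overlap or voice-activity detection would mitigate).

```python
def chunk_audio(samples, sampling_rate=16000, chunk_seconds=30.0):
    """Split a 1-D sequence of audio samples into consecutive fixed-length chunks.

    Works on lists or any sliceable array; the last chunk may be shorter.
    """
    if sampling_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sampling_rate and chunk_seconds must be positive")
    chunk_len = int(sampling_rate * chunk_seconds)
    return [samples[start:start + chunk_len] for start in range(0, len(samples), chunk_len)]


# 70 seconds of silence at 16 kHz as stand-in audio.
dummy_audio = [0.0] * (16000 * 70)
chunks = chunk_audio(dummy_audio)
print([len(chunk) / 16000 for chunk in chunks])  # chunk durations in seconds
```

Each chunk could then be passed to `translate_urdu_audio` in turn and the resulting strings joined with spaces.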
Base model: openai/whisper-medium