Lafzyn
Lafzyn is a fine-tuned Qwen3.5-0.8B model that converts Urdu script into IPA phonetic transcription.
Model Details
| Property | Value |
|---|---|
| Base model | Qwen3.5-0.8B |
| Task | Urdu Grapheme-to-Phoneme (G2P) |
| Output format | IPA transcription |
| Training data | 100,000 Urdu-IPA pairs |
Evaluation Results
Evaluated on 500 held-out samples from the phoneme map using Phoneme Error Rate (PER), the character-level edit distance between the predicted and expected IPA strings (lower is better).
| Metric | Value |
|---|---|
| Mean PER | 16.94% |
| Perfect transcriptions (PER = 0) | 88 / 500 (17.6%) |
| Near perfect (PER < 10%) | 166 / 500 (33.2%) |
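The metric and the summary rows above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation script used for this card: the function names `edit_distance`, `per`, and `summarize` are hypothetical, and dividing by the reference length is an assumption, since the card does not state the denominator.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a predicted character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def per(ref, hyp):
    """Edit distance divided by reference length (assumed normalisation)."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def summarize(pers):
    """Aggregate per-sample PER values into the three reported figures."""
    return {
        "mean_per": sum(pers) / len(pers),
        "perfect": sum(p == 0 for p in pers),
        # Counts every sample under 10%, perfect ones included.
        "near_perfect": sum(p < 0.10 for p in pers),
    }
```

Because the distance is counted per code point, a combining diacritic such as the dental mark in `t̪` counts as its own character; a phoneme-level tokenizer would give different numbers.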
Sample Outputs
Best predictions (PER = 0.000):
| Urdu | Expected IPA | Predicted IPA |
|---|---|---|
| مختار احمد | mʊxˈt̪aːr ˈæhməd | mʊxˈt̪aːr ˈæhməd |
| قابل رحم | qɑːˈbɪl rɛˈhəm | qɑːˈbɪl rɛˈhəm |
| رواں ہفتے | rəˈʋãː ˈhəft̪eː | rəˈʋãː ˈhəft̪eː |
| دفاعی کمیشن | d̪ɪˈfaː.iː kəˈmɪ.ʃən | d̪ɪˈfaː.iː kəˈmɪ.ʃən |
| نفیس المزاج | nəˈfiːs əlmɪˈzaːd͡ʒ | nəˈfiːs əlmɪˈzaːd͡ʒ |
Hardest cases (Arabic loanwords & rare compounds):
| Urdu | Expected IPA | Predicted IPA | PER |
|---|---|---|---|
| ائمہ سبعہ | aɪˈʔɪm.maː ˈsab.ʕa | ˈɛːmɑː sɪˈbɑː | 0.722 |
| نفع المصرف | ˈnafʕaː alˈmasˤrif | nəˈfɑːl ʔalˈmɪrɑːf | 0.667 |
| میٹر کیولیٹ | ˈmeːʈər ˈkjuːlɪt | meːʈər kiːˈloːɛt̪ | 0.562 |
High-error cases are predominantly rare Arabic-origin compounds and technical transliterations, both of which are underrepresented in the training data.
Quick Start
import torch
from transformers import pipeline

# Load the fine-tuned model as a chat-style text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model="mahwizzzz/qwen3.5-g2p",
    torch_dtype=torch.float16,
    device_map="auto",
)

# The system prompt fixes the task; the user turn carries the Urdu input.
messages = [
    {"role": "system", "content": "You are an expert Urdu linguist and phonetician. Convert the given Urdu word or phrase into its IPA transcription. Output only the IPA string, nothing else."},
    {"role": "user", "content": "پاکستان"},
]

# Greedy decoding (do_sample=False) keeps the transcription deterministic.
out = pipe(messages, max_new_tokens=64, do_sample=False)
print(out[0]["generated_text"][-1]["content"])  # → pɑːkɪsˈt̪aːn
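For more than one input, the same call can be wrapped in a small helper. The sketch below assumes `pipe` is the pipeline object created in the Quick Start snippet and reuses its system prompt verbatim; `build_messages` and `transcribe` are hypothetical names introduced here and are not part of the model repository.

```python
SYSTEM_PROMPT = (
    "You are an expert Urdu linguist and phonetician. "
    "Convert the given Urdu word or phrase into its IPA transcription. "
    "Output only the IPA string, nothing else."
)


def build_messages(text):
    """Wrap one Urdu word or phrase in the chat format used in Quick Start."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text.strip()},
    ]


def transcribe(pipe, texts, max_new_tokens=64):
    """Run each input through `pipe` and return a dict of input -> IPA."""
    results = {}
    for text in texts:
        out = pipe(
            build_messages(text),
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding, as in Quick Start
        )
        # The last message in the returned conversation is the model's reply.
        results[text] = out[0]["generated_text"][-1]["content"].strip()
    return results
```

Calling `transcribe(pipe, ["پاکستان", "قابل رحم"])` would then return one IPA string per input. Inputs are processed one at a time for clarity; batching is left out of this sketch.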
Phoneme Inventory
The model covers the full standard Urdu IPA phoneme set:
- Stops: b p t̪ ʈ d̪ ɖ k ɡ q ʔ
- Fricatives: f s ʃ z ʒ x ɣ ɦ h ħ ʕ
- Affricates: tʃ dʒ
- Nasals: m n ɳ ŋ n̪
- Liquids & glides: r ɾ ɽ l w j ʋ
- Vowels: ə ɪ ʊ aː iː uː eː oː ɛ ɔ æ
- Diacritics: ː (length) ̃ (nasalization) ˈ ˌ (stress)
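One way to use the inventory above is as a sanity check on model output: flag any character that falls outside the listed symbols. The sketch below treats the inventory as a plain character whitelist, which is an assumption made for illustration; `INVENTORY`, `ALLOWED`, and `unknown_symbols` are hypothetical names, not part of the model or its tooling.

```python
import unicodedata

# Symbols copied from the inventory list. Combining and modifier marks are
# written as escapes so they stay visible: U+032A dental, U+0303 nasalization,
# U+02D0 length, U+02C8 / U+02CC primary / secondary stress.
INVENTORY = (
    "b p t\u032a ʈ d\u032a ɖ k ɡ q ʔ "
    "f s ʃ z ʒ x ɣ ɦ h ħ ʕ "
    "tʃ dʒ "
    "m n ɳ ŋ n\u032a "
    "r ɾ ɽ l w j ʋ "
    "ə ɪ ʊ a i u e o ɛ ɔ æ "
    "\u02d0 \u0303 \u02c8 \u02cc"
)

# Decompose first so precomposed letters are split into base + combining mark.
ALLOWED = set(unicodedata.normalize("NFD", INVENTORY)) - {" "}


def unknown_symbols(ipa):
    """Return characters in `ipa` that are not in the inventory, in order."""
    seen = []
    for ch in unicodedata.normalize("NFD", ipa):
        if ch.isspace() or ch in ALLOWED or ch in seen:
            continue
        seen.append(ch)
    return seen
```

Run against the sample outputs in this card, a check like this would report the syllable dot `.`, the affricate tie bar, and `ɑ`, since those appear in transcriptions but not in the list above. Whether to extend the whitelist or treat them as errors is a choice left to the user.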
Limitations
- Arabic-origin loanwords and rare compounds show higher error rates (PER > 0.5 in the hardest cases)
- Optimised for Modern Standard Urdu; regional dialects and heavy code-switching may degrade accuracy
Related Resources
| Resource | Link |
|---|---|
| 🤗 Live Demo | spaces/mahwizzzz/lafzyn |
| 📦 GGUF (llama.cpp / Ollama) | mahwizzzz/lafzyn-gguf |
| 🏋 Base model | Qwen/Qwen3.5-0.8B |
Citation
@misc{lafzyn2026,
title = {Lafzyn: Urdu Grapheme-to-Phoneme with Qwen3.5-0.8B},
author = {Mahwiz Khalil},
year = {2026},
url = {https://huggingface.co/mahwizzzz/lafzyn},
note = {Fine-tuned on 100k Urdu IPA pairs, mean PER 16.94\%}
}