Lafzyn

Lafzyn is a fine-tuned Qwen3.5-0.8B model that converts Urdu script into IPA phonetic transcription.

Model Details

Property	Value
Base model	`Qwen3.5-0.8B`
Task	Urdu Grapheme-to-Phoneme (G2P)
Output format	IPA transcription
Training data	100,000 Urdu IPA pairs

Evaluation Results

Evaluated on 500 held-out samples from the phoneme map using Phoneme Error Rate (PER), a character-level edit distance computed on IPA strings (lower is better).

Metric	Value
Mean PER	16.94%
Perfect transcriptions (PER = 0)	88 / 500 (17.6%)
Near perfect (PER < 10%)	166 / 500 (33.2%)
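
The metrics above can be reproduced in outline with a small script. The following is a minimal sketch, not the evaluation code used for this model: it assumes PER is the Levenshtein distance over Unicode code points divided by the length of the reference string, and the function names (`edit_distance`, `per`, `summarize`) are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def per(ref: str, hyp: str) -> float:
    """Edit distance divided by reference length (assumed normalization)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def summarize(scores: list[float]) -> dict:
    """Aggregate per-sample PER values into the three reported metrics."""
    return {
        "mean_per": sum(scores) / len(scores),
        "perfect": sum(s == 0 for s in scores),
        # Assumption: "near perfect" counts every sample below 10%,
        # including the perfect ones.
        "near_perfect": sum(s < 0.10 for s in scores),
    }
```

Combining marks such as the dental diacritic count as separate code points under this scheme, so a missing diacritic costs one edit.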

Sample Outputs

Best predictions (PER = 0.000):

Urdu	Expected IPA	Predicted IPA
مختار احمد	`mʊxˈt̪aːr ˈæhməd`	`mʊxˈt̪aːr ˈæhməd`
قابل رحم	`qɑːˈbɪl rɛˈhəm`	`qɑːˈbɪl rɛˈhəm`
رواں ہفتے	`rəˈʋãː ˈhəft̪eː`	`rəˈʋãː ˈhəft̪eː`
دفاعی کمیشن	`d̪ɪˈfaː.iː kəˈmɪ.ʃən`	`d̪ɪˈfaː.iː kəˈmɪ.ʃən`
نفیس المزاج	`nəˈfiːs əlmɪˈzaːd͡ʒ`	`nəˈfiːs əlmɪˈzaːd͡ʒ`

Hardest cases (Arabic loanwords & rare compounds):

Urdu	Expected IPA	Predicted IPA	PER
ائمہ سبعہ	`aɪˈʔɪm.maː ˈsab.ʕa`	`ˈɛːmɑː sɪˈbɑː`	0.722
نفع المصرف	`ˈnafʕaː alˈmasˤrif`	`nəˈfɑːl ʔalˈmɪrɑːf`	0.667
میٹر کیولیٹ	`ˈmeːʈər ˈkjuːlɪt`	`meːʈər kiːˈloːɛt̪`	0.562

High-error cases are predominantly rare Arabic-origin compound words and technical transliterations that are underrepresented in the training data.

Quick Start

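from transformers import pipeline
import torch

# Load the fine-tuned model as a chat-style text-generation pipeline.
# device_map="auto" requires the accelerate package.
pipe = pipeline(
    "text-generation",
    model="mahwizzzz/qwen3.5-g2p",
    torch_dtype=torch.float16,
    device_map="auto",
)

# The system prompt fixes the task; the user turn carries the Urdu input.
messages = [
    {"role": "system", "content": "You are an expert Urdu linguist and phonetician. Convert the given Urdu word or phrase into its IPA transcription. Output only the IPA string, nothing else."},
    {"role": "user",   "content": "پاکستان"},
]
# Greedy decoding (do_sample=False) keeps the output deterministic.
out = pipe(messages, max_new_tokens=64, do_sample=False)
# The last message in generated_text is the assistant reply.
print(out[0]["generated_text"][-1]["content"])  # → pɑːkɪsˈt̪aːn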
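
To transcribe several inputs with the same prompt, the call above can be wrapped in a small helper. This is a sketch under assumptions: `build_messages` and `transcribe` are hypothetical names, the `pipe` argument is expected to behave like the pipeline object created above, and the trailing `.strip()` is an addition that only trims surrounding whitespace.

```python
SYSTEM_PROMPT = (
    "You are an expert Urdu linguist and phonetician. "
    "Convert the given Urdu word or phrase into its IPA transcription. "
    "Output only the IPA string, nothing else."
)


def build_messages(text: str) -> list[dict]:
    """Build the two-turn chat input used in the Quick Start example."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text},
    ]


def transcribe(pipe, text: str, max_new_tokens: int = 64) -> str:
    """Run one input through a chat-style pipeline and return the reply text."""
    out = pipe(build_messages(text), max_new_tokens=max_new_tokens, do_sample=False)
    # The assistant reply is the last message in the returned conversation.
    return out[0]["generated_text"][-1]["content"].strip()
```

With the pipeline loaded, `[transcribe(pipe, w) for w in words]` would then transcribe a list of Urdu strings one at a time.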

Phoneme Inventory

The model covers the full standard Urdu IPA phoneme set:

Stops: b p t̪ ʈ d̪ ɖ k ɡ q ʔ
Fricatives: f s ʃ z ʒ x ɣ ɦ h ħ ʕ
Affricates: tʃ dʒ
Nasals: m n ɳ ŋ n̪
Liquids & glides: r ɾ ɽ l w j ʋ
Vowels: ə ɪ ʊ aː iː uː eː oː ɛ ɔ æ
Diacritics: ː (length) ◌̃ (nasalization) ˈ ˌ (stress)
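
One way to use this inventory is as a sanity check on model output. The sketch below flags any code point in a predicted string that does not occur in the lists above. It is an illustration only: the name `unlisted_symbols` is hypothetical, the check works on individual code points rather than whole phonemes, and treating the space and the syllable dot `.` as allowed separators is an assumption.

```python
import unicodedata

# Inventory copied from the lists above; combining marks are written as
# escapes (U+032A dental bridge, U+0303 nasalization tilde).
INVENTORY_TEXT = """
b p t\u032a ʈ d\u032a ɖ k ɡ q ʔ
f s ʃ z ʒ x ɣ ɦ h ħ ʕ
tʃ dʒ
m n ɳ ŋ n\u032a
r ɾ ɽ l w j ʋ
ə ɪ ʊ aː iː uː eː oː ɛ ɔ æ
ː \u0303 ˈ ˌ
"""

# Allowed code points: everything in the inventory, plus the space and the
# syllable dot as separators (assumption, not part of the listed inventory).
ALLOWED = {
    ch for ch in unicodedata.normalize("NFD", INVENTORY_TEXT) if not ch.isspace()
} | {" ", "."}


def unlisted_symbols(ipa: str) -> list[str]:
    """Return the sorted code points in `ipa` that the inventory does not list."""
    # NFD splits precomposed letters such as ã into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", ipa)
    return sorted({ch for ch in decomposed if ch not in ALLOWED})
```

A non-empty result does not necessarily mean the transcription is wrong; it only signals a symbol that falls outside the lists as written.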

Limitations

Arabic-origin loanwords and rare compounds show higher error rates (PER can exceed 0.5)
Optimised for Modern Standard Urdu; regional dialects and heavy code-switching may degrade accuracy

Related Resources

Resource	Link
🤗 Live Demo	spaces/mahwizzzz/lafzyn
📦 GGUF (llama.cpp / Ollama)	mahwizzzz/lafzyn-gguf
🏋 Base model	Qwen/Qwen3.5-0.8B

Citation

@misc{lafzyn2026,
  title   = {Lafzyn: Urdu Grapheme-to-Phoneme with Qwen3.5-0.8B},
  author  = {Mahwiz Khalil},
  year    = {2026},
  url     = {https://huggingface.co/mahwizzzz/lafzyn},
  note    = {Fine-tuned on 100k Urdu IPA pairs, mean PER 16.94\%}
}

