Lafzyn

Lafzyn is a fine-tuned Qwen3.5-0.8B model that converts Urdu script into IPA phonetic transcription.

Model Details

| Property | Value |
| --- | --- |
| Base model | Qwen3.5-0.8B |
| Task | Urdu Grapheme-to-Phoneme (G2P) |
| Output format | IPA transcription |
| Training data | 100,000 Urdu–IPA pairs |

Evaluation Results

Evaluated on 500 held-out samples from the phoneme map using Phoneme Error Rate (PER): the character-level edit distance between the predicted and expected IPA strings (lower is better).
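The metric can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation script used for the numbers reported here: it assumes that the edit distance is the standard Levenshtein distance over Unicode code points (so a combining diacritic counts as its own character) and that the distance is normalised by the length of the expected string. The function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted per code point."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into "" takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r
                curr[j - 1] + 1,     # insert h
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: str, hyp: str) -> float:
    """PER = edit distance / reference length (assumed normalisation)."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition an exact match scores 0.0, and a prediction that is much longer than the reference can score above 1.0.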

| Metric | Value |
| --- | --- |
| Mean PER | 16.94% |
| Perfect transcriptions (PER = 0) | 88 / 500 (17.6%) |
| Near perfect (PER < 10%) | 166 / 500 (33.2%) |
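For readers reproducing these aggregates from their own per-sample scores, the following sketch shows one way to compute them. It assumes that scores are stored as fractions (0.10 for 10%) and that the "near perfect" bucket also counts the perfect transcriptions; `summarize` is a hypothetical helper, not part of the model's tooling.

```python
from statistics import mean


def summarize(per_scores: list[float]) -> dict[str, float]:
    """Aggregate per-sample PER values into the three reported metrics."""
    if not per_scores:
        raise ValueError("need at least one PER score")
    n = len(per_scores)
    perfect = sum(1 for s in per_scores if s == 0)
    near_perfect = sum(1 for s in per_scores if s < 0.10)  # includes perfect ones
    return {
        "mean_per": mean(per_scores),
        "perfect": perfect,
        "perfect_share": perfect / n,
        "near_perfect": near_perfect,
        "near_perfect_share": near_perfect / n,
    }
```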

Sample Outputs

Best predictions (PER = 0.000):

| Urdu | Expected IPA | Predicted IPA |
| --- | --- | --- |
| مختار احمد | mʊxˈt̪aːr ˈæhməd | mʊxˈt̪aːr ˈæhməd |
| قابل رحم | qɑːˈbɪl rɛˈhəm | qɑːˈbɪl rɛˈhəm |
| رواں ہفتے | rəˈʋãː ˈhəft̪eː | rəˈʋãː ˈhəft̪eː |
| دفاعی کمیشن | d̪ɪˈfaː.iː kəˈmɪ.ʃən | d̪ɪˈfaː.iː kəˈmɪ.ʃən |
| نفیس المزاج | nəˈfiːs əlmɪˈzaːd͡ʒ | nəˈfiːs əlmɪˈzaːd͡ʒ |

Hardest cases (Arabic loanwords & rare compounds):

| Urdu | Expected IPA | Predicted IPA | PER |
| --- | --- | --- | --- |
| ائمہ سبعہ | aɪˈʔɪm.maː ˈsab.ʕa | ˈɛːmɑː sɪˈbɑː | 0.722 |
| نفع المصرف | ˈnafʕaː alˈmasˤrif | nəˈfɑːl ʔalˈmɪrɑːf | 0.667 |
| میٹر کیولیٹ | ˈmeːʈər ˈkjuːlɪt | meːʈər kiːˈloːɛt̪ | 0.562 |

High-error cases are predominantly rare Arabic-origin compound words and technical transliterations, both of which are underrepresented in the training data.


Quick Start

from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="mahwizzzz/qwen3.5-g2p",
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert Urdu linguist and phonetician. Convert the given Urdu word or phrase into its IPA transcription. Output only the IPA string, nothing else."},
    {"role": "user",   "content": "پاکستان"},
]
out = pipe(messages, max_new_tokens=64, do_sample=False)
print(out[0]["generated_text"][-1]["content"])  # → pɑːkɪsˈt̪aːn
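To transcribe several inputs, the prompt construction and output extraction from the Quick Start can be wrapped in a small helper. This is only a sketch: `build_messages` and `transcribe` are hypothetical names, and `pipe` is assumed to be a chat-style text-generation pipeline that returns the same structure as in the Quick Start.

```python
SYSTEM_PROMPT = (
    "You are an expert Urdu linguist and phonetician. Convert the given Urdu "
    "word or phrase into its IPA transcription. Output only the IPA string, "
    "nothing else."
)


def build_messages(text: str) -> list[dict[str, str]]:
    """Build the two-message chat prompt used in the Quick Start."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text},
    ]


def transcribe(pipe, text: str, max_new_tokens: int = 64) -> str:
    """Run one Urdu word or phrase through the pipeline and return the IPA."""
    out = pipe(build_messages(text), max_new_tokens=max_new_tokens, do_sample=False)
    # The last message of the returned conversation is the assistant's reply.
    return out[0]["generated_text"][-1]["content"].strip()
```

With the pipeline created as in the Quick Start, a batch could then be processed with `[transcribe(pipe, w) for w in words]`.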

Phoneme Inventory

The model covers the full standard Urdu IPA phoneme set:

  • Stops: b p t̪ ʈ d̪ ɖ k ɡ q ʔ
  • Fricatives: f s ʃ z ʒ x ɣ ɦ h ħ ʕ
  • Affricates: tʃ dʒ
  • Nasals: m n ɳ ŋ n̪
  • Liquids & glides: r ɾ ɽ l w j ʋ
  • Vowels: ə ɪ ʊ aː iː uː eː oː ɛ ɔ æ
  • Diacritics: ː (length) ̃ (nasalization) ˈ ˌ (stress)
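The inventory above can serve as a cheap sanity check on model output. The sketch below collects every character from the list and reports any character in a transcription that falls outside it. It is a character-level check only, under the assumption that Unicode NFD normalisation is applied so precomposed and combining forms compare equal; `unknown_symbols` is a hypothetical helper. Marks that appear in the sample outputs but not in the list, such as the syllable dot, would need to be added to the set before using it as a filter.

```python
import unicodedata

# Symbols copied from the inventory list; the combining tilde is written
# as an escape so it does not attach to a neighbouring character.
INVENTORY = (
    "b p t̪ ʈ d̪ ɖ k ɡ q ʔ "
    "f s ʃ z ʒ x ɣ ɦ h ħ ʕ "
    "tʃ dʒ "
    "m n ɳ ŋ n̪ "
    "r ɾ ɽ l w j ʋ "
    "ə ɪ ʊ aː iː uː eː oː ɛ ɔ æ "
    "ː \u0303 ˈ ˌ"
)

ALLOWED = {
    ch for ch in unicodedata.normalize("NFD", INVENTORY) if not ch.isspace()
}


def unknown_symbols(ipa: str) -> list[str]:
    """Return characters not in the inventory, in order of first appearance."""
    seen: list[str] = []
    for ch in unicodedata.normalize("NFD", ipa):
        if ch.isspace() or ch in ALLOWED or ch in seen:
            continue
        seen.append(ch)
    return seen
```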

Limitations

  • Arabic-origin loanwords and rare compounds show higher error rates (PER > 0.5)
  • Optimised for Modern Standard Urdu; regional dialects and heavy code-switching may degrade accuracy

Related Resources

| Resource | Link |
| --- | --- |
| 🤗 Live Demo | spaces/mahwizzzz/lafzyn |
| 📦 GGUF (llama.cpp / Ollama) | mahwizzzz/lafzyn-gguf |
| 🏋 Base model | Qwen/Qwen3.5-0.8B |

Citation

@misc{lafzyn2026,
  title   = {Lafzyn: Urdu Grapheme-to-Phoneme with Qwen3.5-0.8B},
  author  = {Mahwiz Khalil},
  year    = {2026},
  url     = {https://huggingface.co/mahwizzzz/lafzyn},
  note    = {Fine-tuned on 100k Urdu IPA pairs, mean PER 16.94\%}
}