🇵🇰 Urdu LLaMA — Pakistan's First Fine-tuned Urdu LLM

Llama 3.2-1B fine-tuned on a Pakistani corpus — Urdu news, literature, Islamic knowledge, history, culture and code-switching (Urdu+English)

Model Summary

Urdu LLaMA is Pakistan's first open-source instruction-following language model fine-tuned specifically on Pakistani and Urdu content. It is built on top of meta-llama/Llama-3.2-1B-Instruct and fine-tuned using QLoRA (4-bit quantization + LoRA) on a curated Pakistani corpus.

The model understands and generates text in:

  • 🇵🇰 Urdu (primary language)
  • 🌐 English (secondary)
  • 💬 Code-switching (Urdu + English mixed — as spoken by Pakistanis daily)

🚀 Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Nimra28/urdu-llama-pakistan"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

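def ask_urdu(question):
    # Build a Llama 3 style chat prompt by hand. The Urdu system prompt says:
    # "You are a helpful Pakistani AI assistant who can answer in Urdu and English."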
    prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
آپ ایک مددگار پاکستانی AI اسسٹنٹ ہیں جو اردو اور انگریزی میں جواب دے سکتے ہیں۔<|eot_id|><|start_header_id|>user<|end_header_id|>
{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
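    # The prompt already starts with <|begin_of_text|>, so skip the tokenizer's
    # automatic BOS token to avoid a duplicate.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)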
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=200,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    )
    return response.strip()

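# Try it
# "Tell me the history of Pakistan"
print(ask_urdu("پاکستان کی تاریخ بتائیں"))
# "Who was Allama Iqbal?"
print(ask_urdu("علامہ اقبال کون تھے؟"))
# "What is machine learning?" (code-switched Urdu + English)
print(ask_urdu("Machine learning کیا ہے؟"))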
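The hand-written prompt in `ask_urdu` can be factored into a small helper, which also makes multi-turn conversations possible. The following is a minimal sketch, assuming the same single-newline layout as the quick start above; `build_prompt` and `SYSTEM_PROMPT` are illustrative names and are not shipped with the model.

```python
# Sketch: build a Llama 3 style chat prompt from a list of messages.
# The layout copies the quick-start prompt (one newline after each header).

SYSTEM_PROMPT = "آپ ایک مددگار پاکستانی AI اسسٹنٹ ہیں جو اردو اور انگریزی میں جواب دے سکتے ہیں۔"


def build_prompt(messages, system=SYSTEM_PROMPT):
    """Return a prompt string that ends with an open assistant turn.

    `messages` is a list of {"role": "user" | "assistant", "content": str}.
    """
    parts = ["<|begin_of_text|>"]
    for msg in [{"role": "system", "content": system}] + list(messages):
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n"
            f"{msg['content']}<|eot_id|>"
        )
    # Leave the assistant header open so generation continues from here.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n")
    return "".join(parts)


prompt = build_prompt([{"role": "user", "content": "علامہ اقبال کون تھے؟"}])
```

The library-provided alternative is `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, which uses the template shipped with the tokenizer; its whitespace may differ from the prompt above, so which of the two matches the fine-tuning data is not something this sketch can confirm.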

📚 Training Data

The model was fine-tuned on a curated Pakistani corpus covering:

| Domain | Description |
|---|---|
| 📰 Urdu News | Pakistani current affairs, politics, sports |
| 📚 Urdu Literature | Poetry by Iqbal, Ghalib, Mir, Faiz Ahmed Faiz |
| 🕌 Islamic Knowledge | Namaz, Zakat, Hajj, Ramadan, Quran knowledge |
| 🏛️ Pakistani History | Independence, Quaid-e-Azam, national leaders |
| 🎓 Education | CSS exam, MDCAT, academic topics |
| 🍛 Culture & Food | Traditions, weddings, cuisine, festivals |
| 💬 Code-switching | Natural Urdu+English mixed conversations |
| 🌍 Geography | Provinces, cities, K2, national landmarks |

⚙️ Training Details

| Parameter | Value |
|---|---|
| Base Model | unsloth/Llama-3.2-1B-Instruct |
| Method | QLoRA (4-bit quantization + LoRA) |
| LoRA Rank | r=16, alpha=32 |
| LoRA Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| LoRA Dropout | 0.05 |
| Training Epochs | 3 |
| Learning Rate | 2e-4 |
| Batch Size | 2 (effective 8 with gradient accumulation) |
| Max Sequence Length | 512 tokens |
| Optimizer | paged_adamw_8bit |
| Hardware | Google Colab T4 GPU (free tier) |
| Training Time | ~35 minutes |
| Quantization | 4-bit NF4 with double quantization |
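
To make the hyperparameters above concrete, here is a rough sketch of how they might be expressed with `transformers` and `peft`, together with two pieces of arithmetic they imply. It is not the author's training script. The compute dtype, `bias`, `task_type`, the single-GPU assumption, and the layer shapes (taken from the published Llama 3.2 1B configuration) are assumptions; `HPARAMS`, `LAYER_SHAPES` and the helper names are illustrative.

```python
# Sketch only: the listed hyperparameters expressed with transformers + peft.
# Values marked "assumed" are not stated in the model card.

HPARAMS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "per_device_batch_size": 2,
    "effective_batch_size": 8,
}

# (in_features, out_features) of each linear layer in Llama 3.2 1B, assumed from
# its published config: hidden 2048, MLP 8192, 8 KV heads of size 64, 16 layers.
LAYER_SHAPES = {
    "q_proj": (2048, 2048), "k_proj": (2048, 512), "v_proj": (2048, 512),
    "o_proj": (2048, 2048), "gate_proj": (2048, 8192),
    "up_proj": (2048, 8192), "down_proj": (8192, 2048),
}
NUM_LAYERS = 16


def grad_accum_steps(hp=HPARAMS, num_devices=1):
    """Accumulation steps needed to reach the effective batch size."""
    return hp["effective_batch_size"] // (hp["per_device_batch_size"] * num_devices)


def lora_trainable_params(hp=HPARAMS):
    """Each adapted layer adds r * (in_features + out_features) weights."""
    per_block = sum(hp["r"] * (n_in + n_out)
                    for name, (n_in, n_out) in LAYER_SHAPES.items()
                    if name in hp["target_modules"])
    return per_block * NUM_LAYERS


def build_qlora_configs(hp=HPARAMS):
    """Return (BitsAndBytesConfig, LoraConfig); heavy imports are deferred."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # 4-bit NF4
        bnb_4bit_use_double_quant=True,        # double quantization
        bnb_4bit_compute_dtype=torch.float16,  # assumed; T4 lacks native bfloat16
    )
    lora = LoraConfig(
        r=hp["r"], lora_alpha=hp["lora_alpha"], lora_dropout=hp["lora_dropout"],
        target_modules=hp["target_modules"],
        bias="none", task_type="CAUSAL_LM",    # assumed
    )
    return bnb, lora


print(grad_accum_steps())       # 4
print(lora_trainable_params())  # 11272192
```

Under these assumptions the adapter trains roughly 11.3 million weights on top of the frozen 4-bit base model, and with alpha=32 and r=16 the LoRA update is scaled by alpha/r = 2.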

📊 Evaluation

The model was evaluated on held-out Urdu test examples:

  • Perplexity: substantially lower than the base model's on the held-out Urdu text
  • Qualitative evaluation: the model responds correctly in Urdu to Pakistani cultural, historical, and Islamic questions
  • Code-switching: Handles natural Urdu+English mixing as spoken by Pakistanis
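
The card does not say how perplexity was measured. As a hedged illustration, the sketch below uses the common definition (the exponential of the mean per-token negative log-likelihood) and shows one way to obtain it from a Hugging Face causal LM; `perplexity` and `model_perplexity` are illustrative names, and the single-sequence, no-striding setup is an assumption.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def model_perplexity(model, tokenizer, text):
    """Perplexity of `text` under a Hugging Face causal LM (single sequence)."""
    import torch  # deferred so the pure helper above has no dependencies

    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Passing labels makes the model return the mean next-token loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())


# A model that assigns probability 1/2 to every token has perplexity 2.
print(perplexity([math.log(2)] * 4))
```

Running `model_perplexity` on the same held-out texts for both the base model and the fine-tuned model would give a like-for-like comparison; lower is better.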

💬 Example Outputs

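Question: پاکستان کب آزاد ہوا؟
(Translation: When did Pakistan become independent?)

پاکستان 14 اگست 1947 کو برطانوی استعمار سے آزاد ہوا۔ یہ دن ہر سال یوم آزادی کے طور پر منایا جاتا ہے۔ قائداعظم محمد علی جناح پاکستان کے پہلے گورنر جنرل بنے۔

Translation: Pakistan became independent from British colonial rule on 14 August 1947. This day is celebrated every year as Independence Day. Quaid-e-Azam Muhammad Ali Jinnah became Pakistan's first Governor-General.

Question: Machine learning کیا ہے؟
(Translation: What is machine learning?)

Machine learning یعنی مشینی سیکھ ایک ایسی technology ہے جس میں computer خود بخود data سے سیکھتا ہے۔ جیسے ہم experience سے سیکھتے ہیں، ویسے ہی machine بھی examples سے patterns پہچاننا سیکھتی ہے۔

Translation: Machine learning, that is, "machine learning" in Urdu, is a technology in which a computer learns from data on its own. Just as we learn from experience, a machine likewise learns to recognize patterns from examples.

Question: علامہ اقبال کون تھے؟
(Translation: Who was Allama Iqbal?)

علامہ اقبال پاکستان کے قومی شاعر اور مفکر تھے۔ انہوں نے 1930 میں مسلمانوں کے لیے ایک الگ وطن کا خواب پیش کیا۔ انہیں شاعر مشرق اور مفکر پاکستان کہا جاتا ہے۔

Translation: Allama Iqbal was the national poet and thinker of Pakistan. In 1930 he presented the vision of a separate homeland for Muslims. He is known as the Poet of the East and the Thinker of Pakistan.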


⚠️ Limitations

  • Fine-tuned on a relatively small dataset (~30 curated examples); a larger dataset is expected to improve quality
  • May occasionally mix Urdu and English even when pure Urdu is requested
  • Islamic knowledge is general — not a substitute for qualified scholars
  • Political opinions reflect training data — use with caution
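
The language-mixing limitation can be checked automatically. The sketch below, which uses only the standard library, counts Arabic-script and Latin letters in a reply so that outputs meant to be pure Urdu can be flagged when Latin words slip in. `script_mix` and `classify` are illustrative names, and this is a script check rather than language identification: Arabic, Persian, Sindhi and Pashto text would also count as "urdu" here.

```python
import unicodedata

# Unicode ranges that hold Arabic-script letters (the script Urdu is written in).
_ARABIC_RANGES = ((0x0600, 0x06FF), (0x0750, 0x077F),
                  (0xFB50, 0xFDFF), (0xFE70, 0xFEFF))


def _is_arabic_script(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _ARABIC_RANGES)


def script_mix(text):
    """Return (arabic_script_letters, latin_letters); other characters are ignored."""
    urdu = latin = 0
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, combining marks
        if _is_arabic_script(ch):
            urdu += 1
        elif "LATIN" in unicodedata.name(ch, ""):
            latin += 1
    return urdu, latin


def classify(text):
    """Label text as 'urdu', 'english', 'mixed', or 'other' by script."""
    urdu, latin = script_mix(text)
    if urdu and latin:
        return "mixed"
    if urdu:
        return "urdu"
    return "english" if latin else "other"


print(classify("پاکستان کب آزاد ہوا؟"))       # urdu
print(classify("Machine learning کیا ہے؟"))  # mixed
```

A reply classified as "mixed" when pure Urdu was requested could be regenerated or logged for later review.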

🔮 Future Work

  • Scale training data to 10,000+ Urdu examples
  • Add Urdu news datasets from Dawn, Geo, ARY
  • Fine-tune on Urdu Wikipedia dump
  • Add Punjabi, Sindhi, Pashto support
  • Release a larger 7B-parameter version
  • Benchmark on Urdu NLP tasks

📄 Citation

If you use this model, please cite:

@misc{urdu-llama-pakistan-2024,
  title={Urdu LLaMA: Pakistan's First Fine-tuned Urdu Language Model},
  author={Nimra Tariq},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/Nimra28/urdu-llama-pakistan}
}

👩‍💻 Developed By

Nimra Tariq — AI Engineer & Assistant Professor, Superior University, Pakistan


📜 License

This model is built on Llama 3.2 and follows the Llama 3.2 Community License.
