Indic General Assistant 3B v1

A multilingual Indian language assistant fine-tuned from Qwen/Qwen2.5-3B-Instruct using QLoRA. Designed to handle Hindi, English, Bengali, Tamil, Telugu, Marathi, Gujarati, and code-mixed (Hinglish) inputs naturally.

Key Differentiators

Unlike existing Indic models (Krutrim, Airavata, Sarvam), this model offers:

  1. Code-Mixing Awareness: Detects and matches the user's language mixing ratio. If you write 70% Hindi 30% English, the model responds in the same style.
  2. Transliteration Handling: Understands both Roman Hindi ("mujhe samjhao") and Devanagari (मुझे समझाओ).
  3. Cultural Context: Trained on Indian-specific contexts — festivals, GST/PAN/Aadhaar, regional customs, IPL, Bollywood.
  4. Cross-lingual: Can follow instructions in one language and respond in another.
  5. Best tokenizer among the candidates considered: Qwen2.5's 151K vocab achieves ~0.5 NSL (normalized sequence length, lower is better) on Devanagari, compared with 0.59 for Llama-3.2 on Hindi and 1.4-1.8 on Dravidian scripts.
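
The "mixing ratio" in item 1 is not defined in this card. One rough way to approximate it for native-script input is to classify each word by script and compare the counts. The following is a minimal sketch under that assumption; `script_mix_ratio` is a hypothetical helper, not the model's internal mechanism, and it cannot separate Hindi from English in Roman-script Hinglish, where both share the Latin alphabet.

```python
def script_mix_ratio(text: str) -> dict[str, float]:
    """Estimate the word-level share of Devanagari vs. Latin script in `text`.

    Hypothetical illustration only: a word counts as Devanagari if it
    contains any character in U+0900..U+097F, otherwise as Latin if it
    contains any ASCII letter. Other tokens (digits, punctuation) are ignored.
    """
    counts = {"devanagari": 0, "latin": 0}
    for word in text.split():
        if any("\u0900" <= ch <= "\u097f" for ch in word):
            counts["devanagari"] += 1
        elif any(ch.isascii() and ch.isalpha() for ch in word):
            counts["latin"] += 1

    total = sum(counts.values())
    if total == 0:
        # No classifiable words: report zero for both scripts.
        return {"devanagari": 0.0, "latin": 0.0}
    return {script: n / total for script, n in counts.items()}


if __name__ == "__main__":
    print(script_mix_ratio("मुझे Python में list sort करना है"))
```

A ratio like this could be used to check whether a response roughly mirrors the prompt's mix, for example by comparing the value for the user turn against the value for the model's reply.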

Supported Languages

| Language | Script | Training % |
|---|---|---|
| Hindi | Devanagari + Roman | 40.1% |
| English | Latin | 24.9% |
| Hinglish (code-mixed) | Mixed | 15.1% |
| Bengali | Bengali script | 4.1% |
| Tamil | Tamil script | 4.0% |
| Telugu | Telugu script | 4.0% |
| Marathi | Devanagari | 3.9% |
| Gujarati | Gujarati script | 3.9% |

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and model; bfloat16 halves memory relative to float32.
model_id = "kshitij230/indic-general-assistant-3b-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# The user turn is Roman-script Hinglish.
messages = [
    {"role": "system", "content": "You are a helpful multilingual assistant for Indian users."},
    {"role": "user", "content": "Mujhe batao ki machine learning kaise kaam karta hai, simple language mein"}
]

# Render the conversation with the chat template and append the assistant prompt.
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Sample a response, then decode only the newly generated tokens.
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Example Conversations

Hinglish (Code-Mixed)

User: Mujhe help chahiye Python mein list sort karne mein
Assistant: List sort karna Python mein bahut easy hai. Do tarike hain:
  1. my_list.sort(): yeh original list modify karta hai
  2. sorted(my_list): yeh new sorted list return karta hai

Hindi (Devanagari)

User: दिवाली के बारे में बताओ
Assistant: दिवाली भारत का सबसे बड़ा त्यौहार है जिसे 'रोशनी का त्यौहार' कहते हैं...

Financial (Indian context)

User: Mutual funds mein invest kaise karein?
Assistant: KYC complete karo (PAN + Aadhaar), Zerodha/Groww pe account banao, SIP se start karo...

Training Details

| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-3B-Instruct (3.1B params) |
| Method | QLoRA (4-bit NF4 + LoRA r=16, alpha=32) |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning Rate | 2e-4 (cosine schedule) |
| Effective Batch | 8 (2 × 4 grad_accum) |
| Max Seq Length | 2048 |
| Optimizer | Paged AdamW 32-bit |
| Precision | bfloat16 |
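
These settings map fairly directly onto Hugging Face `transformers` and `peft` configuration objects. The author's training script is not shown in this card, so the following is only a sketch of how the listed values could be expressed, assuming a single GPU with `bitsandbytes` installed. `HPARAMS`, `effective_batch`, and `build_configs` are hypothetical names, and settings not listed above (such as LoRA dropout) are left at library defaults.

```python
# Hyperparameters copied from the Training Details listing.
HPARAMS = {
    "base_model": "Qwen/Qwen2.5-3B-Instruct",
    "lora_r": 16,
    "lora_alpha": 32,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "learning_rate": 2e-4,
    "per_device_batch": 2,
    "grad_accum": 4,
    "max_seq_length": 2048,  # passed to the SFT trainer; not wired in below
}


def effective_batch(h: dict = HPARAMS) -> int:
    """Effective batch size on one device: per-device batch x accumulation steps."""
    return h["per_device_batch"] * h["grad_accum"]


def build_configs(h: dict = HPARAMS, output_dir: str = "outputs"):
    """Build quantization, LoRA, and trainer configs (assumed setup, not the author's)."""
    # Imported lazily so the constants above can be inspected without a GPU stack.
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig, TrainingArguments

    # 4-bit NF4 quantization of the frozen base weights, computing in bfloat16.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    # Low-rank adapters on the attention and MLP projections.
    lora_config = LoraConfig(
        r=h["lora_r"],
        lora_alpha=h["lora_alpha"],
        target_modules=h["target_modules"],
        task_type="CAUSAL_LM",
    )
    training_args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=h["per_device_batch"],
        gradient_accumulation_steps=h["grad_accum"],
        learning_rate=h["learning_rate"],
        lr_scheduler_type="cosine",
        optim="paged_adamw_32bit",
        bf16=True,
    )
    return bnb_config, lora_config, training_args


if __name__ == "__main__":
    print("effective batch:", effective_batch())
```

The quantization config would be passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)`, and the LoRA config applied to the loaded model before training.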

Dataset Sources

| Source | Languages | Type |
|---|---|---|
| ai4bharat/indic-instruct-data-v0.1 | Hindi + English | Instruction-following |
| CohereLabs/aya_dataset | 12 Indic languages | Instruction-following |
| Abhishekcr448/Hinglish-Everyday-Conversations-1M | Hinglish | Conversations |
| findnitai/english-to-hinglish | En↔Hinglish | Translation |
| festvox/cmu_hinglish_dog | Hinglish | Grounded dialog |
| Hand-crafted synthetic | Code-mixed | 10 domains (tech, health, finance, etc.) |

Data Cleaning Applied

  • Duplicate removal on instruction field
  • Output length filter (outputs must exceed 20 characters)
  • ASCII-only filter for Indic splits
  • Unicode NFC normalization for Devanagari
  • Script verification per language
  • Toxicity filtering
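
The card does not include the cleaning code, so the following is only a sketch of how the first five steps could be chained. It makes several assumptions: records are dicts with `instruction`, `output`, and a hypothetical `lang` field; the ASCII-only filter is read as dropping ASCII-only samples from native-script splits; and NFC normalization is applied to all text rather than Devanagari alone. Toxicity filtering is omitted because the card does not say how it was done.

```python
import unicodedata

# Unicode block ranges used for script verification (native-script splits only).
SCRIPT_RANGES = {
    "hindi": (0x0900, 0x097F),     # Devanagari
    "marathi": (0x0900, 0x097F),   # Devanagari
    "bengali": (0x0980, 0x09FF),
    "gujarati": (0x0A80, 0x0AFF),
    "tamil": (0x0B80, 0x0BFF),
    "telugu": (0x0C00, 0x0C7F),
}


def has_script(text: str, lang: str) -> bool:
    """True if `text` contains at least one character from the script of `lang`."""
    lo, hi = SCRIPT_RANGES[lang]
    return any(lo <= ord(ch) <= hi for ch in text)


def clean(records: list[dict], min_output_chars: int = 20) -> list[dict]:
    """Apply dedup, length, ASCII-only, normalization, and script checks in order."""
    seen_instructions = set()
    kept = []
    for rec in records:
        # Unicode NFC normalization so visually identical strings compare equal.
        instruction = unicodedata.normalize("NFC", rec["instruction"]).strip()
        output = unicodedata.normalize("NFC", rec["output"]).strip()
        lang = rec["lang"]

        # Duplicate removal on the instruction field (first occurrence wins).
        if instruction in seen_instructions:
            continue
        seen_instructions.add(instruction)

        # Output length filter: keep only outputs longer than the threshold.
        if len(output) <= min_output_chars:
            continue

        if lang in SCRIPT_RANGES:
            combined = instruction + output
            # ASCII-only filter: a native-script sample with no non-ASCII text is suspect.
            if combined.isascii():
                continue
            # Script verification: the expected script must actually appear.
            if not has_script(combined, lang):
                continue

        kept.append({"instruction": instruction, "output": output, "lang": lang})
    return kept
```

Under this reading, Roman-script Hindi and Hinglish samples would need their own `lang` label (one not listed in `SCRIPT_RANGES`) so that the script checks do not discard them.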

Training Space

Interactive training dashboard: kshitij230/indic-assistant-trainer

Base Model Selection Reasoning

| Model | Indic Tokenizer NSL | Gated? | License | Decision |
|---|---|---|---|---|
| Qwen2.5-3B-Instruct | ~0.5 (best) | No | Research | ✅ Selected |
| Llama-3.2-3B-Instruct | 0.59 Hindi, 1.4-1.8 Dravidian | Yes | Llama 3.2 | ❌ Poor Dravidian tokenizer |
| Gemma-2-2B-it | ~0.55 | Yes (gated) | Gemma | ❌ Gated access blocks deployment |
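
NSL is not defined in this card. It is commonly taken to mean normalized sequence length: the number of tokens a tokenizer produces for a text divided by the number a baseline tokenizer produces, averaged over a corpus, with lower values meaning more compact encoding. The sketch below assumes that definition; the baseline tokenizer and evaluation texts behind the figures above are not stated, and `normalized_sequence_length` is a hypothetical helper.

```python
from typing import Callable, Iterable, Sequence


def normalized_sequence_length(
    texts: Iterable[str],
    tokenize: Callable[[str], Sequence],
    baseline_tokenize: Callable[[str], Sequence],
) -> float:
    """Mean ratio of token counts: tokenizer under test vs. a baseline tokenizer.

    Assumed definition of NSL; texts for which the baseline yields no tokens
    are skipped to avoid division by zero.
    """
    ratios = []
    for text in texts:
        baseline_len = len(baseline_tokenize(text))
        if baseline_len == 0:
            continue
        ratios.append(len(tokenize(text)) / baseline_len)
    if not ratios:
        raise ValueError("no text produced baseline tokens")
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Toy tokenizers for demonstration: whitespace words vs. single characters.
    sample = ["ab cd", "abcd"]
    print(normalized_sequence_length(sample, str.split, list))
```

With Hugging Face tokenizers, each callable could be something like `lambda t: tok.encode(t, add_special_tokens=False)`, one for the candidate model and one for whichever baseline is chosen.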

Limitations

  1. License: Qwen Research License (non-commercial). For production use, request a commercial license from Alibaba Cloud or move to Qwen2.5-7B (Apache 2.0).
  2. Training scale: 4,800 examples; a production model would benefit from 50K+.
  3. Bengali/Marathi underrepresented: ~190 examples each
  4. SFT-only: No RLHF/DPO alignment
  5. May hallucinate: Verify critical information
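
The ~190 figure in item 3 is consistent with the percentages listed under Supported Languages. A quick check, assuming those percentages describe the same 4,800-example training set (the names `TOTAL_EXAMPLES`, `TRAINING_SHARE_PCT`, and `estimated_counts` are hypothetical):

```python
TOTAL_EXAMPLES = 4800

# Training shares as listed under Supported Languages.
TRAINING_SHARE_PCT = {
    "Hindi": 40.1,
    "English": 24.9,
    "Hinglish": 15.1,
    "Bengali": 4.1,
    "Tamil": 4.0,
    "Telugu": 4.0,
    "Marathi": 3.9,
    "Gujarati": 3.9,
}


def estimated_counts(total: int, shares_pct: dict[str, float]) -> dict[str, int]:
    """Convert percentage shares into rounded example counts."""
    return {lang: round(total * pct / 100) for lang, pct in shares_pct.items()}


if __name__ == "__main__":
    for lang, n in estimated_counts(TOTAL_EXAMPLES, TRAINING_SHARE_PCT).items():
        print(f"{lang:10s} {n}")
```

By the same arithmetic, Tamil, Telugu, and Gujarati each land in the same range of roughly 190 examples, so the small-sample caveat applies to all five lower-resource languages.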

Next Steps (v2)

  1. Scale to 100K+ examples (use the full 49K prepared dataset plus additional data)
  2. Switch to Qwen2.5-7B-Instruct (Apache 2.0, better performance)
  3. Add DPO alignment training
  4. Include Kannada, Malayalam, Punjabi, Urdu
  5. Benchmark on IndicXTREME, MILU, MMLU
  6. GGUF quantization for CPU deployment