Indic General Assistant 3B v1
A multilingual Indian language assistant fine-tuned from Qwen/Qwen2.5-3B-Instruct using QLoRA. Designed to handle Hindi, English, Bengali, Tamil, Telugu, Marathi, Gujarati, and code-mixed (Hinglish) inputs naturally.
Key Differentiators
Compared with existing Indic models such as Krutrim, Airavata, and Sarvam, this model emphasizes:
- Code-Mixing Awareness: Detects and matches the user's language-mixing ratio. If you write roughly 70% Hindi and 30% English, the model is trained to respond in a similar mix.
- Transliteration Handling: Understands both Roman Hindi ("mujhe samjhao") and Devanagari (मुझे समझाओ).
- Cultural Context: Trained on Indian-specific contexts — festivals, GST/PAN/Aadhaar, regional customs, IPL, Bollywood.
- Cross-lingual: Can follow instructions in one language and respond in another.
- Token-efficient Indic tokenizer: Qwen2.5's 151K vocabulary reaches a normalized sequence length (NSL, lower is better) of about 0.5 on Devanagari, compared with 0.59 on Hindi and 1.4-1.8 on Dravidian scripts for Llama-3.
Supported Languages
| Language | Script | Training % |
|---|---|---|
| Hindi | Devanagari + Roman | 40.1% |
| English | Latin | 24.9% |
| Hinglish (code-mixed) | Mixed | 15.1% |
| Bengali | Bengali script | 4.1% |
| Tamil | Tamil script | 4.0% |
| Telugu | Telugu script | 4.0% |
| Marathi | Devanagari | 3.9% |
| Gujarati | Gujarati script | 3.9% |
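The percentages above can be converted into approximate example counts. This is a rough sketch that assumes the 4,800-example total quoted under Limitations and simple rounding; the actual per-language counts may differ slightly.

```python
# Approximate per-language example counts from the table's percentages.
# Assumption: the 4,800-example total quoted under Limitations applies here.
TOTAL_EXAMPLES = 4800

training_share_pct = {
    "Hindi": 40.1,
    "English": 24.9,
    "Hinglish (code-mixed)": 15.1,
    "Bengali": 4.1,
    "Tamil": 4.0,
    "Telugu": 4.0,
    "Marathi": 3.9,
    "Gujarati": 3.9,
}

# Round each share to the nearest whole example.
approx_counts = {
    lang: round(TOTAL_EXAMPLES * pct / 100)
    for lang, pct in training_share_pct.items()
}

for lang, count in approx_counts.items():
    print(f"{lang:<22} ~{count}")
```

Under these assumptions Bengali and Marathi come out just under 200 examples each, which is in line with the "~190 examples each" figure under Limitations.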
Quick Start
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "kshitij230/indic-general-assistant-3b-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
messages = [
    {"role": "system", "content": "You are a helpful multilingual assistant for Indian users."},
    {"role": "user", "content": "Mujhe batao ki machine learning kaise kaam karta hai, simple language mein"},
]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Example Conversations
Hinglish (Code-Mixed)
User: Mujhe help chahiye Python mein list sort karne mein
Assistant: List sort karna Python mein bahut easy hai. Do tarike hain:
my_list.sort() — yeh original list modify karta hai
sorted(my_list) — yeh new sorted list return karta hai
Hindi (Devanagari)
User: दिवाली के बारे में बताओ
Assistant: दिवाली भारत का सबसे बड़ा त्यौहार है जिसे 'रोशनी का त्यौहार' कहते हैं...
Financial (Indian context)
User: Mutual funds mein invest kaise karein?
Assistant: KYC complete karo (PAN + Aadhaar), Zerodha/Groww pe account banao, SIP se start karo...
Training Details
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-3B-Instruct (3.1B params) |
| Method | QLoRA (4-bit NF4 + LoRA r=16, alpha=32) |
| Target Modules | q,k,v,o_proj + gate,up,down_proj |
| Learning Rate | 2e-4 (cosine schedule) |
| Effective Batch | 8 (2 × 4 grad_accum) |
| Max Seq Length | 2048 |
| Optimizer | Paged AdamW 32-bit |
| Precision | bfloat16 |
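For readers who want to reproduce a similar setup, the table above maps roughly onto the following `transformers`, `peft`, and `bitsandbytes` configuration objects. This is a sketch, not the author's actual training script: `output_dir` is a hypothetical name, the per-device batch size of 2 assumes a single GPU, and values the table does not list (for example LoRA dropout) are left at library defaults. `bf16=True` requires hardware with bfloat16 support.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on the attention and MLP projections listed in the table.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# Optimizer and schedule settings; effective batch = 2 x 4 = 8.
training_args = TrainingArguments(
    output_dir="indic-assistant-qlora",  # hypothetical path, not from the card
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=2,       # assumes one GPU
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    bf16=True,
)
# The 2048-token max sequence length is normally set on the trainer or
# during tokenization, so it is not shown here.
```

`bnb_config` would be passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained`, and `lora_config` to whichever trainer or `get_peft_model` call wraps the model.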
Dataset Sources
| Source | Languages | Type |
|---|---|---|
| ai4bharat/indic-instruct-data-v0.1 | Hindi + English | Instruction-following |
| CohereLabs/aya_dataset | 12 Indic languages | Instruction-following |
| Abhishekcr448/Hinglish-Everyday-Conversations-1M | Hinglish | Conversations |
| findnitai/english-to-hinglish | En↔Hinglish | Translation |
| festvox/cmu_hinglish_dog | Hinglish | Grounded dialog |
| Hand-crafted synthetic | Code-mixed | 10 domains (tech, health, finance, etc.) |
Data Cleaning Applied
- Duplicate removal on instruction field
- Output length filter (outputs must be longer than 20 characters)
- ASCII-only filter for Indic splits
- Unicode NFC normalization for Devanagari
- Script verification per language
- Toxicity filtering
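The cleaning steps above could be sketched roughly as follows. This is an illustrative reconstruction, not the author's pipeline: the `instruction`/`output`/`lang` field names, the language codes, and the 0.5 script-share threshold are all assumptions. It also reads "ASCII-only filter" as "drop ASCII-only examples from native-script splits" and assumes Roman Hindi and Hinglish live in separately tagged splits that skip the script checks. Toxicity filtering is omitted because the card does not say how it was done.

```python
import unicodedata

# Unicode block ranges for the native scripts in the language table.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "mr": (0x0900, 0x097F),  # Devanagari
    "bn": (0x0980, 0x09FF),  # Bengali
    "gu": (0x0A80, 0x0AFF),  # Gujarati
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
}


def script_share(text, lang):
    """Fraction of letters and combining marks that fall in lang's script block."""
    lo, hi = SCRIPT_RANGES[lang]
    chars = [
        ch for ch in text
        if ch.isalpha() or unicodedata.category(ch).startswith("M")
    ]
    if not chars:
        return 0.0
    return sum(lo <= ord(ch) <= hi for ch in chars) / len(chars)


def clean(examples, min_output_chars=20, min_script_share=0.5):
    """Apply dedup, length, ASCII-only, NFC, and script checks in one pass."""
    seen_instructions = set()
    kept = []
    for ex in examples:
        # Unicode NFC normalization (applied to all text here for simplicity).
        instruction = unicodedata.normalize("NFC", ex["instruction"]).strip()
        output = unicodedata.normalize("NFC", ex["output"]).strip()
        lang = ex["lang"]

        if instruction in seen_instructions:      # duplicate removal
            continue
        if len(output) <= min_output_chars:       # output length filter
            continue
        if lang in SCRIPT_RANGES:
            if (instruction + output).isascii():  # ASCII-only filter
                continue
            if script_share(output, lang) < min_script_share:  # script check
                continue

        seen_instructions.add(instruction)
        kept.append({"instruction": instruction, "output": output, "lang": lang})
    return kept
```

Calling `clean(rows)` on a list of dicts returns the surviving rows in their original order; languages not present in `SCRIPT_RANGES` (such as English) only go through the duplicate and length checks.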
Training Space
Interactive training dashboard: kshitij230/indic-assistant-trainer
Base Model Selection Reasoning
| Model | Indic tokenizer NSL (lower is better) | Gated? | License | Decision |
|---|---|---|---|---|
| Qwen2.5-3B-Instruct | ~0.5 (best) | No | Qwen Research (non-commercial) | ✅ Selected |
| Llama-3.2-3B-Instruct | 0.59 Hindi, 1.4-1.8 Dravidian | Yes | Llama 3.2 | ❌ Poor Dravidian tokenizer |
| Gemma-2-2B-it | ~0.55 | Yes (gated) | Gemma | ❌ Gated access blocks deployment |
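To make the NSL column concrete, here is a minimal sketch of how such a number can be computed. It assumes NSL means the average ratio of a tokenizer's encoded length to a baseline tokenizer's encoded length on the same text; the baseline and evaluation corpus behind the table's figures are not stated in this card, and the two toy tokenizers below are stand-ins rather than real models.

```python
from typing import Callable, Sequence


def normalized_sequence_length(
    texts: Sequence[str],
    tokenize: Callable[[str], Sequence],
    baseline_tokenize: Callable[[str], Sequence],
) -> float:
    """Mean ratio of token counts (tokenizer / baseline); lower is better."""
    ratios = [len(tokenize(t)) / len(baseline_tokenize(t)) for t in texts]
    return sum(ratios) / len(ratios)


# Toy stand-ins so the example runs without downloading any tokenizer.
def word_tokenizer(text):
    return text.split()


def char_tokenizer(text):
    return list(text)


sample = ["mujhe samjhao", "machine learning kya hai"]
nsl = normalized_sequence_length(sample, word_tokenizer, char_tokenizer)
print(f"NSL of word tokenizer vs. character baseline: {nsl:.3f}")
```

With real Hugging Face tokenizers, each callable could be something like `lambda t: tokenizer.encode(t, add_special_tokens=False)`, evaluated over a representative sample of text in each script.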
Limitations
- License: Qwen Research License (non-commercial). For production use, request a commercial license from Alibaba Cloud or switch to Qwen2.5-7B (Apache 2.0)
- Training scale: 4,800 examples; a production model would likely benefit from 50K+
- Bengali/Marathi underrepresented: ~190 examples each
- SFT-only: No RLHF/DPO alignment
- May hallucinate: Verify critical information
Next Steps (v2)
- Scale to 100K+ examples (start with the full 49K prepared dataset, then add more)
- Switch to Qwen2.5-7B-Instruct (Apache 2.0, better performance)
- Add DPO alignment training
- Include Kannada, Malayalam, Punjabi, Urdu
- Benchmark on IndicXTREME, MILU, MMLU
- GGUF quantization for CPU deployment