Qwen2.5-1.5B Marathi-Instruct (QLoRA adapter)
A LoRA adapter for Qwen/Qwen2.5-1.5B-Instruct that teaches the model to follow instructions in Marathi (मराठी). Trained on a free Colab T4 — no premium hardware needed.
Why this exists: Marathi has 83M+ native speakers (3rd most-spoken language in India), but the Indic LLM ecosystem ships Hindi by default and treats Marathi as second-tier. This adapter aims to show that closing the gap takes curated instruction data and the right training stack, not more raw model capacity.
Quick use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_id = "Qwen/Qwen2.5-1.5B-Instruct"
adapter_id = "tusharislampure29/qwen2.5-1.5b-marathi-instruct"

# Load the base model, then attach the LoRA adapter on top of it.
tok = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

msgs = [
    # System: "You are a helpful AI assistant that gives clear and accurate answers in Marathi."
    {"role": "system", "content": "तुम्ही एक उपयुक्त AI सहाय्यक आहात जो मराठीत स्पष्ट आणि अचूक उत्तरे देतो."},
    # User: "What is the capital of Maharashtra and what is its population?"
    {"role": "user", "content": "महाराष्ट्राची राजधानी कोणती आहे आणि तिथली लोकसंख्या किती आहे?"},
]

# Render the chat template, sample a reply, and decode only the newly generated tokens.
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Training data
- Source: Marathi slice of CohereForAI/aya_collection_language_split (~3.5M rows).
- Pipeline: Devanagari-script filter → length filter (10-2000 chars) → random pool of 60k → MinHash dedupe (Jaccard 0.85) → final 30k.
- Splits: 27k train / 1.5k val / 1.5k test, all formatted in Qwen2.5 ChatML with a Marathi system prompt.
- Public release: tusharislampure29/marathi-instruct-30k.
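The filtering and deduplication steps listed above can be sketched in plain Python. This is a minimal illustration and not the project's actual pipeline: every function name, the 50% Devanagari threshold, and the use of exact Jaccard similarity over character 5-grams (MinHash approximates the same quantity at scale) are assumptions made for the example.

```python
import random

def devanagari_ratio(text):
    """Share of alphabetic characters that lie in the Unicode Devanagari block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(0x0900 <= ord(c) <= 0x097F for c in letters) / len(letters)

def shingles(text, n=5):
    """Set of character n-grams used as the unit of comparison."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dedupe(rows, threshold=0.85):
    """Greedy near-duplicate removal: keep a row only if it stays below the
    threshold against every row kept so far (O(n^2), fine for a demo only)."""
    kept, kept_shingles = [], []
    for row in rows:
        s = shingles(row)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(row)
            kept_shingles.append(s)
    return kept

def build_dataset(rows, pool_size=60_000, final_size=30_000, seed=0,
                  min_chars=10, max_chars=2000, min_devanagari=0.5):
    """Script filter -> length filter -> random pool -> dedupe -> truncate."""
    filtered = [r for r in rows
                if min_chars <= len(r) <= max_chars
                and devanagari_ratio(r) >= min_devanagari]
    rng = random.Random(seed)
    pool = rng.sample(filtered, min(pool_size, len(filtered)))
    return dedupe(pool)[:final_size]
```

At the real scale (a 60k pool), pairwise comparison is too slow, which is presumably why the pipeline uses MinHash; the quadratic loop here only makes the 0.85 threshold concrete.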
Training recipe
| Setting | Value |
|---|---|
| Base | Qwen/Qwen2.5-1.5B-Instruct |
| Quantization | nf4, double-quant, compute dtype fp16 (T4) |
| LoRA r / α / dropout | 16 / 32 / 0.05 |
| Target modules | q_proj k_proj v_proj o_proj gate_proj up_proj down_proj |
| Trainable params | ~18.5M (1.18% of base) |
| Max seq length | 1024 |
| Optimizer | AdamW (8-bit, paged) |
| LR / scheduler / warmup | 2e-4 / cosine / 0.03 |
| Effective batch size | 32 (4 × 8 grad accum) |
| Epochs | 3 |
| Hardware | 1× Colab free T4 16GB |
| Stack | Unsloth + TRL SFTTrainer |
| Best-checkpoint criterion | lowest eval_loss |
Mid-training checkpoints are pushed to this repo as checkpoint-{step}/, so a Colab session disconnect does not mean lost work.
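Two numbers in the recipe table can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the layer shapes from the published Qwen2.5-1.5B config (hidden size 1536, MLP size 8960, 28 layers, 256-dimensional key/value projections). Those shapes and all names in the code are assumptions for illustration, not taken from the training code. LoRA adds r × (d_in + d_out) parameters per adapted matrix, and the step count assumes the trailing partial batch of each epoch still counts as one optimizer step.

```python
import math

RANK = 16
HIDDEN, KV_DIM, MLP, LAYERS = 1536, 256, 8960, 28  # assumed Qwen2.5-1.5B shapes

# (d_in, d_out) for each LoRA target module in one transformer layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(rank, shapes, layers):
    """Each adapted matrix gets A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * layers

def optimizer_steps(n_train, effective_batch, epochs):
    """Optimizer steps when each epoch's last partial batch still takes a step."""
    return math.ceil(n_train / effective_batch) * epochs

trainable = lora_params(RANK, MODULE_SHAPES, LAYERS)
steps = optimizer_steps(27_000, 4 * 8, 3)
print(trainable, steps)  # 18464768 2532
```

Under these assumptions the adapter comes to about 18.5M parameters, and the step count lines up with the final evaluation step (2532) reported in the evaluation section.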
Evaluation
Training-side: eval_loss on held-out val (n=1500)
| step | 200 | 400 | 600 | 800 | 1000 | 1200 | 1400 | 1600 | 1800 | 2000 | 2200 | 2400 | 2532 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| eval_loss | 0.767 | 0.702 | 0.669 | 0.645 | 0.632 | 0.618 | 0.608 | 0.598 | 0.595 | 0.591 | 0.589 | 0.587 | 0.587 |
Monotonic decrease (−23.5% total). Final checkpoint is best (load_best_model_at_end=True); the model was still improving at end of 3 epochs.
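The summary figures can be checked directly from the table. A small sketch (variable and function names are illustrative):

```python
# eval_loss values copied from the table above, in step order.
eval_loss = [0.767, 0.702, 0.669, 0.645, 0.632, 0.618, 0.608,
             0.598, 0.595, 0.591, 0.589, 0.587, 0.587]

def relative_change(first, last):
    """Signed relative change from the first to the last value."""
    return (last - first) / first

# True when no evaluation is worse than the one before it.
non_increasing = all(b <= a for a, b in zip(eval_loss, eval_loss[1:]))
total_change = relative_change(eval_loss[0], eval_loss[-1])
print(non_increasing, f"{total_change:.1%}")  # True -23.5%
```

At three decimals the last two evaluations tie at 0.587, so the curve is non-increasing rather than strictly decreasing.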
Auto-metrics on held-out test split (n=500, greedy, max 256 new tokens; Kaggle T4)
Held-out 500-prompt slice of the test split. Greedy decoding with a Marathi system prompt. Base and tuned outputs come from the same loaded model, with the adapter switched off via the model.disable_adapter() context manager for the base column, so the two columns differ only by the LoRA adapter.
| Metric | Base | Tuned | Δ (abs) | Δ (rel) |
|---|---|---|---|---|
| ROUGE-L F1 | 0.0314 | 0.0441 | +0.0127 | +40 % |
| chrF | 23.64 | 33.43 | +9.79 | +41 % |
| sacreBLEU | 5.13 | 19.30 | +14.16 | +276 % |
The BLEU jump from 5.13 to 19.30 is the headline auto-metric finding: the tuned adapter's sacreBLEU score is about 3.8× the base model's, meaning its outputs share far more n-grams with the reference text. The full 500 generation pairs are reproducible from notebooks/eval_full_kaggle.ipynb.
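The A/B toggle described above can be sketched as a single helper. `disable_adapter()` is a context manager on PEFT's `PeftModel`; the helper name `generate_base_and_tuned` and its arguments are hypothetical, and the project's own harness (notebooks/eval_full_kaggle.ipynb) may be organised differently.

```python
def generate_base_and_tuned(model, tok, messages, max_new_tokens=256):
    """Greedy-decode one prompt twice on the same PeftModel:
    once with the LoRA adapter active, once with it disabled."""
    text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tok(text, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]

    def greedy():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=False, pad_token_id=tok.eos_token_id)
        # Drop the prompt tokens so only the completion is decoded.
        return tok.decode(out[0][prompt_len:], skip_special_tokens=True)

    tuned = greedy()                  # adapter active
    with model.disable_adapter():     # adapter weights temporarily bypassed
        base = greedy()
    return base, tuned
```

Because both generations share the same weights, tokenizer, and decoding settings, any difference between the two outputs can be attributed to the adapter alone.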
Behavioural eval — 50-prompt hand-curated Marathi cultural set (Kaggle T4)
Greedy generation, max 200 tokens, system prompt तुम्ही एक उपयुक्त AI सहाय्यक…. Keyword recall scores the fraction of expected keywords (from each prompt's expected_keywords list) that appear in the model output.
| Category | n | Base | Tuned | Δ |
|---|---|---|---|---|
| overall | 50 | 0.053 | 0.050 | −0.003 |
| geography | 10 | 0.000 | 0.000 | 0.000 |
| history | 10 | 0.033 | 0.000 | −0.033 |
| culture | 10 | 0.067 | 0.000 | −0.067 |
| language | 10 | 0.100 | 0.100 | 0.000 |
| reasoning | 10 | 0.067 | 0.150 | +0.083 (+125 % rel) |
Interpretation. Keyword recall is an honest but brittle metric: it punishes paraphrase and rewards verbosity. Qualitatively (see the 50 generation pairs in the project repo's eval/results/), the tuned model is consistently more fluent, more on-topic, and more direct — but its factual recall on long-tail Maharashtra trivia (specific saints' villages, exact district counts, Peshwa lineages) is capped by the 1.5B base model's parametric knowledge. The clearest behavioural improvement is on Marathi-language reasoning prompts, where the tuned adapter gives correct direct answers (e.g. १२ × ३० = ३६० रुपये) where the base model code-mixes Hindi and hallucinates.
The full evaluation set covers Maharashtra geography, history (Shivaji, social reformers), culture (festivals, food, warkari tradition), language (idioms, classic Marathi authors), and Marathi-language reasoning — see eval/cultural_marathi_eval.json.
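The keyword-recall metric described above can be sketched in a few lines. The function name and the choice of plain substring matching are assumptions; the project's harness may normalise text differently.

```python
def keyword_recall(output, expected_keywords):
    """Fraction of expected keywords that occur as substrings of the output."""
    if not expected_keywords:
        return 0.0
    text = output.casefold()
    hits = sum(1 for kw in expected_keywords if kw.casefold() in text)
    return hits / len(expected_keywords)

# Two of the three expected keywords appear in the answer.
answer = "महाराष्ट्राची राजधानी मुंबई आहे."
score = keyword_recall(answer, ["मुंबई", "राजधानी", "पुणे"])

# A correct answer written in another script scores zero: the metric is brittle.
paraphrase_score = keyword_recall("The capital is Mumbai.", ["मुंबई"])
print(score, paraphrase_score)
```

The second call illustrates why the metric punishes paraphrase: a factually correct answer that does not reuse the exact keyword string gets no credit.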
GPT-4o 4-axis rubric on cultural set (n=50, base + tuned)
GPT-4o scores each response 1-5 on fluency / factuality / cultural_accuracy / instruction_following. The judge is calibrated strictly (it is told to "give a 1 if the response is in the wrong language, gibberish, or empty"), which pulls both columns toward the floor on the long-tail Maharashtra trivia. The signal is in the direction and relative size of the gap, not in the absolute scores.
| Axis | Base | Tuned | Δ (abs) | Δ (rel) |
|---|---|---|---|---|
| fluency | 1.22 | 1.40 | +0.18 | +15 % |
| factuality | 1.12 | 1.22 | +0.10 | +9 % |
| cultural_accuracy | 1.16 | 1.28 | +0.12 | +10 % |
| instruction_following | 1.08 | 1.28 | +0.20 | +19 % |
Per-category breakdown on instruction_following — the axis the QLoRA SFT objective most directly targets:
| category | base | tuned | Δ |
|---|---|---|---|
| geography | 1.1 | 1.2 | +0.1 |
| history | 1.0 | 1.0 | 0.0 |
| culture | 1.0 | 1.1 | +0.1 |
| language | 1.2 | 1.4 | +0.2 |
| reasoning | 1.1 | 1.7 | +0.6 |
The strongest behavioural lift is on the Marathi-language reasoning category: fluency 1.2→1.7, factuality 1.2→1.5, instruction_following 1.1→1.7. This matches the keyword-recall finding above. On long-tail trivia (Shivaji-era history, specific saints' villages, exact district counts) both models stay near the rubric floor. The likely cause is that a 1.5B-parameter base model stores few of those facts reliably, which points to a model-capacity ceiling rather than a fine-tuning one.
GPT-4o pairwise A/B on 100 held-out test prompts
Each (prompt, base_response, tuned_response) triple is sent to GPT-4o, which picks A / B / TIE on Marathi fluency, factual accuracy, and helpfulness. A/B order is randomized per prompt to neutralise position bias (src/eval_harness.py::llm_judge).
| outcome | count | rate |
|---|---|---|
| Tuned wins | 71 / 100 | 71 % |
| Base wins | 19 / 100 | 19 % |
| Ties | 10 / 100 | 10 % |
The tuned adapter is preferred over the base model 3.7× more often than the reverse on held-out Marathi instruction prompts — and this is the cleanest behavioural signal in the whole release, because it directly compares full outputs head-to-head instead of measuring overlap with a single reference.
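The order randomisation and the win-rate tally can be sketched as follows. All function names here are hypothetical; the actual judge call lives in src/eval_harness.py::llm_judge and is not reproduced.

```python
import random
from collections import Counter

def present_pair(base_response, tuned_response, rng):
    """Randomly assign base/tuned to slots A and B; return the slots and a key."""
    if rng.random() < 0.5:
        return (base_response, tuned_response), {"A": "base", "B": "tuned"}
    return (tuned_response, base_response), {"A": "tuned", "B": "base"}

def resolve_verdict(verdict, key):
    """Map a judge verdict ('A', 'B' or 'TIE') back to 'base', 'tuned' or 'tie'."""
    return "tie" if verdict == "TIE" else key[verdict]

def summarize(outcomes):
    """Per-outcome rates plus the tuned-to-base win ratio."""
    counts = Counter(outcomes)
    n = len(outcomes)
    rates = {k: counts[k] / n for k in ("tuned", "base", "tie")}
    ratio = counts["tuned"] / counts["base"] if counts["base"] else float("inf")
    return rates, ratio

# Tally matching the table above: 71 tuned wins, 19 base wins, 10 ties.
rates, ratio = summarize(["tuned"] * 71 + ["base"] * 19 + ["tie"] * 10)
print(rates, round(ratio, 2))
```

The ratio of tuned wins to base wins (71 / 19) is where the roughly 3.7× preference figure comes from; ties are excluded from that ratio.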
Tokenizer efficiency note
Qwen2.5's tokenizer was trained on English-heavy data. On Marathi it is 4.79× less efficient by character than English:
| Metric | Marathi | English |
|---|---|---|
| chars/token | 1.04 | 4.98 |
| tokens/word | 6.36 | 1.16 |
| fragmentation rate | 98.8% | 16.5% |
A 1024-token context fits 161 Marathi words vs 882 English words. This is a hard ceiling on context utility for Marathi users; a v2 of this project should ship a Marathi-extended tokenizer. Full analysis: src/tokenizer_analysis.py.
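The context-budget figures follow from the table by simple division. A short sketch of the arithmetic (names are illustrative; the measured statistics themselves come from src/tokenizer_analysis.py):

```python
import math

CONTEXT_TOKENS = 1024

# Measured tokenizer statistics, copied from the table above.
stats = {
    "marathi": {"chars_per_token": 1.04, "tokens_per_word": 6.36},
    "english": {"chars_per_token": 4.98, "tokens_per_word": 1.16},
}

def words_that_fit(context_tokens, tokens_per_word):
    """Whole words that fit in a token budget at an average tokens-per-word rate."""
    return math.floor(context_tokens / tokens_per_word)

# How many times more characters one token covers in English than in Marathi.
char_ratio = stats["english"]["chars_per_token"] / stats["marathi"]["chars_per_token"]
fits = {lang: words_that_fit(CONTEXT_TOKENS, s["tokens_per_word"])
        for lang, s in stats.items()}
print(round(char_ratio, 2), fits)  # 4.79 {'marathi': 161, 'english': 882}
```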
Limitations
- 1.5B parameters — strong fluency in everyday Marathi, but limited factual recall on long-tail trivia. Pair with retrieval for any production use.
- Tokenizer mismatch (see above): on Marathi inputs, a given token budget holds roughly 1/5 of the text it would hold in English.
- Three epochs on 27k samples are enough for instruction-following style; deeper knowledge requires either a bigger base or a continued-pretraining stage on a Marathi corpus.
- English fall-through — for prompts that are clearly out-of-domain (code, math reasoning beyond simple arithmetic), the model can still drift into English. The system prompt helps but is not a hard guarantee.
- No safety tuning beyond the base model.
Intended use & out-of-scope
- ✅ Marathi Q&A, summarisation, simple instruction-following, conversational assistants, demo / educational projects.
- ❌ Medical, legal, financial advice. Anything that requires guaranteed factual correctness without verification.
Citation
```bibtex
@misc{islampure2026marathiqwen,
  author = {Tushar Islampure},
  title = {Qwen2.5-1.5B Marathi-Instruct (QLoRA adapter)},
  year = 2026,
  publisher = {Hugging Face},
  url = {https://huggingface.co/tusharislampure29/qwen2.5-1.5b-marathi-instruct},
}
```
Acknowledgements
- AI4Bharat & CohereForAI for the Indic instruction datasets.
- Alibaba Qwen team for Qwen2.5 (Apache 2.0).
- Unsloth for the T4-friendly QLoRA stack.
- OpenAI for GPT-4o (used as the evaluation judge for the rubric and the pairwise A/B).
Project source code (data prep, eval harness, notebook, decisions log): github.com/tusharislampure29/marathi-instruct-llm.