Qwen2.5-3B Hinglish Instruction-Tuned (QLoRA)

A LoRA adapter that fine-tunes Qwen2.5-3B-Instruct for natural code-mixed Hindi-English (Hinglish) conversation. It was trained on 10,594 synthetic Hinglish instruction examples covering casual chat, customer support, Q&A, and sentiment classification.

Headline result

| Metric | Qwen-base | GPT-4o-mini | GPT-4o | This model |
|---|---|---|---|---|
| Hinglish marker density | 8.9% | 29.5% | 24.6% | 31.6% |
| English drift rate | 32% | 0% | 4% | 0% |
| Devanagari injection bug | 12.5% | 0% | 2.5% | 0% |
| Claude judge register score (/5) | 1.24 | 2.50 | 2.12 | 3.98 |
| Claude judge total (/20) | 6.72 | 13.56 | 12.90 | 12.48 |

Bottom line: Exceeds GPT-4o-mini on Hinglish register naturalness (judge register score 3.98 vs 2.50), with serving cost ranging from parity to ~3.4× lower depending on infrastructure choice. Trails GPT-4o-mini by ~8% on the judge total (12.48 vs 13.56), reflecting weaker content quality (intent accuracy + factuality). Best suited to style-sensitive conversational use cases at sustained traffic, where dedicated GPU instances become economical compared with per-token API pricing.

Cost comparison (measured)

Benchmarked on NVIDIA T4 (HuggingFace transformers, fp16, batch=16, ~294 tok/sec).

| Infrastructure | $/M tokens | vs GPT-4o-mini |
|---|---|---|
| AWS T4 on-demand | $0.50 | parity |
| GCP T4 on-demand | $0.33 | 1.5× cheaper |
| AWS T4 reserved (1yr) | $0.30 | 1.7× cheaper |
| RunPod community | $0.18 | 2.8× cheaper |
| AWS T4 spot | $0.15 | 3.4× cheaper |
| GPT-4o-mini API | $0.51 (blended 20%/80% in/out) | baseline |

Note: a vLLM or TGI deployment would likely improve self-hosted throughput by ~50-100%, shifting the comparison further in the fine-tune's favor. This was not benchmarked here due to environment constraints.
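The per-token figures in this section follow from throughput and hourly instance price, so a throughput gain translates directly into a lower cost. The sketch below shows that arithmetic. The function names are illustrative, the hourly rate is a hypothetical placeholder rather than a quoted price, and the GPT-4o-mini list prices ($0.15 input and $0.60 output per million tokens) are an assumption that reproduces the $0.51 blended figure.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """USD per one million generated tokens on an instance billed by the hour."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


def blended_api_cost(input_usd_per_m: float, output_usd_per_m: float,
                     input_share: float) -> float:
    """Blended USD per million tokens for a given input/output token mix."""
    return input_share * input_usd_per_m + (1 - input_share) * output_usd_per_m


MEASURED_TOK_PER_SEC = 294  # throughput from the T4 benchmark in this section

# Assumed GPT-4o-mini list prices (USD per million tokens), 20% input / 80% output.
api_cost = blended_api_cost(0.15, 0.60, input_share=0.20)

# Hypothetical hourly rate, for illustration only; substitute your provider's price.
example_hourly_usd = 0.50
self_hosted = cost_per_million_tokens(example_hourly_usd, MEASURED_TOK_PER_SEC)

# If a faster serving stack doubled throughput, cost per token would halve.
self_hosted_2x = cost_per_million_tokens(example_hourly_usd, 2 * MEASURED_TOK_PER_SEC)

print(f"API blended:      ${api_cost:.2f}/M tokens")
print(f"Self-hosted:      ${self_hosted:.2f}/M tokens")
print(f"Self-hosted (2x): ${self_hosted_2x:.2f}/M tokens")
```

Because cost scales inversely with throughput, the break-even hourly price against the API rises in proportion to any serving speedup.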

How to use

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, "DSMJ910/qwen2.5-3b-hinglish-lora")

messages = [{"role": "user", "content": "Bhai weekend pe Bangalore mein kya karein?"}]
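# Build the prompt with the chat template; return_dict=True also returns the attention mask
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)  # follow device_map="auto" instead of hard-coding "cuda"
outputs = model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the echoed prompt
reply = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(reply, skip_special_tokens=True))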

Training details

  • Base model: Qwen2.5-3B-Instruct (4-bit NF4 quantization)
  • Adapter: LoRA rank=16, alpha=32, dropout=0
  • Target modules: all linear layers (q/k/v/o projections + MLP gate/up/down)
  • Trainable parameters: 30M (1% of base model)
  • Training data: 10,594 synthetic Hinglish instruction examples (see dataset link)
  • Hyperparameters: lr=2e-4, batch_size=16 (effective), 2 epochs, AdamW 8-bit, linear schedule, bf16
  • Hardware: Single Blackwell GPU (95 GB VRAM)
  • Training time: 9.2 minutes
  • Adapter size: 125 MB
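The trainable-parameter figure can be sanity-checked by hand: a rank-r LoRA adapter on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) weights. The sketch below applies that to the seven target modules. It assumes the layer dimensions of the published Qwen2.5-3B config (hidden size 2048, MLP size 11008, 36 layers, 2 key/value heads of dimension 128); these are not stated in this card and should be checked against the model's config.json.

```python
# Assumed Qwen2.5-3B dimensions (verify against the model's config.json).
HIDDEN = 2048
INTERMEDIATE = 11008
NUM_LAYERS = 36
KV_DIM = 2 * 128  # 2 key/value heads x head_dim 128 (grouped-query attention)
RANK = 16         # LoRA rank from the training details

# (in_features, out_features) for each targeted linear layer in one block.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted layer."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the count comes to roughly 29.9M, consistent with the 30M reported. Stored as 32-bit floats, that is about 120 MB, in the same range as the listed adapter size.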

Evaluation

Quantitative

  • 50-prompt hand-curated Hinglish eval set (4 categories: casual, customer support, Q&A, sentiment)
  • Automated metrics: Hinglish marker density, English drift detection, Devanagari injection check
  • LLM-as-judge: Claude Sonnet 4.6 evaluating pairwise on 4 axes (Register, Intent, Quality, Culture), each scored out of 5 for a total out of 20
  • Methodological note: Claude was chosen as judge because it comes from a different vendor than the training-data generator (GPT-4o-mini), to avoid evaluation circularity.
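A minimal sketch of how automated checks like these could be implemented. The marker lexicon, the tokenization, and the drift threshold are hypothetical placeholders, not the ones used to produce the reported numbers. Only the Devanagari check is fully determined: it looks for any character in the Unicode block U+0900 to U+097F.

```python
import re

# Hypothetical marker lexicon: common romanized Hindi function words.
HINGLISH_MARKERS = {"hai", "hain", "kya", "nahi", "ka", "ki", "ke", "mein",
                    "toh", "bhai", "aur", "ko", "se", "kar", "raha", "tha"}

DEVANAGARI = re.compile(r"[\u0900-\u097F]")
WORD = re.compile(r"[a-z']+")


def marker_density(text: str) -> float:
    """Fraction of Latin-script words that appear in the marker lexicon."""
    words = WORD.findall(text.lower())
    if not words:
        return 0.0
    return sum(w in HINGLISH_MARKERS for w in words) / len(words)


def has_devanagari(text: str) -> bool:
    """True if the reply contains any Devanagari character."""
    return DEVANAGARI.search(text) is not None


def is_english_drift(text: str, threshold: float = 0.05) -> bool:
    """Flag a reply whose marker density falls below an assumed threshold."""
    return marker_density(text) < threshold


reply = "Bhai weekend pe Cubbon Park chalo, weather bhi accha hai aur crowd kam hai."
print(marker_density(reply), has_devanagari(reply), is_english_drift(reply))
```

A lexicon-based density is crude, since it misses markers outside the list and counts English homographs, so the lexicon would need tuning against hand-labeled replies before its numbers are comparable across models.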

Known limitations

  1. Roman script only. Training data is 100% Roman Hinglish; mixed-script inputs (Devanagari) may not be handled robustly. A future v2 is planned to address this.
  2. Conversational > instructional. The model defaults to a "friendly chat" mode, which sometimes reduces precision on classification tasks (e.g., it confuses sentiment classification with intent classification).
  3. Synthetic training data. All training examples were generated by GPT-4o-mini; this introduces stylistic patterns specific to GPT-4o-mini that the fine-tune inherits.
  4. Small eval set. N=50 prompts; larger evaluation would tighten confidence intervals.
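To make the small-sample caveat concrete, the sketch below computes a 95% Wilson score interval for a proportion measured on N=50 prompts. Treating each prompt as an independent trial is an assumption, and the Wilson interval is one common choice, not necessarily the method used in this evaluation.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# 0 drifting replies out of 50 prompts (a 0% English drift rate).
low, high = wilson_interval(0, 50)
print(f"0/50  -> [{low:.3f}, {high:.3f}]")

# 16 out of 50 (a 32% English drift rate).
low, high = wilson_interval(16, 50)
print(f"16/50 -> [{low:.3f}, {high:.3f}]")
```

At N=50, an observed rate of 0% is still compatible with a true rate of up to roughly 7%, which is why a larger prompt set would make the headline comparisons more conclusive.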

Citation

If you use this model, please cite:

@misc{hinglish-qwen-3b-2026,
  title={Qwen2.5-3B Hinglish: QLoRA Fine-tuning for Indian Code-Mixed Conversation},
  author={Muskan Jaiswal},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/DSMJ910/qwen2.5-3b-hinglish-lora}
}