HelaLM-4B - A State-of-the-Art Open Instruction Model for Sinhala (සිංහල)
HelaLM (හෙළ - an ancient endonym for Sri Lanka and the Sinhala people) is an instruction-tuned
adaptation of google/gemma-3-4b-it for Sinhala, a language spoken by ~22 million people yet
represented in well under 0.1% of public web text. On a held-out 300-prompt native-Sinhala
evaluation, HelaLM is preferred over its own base model in 92.3% of blind pairwise comparisons
(277 wins, 23 losses, 0 ties; +84.7% relative to a 50% parity reference), while also improving every multiple-choice knowledge
benchmark we measured.
Most low-resource adaptations reach for older bases such as Llama-3.2. HelaLM is built instead on Gemma-3-4b, Google's newer-generation open model, and the choice is evidence-backed, not cosmetic: in our controlled ablation the identical recipe on Llama-3.2-3B produced degenerate, code-switched Sinhala, while on Gemma-3 it was 94% preferred. Note also what the 92.3% means: picking a weak base inflates any "relative improvement" number because a broken model is trivial to beat. We deliberately took the strongest modern base, which already produces passable Sinhala, and still win 92.3% of blind comparisons against it. That is a far harder and more meaningful result.
This card documents the full methodology, the controlled ablations that produced it, the evaluation protocol, and the failure modes, with the explicit goal of being reproducible end to end.
Submitted to the Adaption AutoScientist Challenge (Part 1 - Language). The training corpus was produced with Adaptive Data (Adaption Labs); the model was trained on the resulting adapted dataset. Both dataset and weights are released openly (see Reproducibility).
1. TL;DR results (vs. the base model it was trained on)
| Axis | Metric | gemma-3-4b-it (base) | HelaLM-4B | Δ |
|---|---|---|---|---|
| Open-ended generation | Win rate, blind pairwise (n=300) | 7.7% (won 23/300) | 92.3% (won 277/300) | +84.7% vs 50% ref |
| Knowledge | Global-MMLU si (acc, n=1000) | 36.6% | 37.8% | +3.3% rel. |
| Reading comprehension | Belebele sin_Sinh (acc, n=900) | 58.3% | 60.4% | +3.6% rel. |
| Curriculum knowledge | SinhalaMMLU (acc, n=1000) | 40.6% | 42.0% | +3.4% rel. |
| Robustness | Foreign-script code-switching | 84 / 300 responses | 0 / 300 | eliminated |
| Robustness | Degenerate repetition loops | 21 / 300 | 1 / 300 | −95% |
The headline is the generation win rate: Sinhala quality is fundamentally a generative, not a multiple-choice, property, and pairwise preference is the metric used by the multilingual-evaluation lineage this work builds on (Aya / Global-MMLU). HelaLM also raises MCQ accuracy - notable because instruction-tuning for fluent long-form generation frequently degrades logprob-style MCQ scoring.
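As a quick arithmetic check, the relative Δ values for the accuracy and repetition rows can be recomputed from the two score columns. The sketch below assumes Δ is defined as (adapted − base) / base; the helper name `relative_change` is illustrative and not part of the released harness.

```python
def relative_change(base: float, adapted: float) -> float:
    """Relative change of `adapted` over `base`, in percent."""
    return (adapted - base) / base * 100.0

# (base, adapted) pairs copied from the results table.
rows = {
    "Global-MMLU si": (36.6, 37.8),
    "Belebele sin_Sinh": (58.3, 60.4),
    "SinhalaMMLU": (40.6, 42.0),
    "Repetition loops (count)": (21, 1),
}

for name, (base, adapted) in rows.items():
    print(f"{name}: {relative_change(base, adapted):+.1f}% rel.")
```

Under that definition the script reproduces the +3.3%, +3.6%, and +3.4% entries, and about −95% for repetition loops. The generation row is different: its Δ is measured against a 50% parity reference, not against the base model's 7.7%.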
[Figure: Results at a glance]
2. Why this works: a controlled study, not a single run
The final model is the product of a 4-run ablation that isolates the two variables that actually govern low-resource adaptation - base model and training intensity - before scaling data:
| Run | Base | Recipe | Data | Outcome |
|---|---|---|---|---|
| 1 | Gemma-3-4b-it | LoRA r=8, 2 ep, attn-only | Aya-only (≈10k) | No gain. Recipe too light to imprint a language. |
| 2 | Llama-3.2-3B | LoRA r=32, 3 ep, all-linear | seed v1 (14k) | MCQ ↑ but generation degenerate - base too weak at Sinhala script. |
| 3 | Gemma-3-4b-it | LoRA r=32, 3 ep, all-linear | seed v1 (14k) | 94.2% win vs base. First strong model. |
| 4 (this) | Gemma-3-4b-it | LoRA r=32, 3 ep, all-linear | seed v2 (35k + math/comprehension) | 92.3% win vs base; beats run 3 head-to-head (56.8% win rate); fixes MCQ regressions. |
Two findings of independent interest:
- Base capacity dominates at 3-4B. Identical recipe + data on Llama-3.2-3B (run 2) produced fluent-looking but degenerate Sinhala; the same on Gemma-3-4b-it (run 3) produced a 94%-preferred model. Gemma's 262k-token vocabulary tokenizes Sinhala script far more efficiently, which we believe is decisive at this scale.
- Targeted data beats more data. Run 3 (14k) regressed on Belebele (−11% rel.). Run 4 rebalanced the mixture (added programmatic math-reasoning + headline-comprehension shards) and recovered every MCQ benchmark to net-positive without sacrificing generation quality.
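The card does not say how foreign-script code-switching or repetition loops were counted. As one possible heuristic, and purely an assumption rather than the released evaluation code, a response can be flagged by the share of letters outside the Sinhala Unicode block (U+0D80 to U+0DFF) and by consecutive repeats of a short word sequence. The function names and default thresholds below are hypothetical.

```python
import unicodedata

SINHALA_BLOCK = range(0x0D80, 0x0E00)  # Unicode Sinhala block, U+0D80..U+0DFF

def foreign_letter_ratio(text: str) -> float:
    """Fraction of letters (Unicode category L*) outside the Sinhala block."""
    letters = [ch for ch in text if unicodedata.category(ch).startswith("L")]
    if not letters:
        return 0.0
    foreign = [ch for ch in letters if ord(ch) not in SINHALA_BLOCK]
    return len(foreign) / len(letters)

def has_repetition_loop(text: str, max_unit: int = 3, min_repeats: int = 4) -> bool:
    """True if a unit of 1..max_unit words repeats min_repeats times in a row."""
    words = text.split()
    for size in range(1, max_unit + 1):
        for start in range(len(words) - size * min_repeats + 1):
            unit = words[start:start + size]
            if all(words[start + k * size:start + (k + 1) * size] == unit
                   for k in range(min_repeats)):
                return True
    return False

print(foreign_letter_ratio("වෙසක් පොහොය"))      # pure Sinhala text
print(foreign_letter_ratio("වෙසක් festival"))   # mixed Sinhala and Latin
print(has_repetition_loop("එය එය එය එය එය"))     # one word repeated five times
```

A real check would need an allowance for legitimate Latin acronyms, digits, and quoted foreign terms, so a ratio threshold (rather than any non-Sinhala letter) is likely the safer choice.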
3. Training details
- Base: google/gemma-3-4b-it (text decoder; Gemma3ForCausalLM).
- Method: LoRA SFT, completion-only loss (train_on_inputs=false).
- LoRA: rank 32, α 64, dropout 0.05, target = all linear projections (q, k, v, o, gate, up, down).
- Optimizer: AdamW, lr 1e-4, cosine schedule, warmup 0.05, weight decay 0.01, grad-clip 1.0, bf16.
- Schedule: 3 epochs over 34,667 examples; final training loss 0.792; ~7.6 h on a single A100-80GB.
- Chat format: Gemma chat template; loss masked to assistant turns only.
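To illustrate what completion-only loss means in practice, the sketch below masks every non-assistant token with −100, the default `ignore_index` of PyTorch's cross-entropy loss, so that only assistant tokens contribute to the loss. The token IDs and span boundaries are invented for illustration, and `mask_non_assistant` is a hypothetical helper, not the training code used for HelaLM.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_non_assistant(input_ids, assistant_spans):
    """Return labels equal to input_ids inside assistant spans, IGNORE_INDEX elsewhere.

    assistant_spans: list of (start, end) index pairs, end exclusive.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# Toy example: 4 prompt tokens followed by 3 assistant tokens (IDs are invented).
input_ids = [2, 105, 2364, 107, 9001, 9002, 106]
labels = mask_non_assistant(input_ids, assistant_spans=[(4, 7)])
print(labels)  # [-100, -100, -100, -100, 9001, 9002, 106]
```

In a multi-turn conversation the same helper would take one span per assistant turn, leaving all user turns masked.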
4. Training data (released)
A 34,667-example Sinhala instruction set, co-optimized with Adaptive Data (Adaption Labs) from
openly-licensed seed corpora, then curated. Composition (post-adaptation): human-written instructions
(Aya), large-scale instruction coverage (Aya Collection, quality-filtered), abstractive summarization
(XL-Sum), headline comprehension (NSINA), translation (FLORES+), exam-format MCQ with explanations
(Global-MMLU dev + Wikipedia-grounded synthetic), and a programmatic math-reasoning shard with worked
solutions. Every row carries per-row license/shard/source provenance. Full breakdown, licensing,
and attribution: [helalm-data dataset card].
5. Evaluation protocol (reproducible)
- Suite: Global-MMLU si (test), SinhalaMMLU, Belebele sin_Sinh (logprob over A-D, matched code path for base & adapted), FLORES+ chrF, and a 300-prompt open-ended set spanning 10 domains (culture, history, geography, food, practical, exam, general knowledge, writing, reasoning, grammar) authored to be answerable without post-training-cutoff knowledge.
- Generation judging: blind pairwise. For each prompt the two responses are presented in randomized A/B order (labels hidden from judges); 10 independent judges score disjoint shards under a fixed rubric (language fidelity → non-degeneration → factual/cultural accuracy → completeness). Win rate = (wins + 0.5·ties)/judged. Contamination controlled: test splits never appear in training (SinhalaMMLU is eval-mirror-only).
- All scripts and the prompt set are released for exact reproduction.
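The win-rate formula and the blind A/B presentation can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the seed, and the dictionary layout are hypothetical and not taken from the released scripts; only the formula and the 277/23/0 tally come from this card.

```python
import random

def win_rate(wins: int, ties: int, losses: int) -> float:
    """Win rate = (wins + 0.5 * ties) / judged."""
    judged = wins + ties + losses
    return (wins + 0.5 * ties) / judged

def blind_pair(base_resp: str, adapted_resp: str, rng: random.Random) -> dict:
    """Assign the two responses to labels A/B in random order.

    The returned 'key' maps labels back to systems and is withheld from judges.
    """
    systems = [("base", base_resp), ("adapted", adapted_resp)]
    rng.shuffle(systems)
    (name_a, text_a), (name_b, text_b) = systems
    return {"A": text_a, "B": text_b, "key": {"A": name_a, "B": name_b}}

rate = win_rate(wins=277, ties=0, losses=23)
print(f"win rate: {rate:.1%}")                               # 92.3%
print(f"relative to 50% parity: {(rate - 0.5) / 0.5:+.1%}")  # +84.7%

pair = blind_pair("response from base", "response from adapted", random.Random(0))
print(sorted(pair["key"].values()))
```

Counting a tie as half a win keeps the metric symmetric: the two systems' win rates always sum to 100%.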
6. Limitations & responsible use
- Mathematical reasoning is the one category where HelaLM does not lead its predecessor: its verbose worked solutions can exceed short generation budgets and truncate before the final answer. Use a generation budget of at least 768 tokens for math.
- Knowledge is bounded by the base model and the adaptation corpus; not for high-stakes (medical, legal, financial) decisions without expert review.
- Inherits base-model biases; Sinhala safety filtering is best-effort.
- Released as a LoRA adapter (and merged weights), intended as a research artifact for Sinhala NLP.
7. Reproducibility & license
- Eval harness, data-build pipeline, and per-checkpoint scoreboard are released alongside the model.
- License: governed by the Gemma Terms of Use and the Gemma Prohibited Use Policy. Built from google/gemma-3-4b-it; "Gemma" is a trademark of Google LLC. Redistribution complies with the Gemma Terms; downstream users are bound by them.
8. Usage
from transformers import AutoTokenizer, Gemma3ForCausalLM
from peft import PeftModel
import torch
tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")  # tokenizer and chat template come from the base model
base = Gemma3ForCausalLM.from_pretrained("google/gemma-3-4b-it", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, "Jainamshahhh/helalm-4b").eval()  # or load the merged model directly
msg = [{"role": "user", "content": "වෙසක් පොහොය දිනයේ වැදගත්කම පැහැදිලි කරන්න."}]  # "Explain the significance of Vesak Poya day."
ids = tok.apply_chat_template(msg, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(ids, max_new_tokens=512)  # raise to >= 768 for math prompts (see Limitations)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))  # decode only the newly generated tokens
Author: Jainam Shah. Built with Adaptive Data (Adaption Labs) and the Gemma open model.