HelaLM-4B - A State-of-the-Art Open Instruction Model for Sinhala (සිංහල)

HelaLM (හෙළ, an ancient endonym for Sri Lanka and the Sinhala people) is an instruction-tuned adaptation of google/gemma-3-4b-it for Sinhala, a language spoken by ~22 million people yet making up well under 0.1% of public web text. On a held-out 300-prompt native-Sinhala evaluation, HelaLM is preferred over its own base model in 92.3% of blind pairwise comparisons (277 wins, 23 losses, 0 ties; +84.7% relative to the 50% parity reference), while also improving every multiple-choice knowledge benchmark we measured.

Most low-resource adaptations reach for older bases such as Llama-3.2. HelaLM is built instead on Gemma-3-4b, Google's newer-generation open model, and the choice is evidence-backed, not cosmetic: in our controlled ablation the identical recipe on Llama-3.2-3B produced degenerate, code-switched Sinhala, while on Gemma-3-4b-it it produced a model preferred over its base in 94.2% of comparisons. Note also what the 92.3% means. Picking a weak base inflates any "relative improvement" number, because a broken model is trivial to beat. We deliberately took the strongest modern base, which already produces passable Sinhala, and HelaLM still wins 92.3% of blind comparisons against it. That is a far harder and more meaningful result.

This card documents the full methodology, the controlled ablations that produced it, the evaluation protocol, and the failure modes, with the explicit goal of being reproducible end to end.

Submitted to the Adaption AutoScientist Challenge (Part 1 - Language). The training corpus was produced with Adaptive Data (Adaption Labs); the model was trained on the resulting adapted dataset. Both dataset and weights are released openly (see Reproducibility).

1. TL;DR results (vs. the base model it was trained on)

Axis	Metric	`gemma-3-4b-it` (base)	HelaLM-4B	Δ
Open-ended generation	Win rate, blind pairwise (n=300)	7.7% (won 23/300)	92.3% (won 277/300)	+84.7% vs 50% ref
Knowledge	Global-MMLU `si` (acc, n=1000)	36.6%	37.8%	+3.3% rel.
Reading comprehension	Belebele `sin_Sinh` (acc, n=900)	58.3%	60.4%	+3.6% rel.
Curriculum knowledge	SinhalaMMLU (acc, n=1000)	40.6%	42.0%	+3.4% rel.
Robustness	Foreign-script code-switching	84 / 300 responses	0 / 300	eliminated
Robustness	Degenerate repetition loops	21 / 300	1 / 300	−95%

The headline is the generation win rate: Sinhala quality is fundamentally a generative, not a multiple-choice, property, and pairwise preference is the metric used by the multilingual-evaluation lineage this work builds on (Aya / Global-MMLU). HelaLM also raises MCQ accuracy - notable because instruction-tuning for fluent long-form generation frequently degrades logprob-style MCQ scoring.
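
The Δ column in the table above is plain relative change, and can be re-derived from the reported figures. The sketch below is illustrative only: the helper name `relative_gain` is ours, and the win-rate row is measured against the 50% parity reference rather than against the base model's own 7.7%.

```python
def relative_gain(base: float, new: float) -> float:
    """Relative change from `base` to `new`, in percent."""
    return (new - base) / base * 100.0


# (base, HelaLM-4B) pairs copied from the results table.
rows = {
    "Global-MMLU si (acc %)": (36.6, 37.8),
    "Belebele sin_Sinh (acc %)": (58.3, 60.4),
    "SinhalaMMLU (acc %)": (40.6, 42.0),
    # Win rate is compared with the 50% parity reference, not with 7.7%.
    "Win rate vs 50% reference": (50.0, 277 / 300 * 100.0),
    "Degenerate repetition loops (count)": (21, 1),
}

for name, (base, new) in rows.items():
    print(f"{name}: {relative_gain(base, new):+.1f}%")
```

The MCQ gains are small in absolute terms (1.2 to 2.1 points), which is why the table reports them as relative changes alongside the raw accuracies.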

[Figure: Results at a glance]

2. Why this works: a controlled study, not a single run

The final model is the product of a 4-run ablation that isolates the two variables that actually govern low-resource adaptation - base model and training intensity - before scaling data:

Run	Base	Recipe	Data	Outcome
1	Gemma-3-4b-it	LoRA r=8, 2 ep, attn-only	Aya-only (≈10k)	No gain. Recipe too light to imprint a language.
2	Llama-3.2-3B	LoRA r=32, 3 ep, all-linear	seed v1 (14k)	MCQ ↑ but generation degenerate - base too weak at Sinhala script.
3	Gemma-3-4b-it	LoRA r=32, 3 ep, all-linear	seed v1 (14k)	94.2% win vs base. First strong model.
4 (this)	Gemma-3-4b-it	LoRA r=32, 3 ep, all-linear	seed v2 (35k + math/comprehension)	92.3% win vs base; beats run 3 head-to-head 56.8%; fixes MCQ regressions.

Two findings of independent interest:

Base capacity dominates at 3-4B. The identical recipe and data on Llama-3.2-3B (run 2) produced fluent-looking but degenerate Sinhala; on Gemma-3-4b-it (run 3) they produced a model preferred over its base in 94.2% of comparisons. Gemma's 262k-token vocabulary tokenizes Sinhala script far more efficiently than Llama-3.2's tokenizer, which we believe is decisive at this scale.
Targeted data beats more data. Run 3 (14k) regressed on Belebele (−11% rel.). Run 4 rebalanced the mixture (added programmatic math-reasoning + headline-comprehension shards) and recovered every MCQ benchmark to net-positive without sacrificing generation quality.
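
The tokenizer-efficiency point can be checked with a simple tokens-per-character measurement. The sketch below is not the card's own analysis: `tokens_per_char` and `utf8_byte_encode` are hypothetical helper names, and the byte-level encoder is only a worst-case stand-in (every Sinhala code point takes 3 bytes in UTF-8). Real tokenizers would be passed in the same way, as shown in the trailing comments.

```python
from typing import Callable, Sequence


def tokens_per_char(encode: Callable[[str], Sequence[int]],
                    texts: Sequence[str]) -> float:
    """Average number of tokens spent per Unicode character (lower is better)."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_chars = sum(len(t) for t in texts)
    return total_tokens / total_chars


def utf8_byte_encode(text: str) -> list:
    """Worst-case baseline: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


sample = ["වෙසක් පොහොය දිනයේ වැදගත්කම පැහැදිලි කරන්න."]
print(f"byte-level baseline: {tokens_per_char(utf8_byte_encode, sample):.2f} tokens/char")

# With real tokenizers (requires downloads; the Llama repo is gated), e.g.:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
#   tokens_per_char(lambda t: tok.encode(t, add_special_tokens=False), sample)
```

Running the same function over a larger Sinhala sample with both tokenizers would quantify the gap that the finding above attributes to vocabulary size.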

3. Training details

Base: google/gemma-3-4b-it (text decoder; Gemma3ForCausalLM).
Method: LoRA SFT, completion-only loss (train_on_inputs=false).
LoRA: rank 32, α 64, dropout 0.05, target = all linear projections (q,k,v,o,gate,up,down).
Optimizer: AdamW, lr 1e-4, cosine schedule, warmup 0.05, weight decay 0.01, grad-clip 1.0, bf16.
Schedule: 3 epochs over 34,667 examples; final training loss 0.792; ~7.6 h on a single A100-80GB.
Chat format: Gemma chat template; loss masked to assistant turns only.
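
As a rough illustration of the settings listed above, the sketch below expresses the LoRA hyperparameters as keyword arguments for `peft.LoraConfig` and shows one common way to implement completion-only loss: replacing prompt-token labels with -100, which PyTorch's cross-entropy ignores by default. This is an assumption-laden sketch, not the released training script: the module names are the usual Hugging Face names for Gemma's linear projections, `build_lora_config` and `mask_prompt_labels` are hypothetical helpers, and the masking shown handles a single-turn example only.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

# Hyperparameters from the list above (rank 32, alpha 64, dropout 0.05).
LORA_KWARGS = dict(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    # Assumed module names for (q,k,v,o,gate,up,down) in the HF Gemma implementation.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)


def build_lora_config():
    """Build a peft LoraConfig; imported lazily so the rest runs without peft."""
    from peft import LoraConfig
    return LoraConfig(**LORA_KWARGS)


def mask_prompt_labels(input_ids, prompt_len):
    """Labels for completion-only loss: prompt positions ignored, the rest kept."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Toy example: 3 prompt tokens followed by 2 assistant tokens.
print(mask_prompt_labels([5, 6, 7, 8, 9], prompt_len=3))
```

With α = 2 × rank, the adapter update is scaled by α/r = 2, a common default pairing.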

4. Training data (released)

A 34,667-example Sinhala instruction set, co-optimized with Adaptive Data (Adaption Labs) from openly-licensed seed corpora, then curated. Composition (post-adaptation): human-written instructions (Aya), large-scale instruction coverage (Aya Collection, quality-filtered), abstractive summarization (XL-Sum), headline comprehension (NSINA), translation (FLORES+), exam-format MCQ with explanations (Global-MMLU dev + Wikipedia-grounded synthetic), and a programmatic math-reasoning shard with worked solutions. Every row carries per-row license/shard/source provenance. Full breakdown, licensing, and attribution: [helalm-data dataset card].

5. Evaluation protocol (reproducible)

Suite: Global-MMLU si (test), SinhalaMMLU, Belebele sin_Sinh (logprob over A-D, matched code path for base & adapted), FLORES+ chrF, and a 300-prompt open-ended set spanning 10 domains (culture, history, geography, food, practical, exam, general knowledge, writing, reasoning, grammar) authored to be answerable without post-training-cutoff knowledge.
Generation judging: blind pairwise. For each prompt the two responses are presented in randomized A/B order (labels hidden from judges); 10 independent judges score disjoint shards under a fixed rubric (language fidelity → non-degeneration → factual/cultural accuracy → completeness). Win rate = (wins + 0.5·ties)/judged. Contamination controlled: test splits never appear in training (SinhalaMMLU is eval-mirror-only).
All scripts and the prompt set are released for exact reproduction.
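
The judging arithmetic above is small enough to sketch directly. The snippet below is a minimal illustration, not the released harness: `win_rate` implements the stated formula, and `blind_pair` is a hypothetical helper showing one way to randomize A/B order while keeping a private key that maps labels back to systems.

```python
import random


def win_rate(wins: int, losses: int, ties: int) -> float:
    """Win rate = (wins + 0.5 * ties) / judged."""
    judged = wins + losses + ties
    return (wins + 0.5 * ties) / judged


def blind_pair(base_resp: str, tuned_resp: str, rng: random.Random):
    """Return (shown_a, shown_b, key); the key is withheld from judges."""
    if rng.random() < 0.5:
        return base_resp, tuned_resp, {"A": "base", "B": "tuned"}
    return tuned_resp, base_resp, {"A": "tuned", "B": "base"}


# Headline result: 277 wins, 23 losses, 0 ties over 300 prompts.
print(f"{win_rate(277, 23, 0):.1%}")

rng = random.Random(0)  # fixed seed so the presentation order is reproducible
shown_a, shown_b, key = blind_pair("base answer", "tuned answer", rng)
```

Counting a tie as half a win keeps the metric symmetric: the two systems' win rates always sum to 1.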

6. Limitations & responsible use

Mathematical reasoning is the one category where HelaLM does not lead its predecessor: its verbose worked solutions can exceed short generation budgets and truncate before the final answer. Use a generation budget of at least 768 tokens for math (e.g. max_new_tokens=768).
Knowledge is bounded by the base model and the adaptation corpus; not for high-stakes (medical, legal, financial) decisions without expert review.
Inherits base-model biases; Sinhala safety filtering is best-effort.
Released as a LoRA adapter (with merged weights) and intended as a research artifact for Sinhala NLP.
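
Because truncation is the main math failure mode noted above, it can help to detect when a response used its whole token budget without finishing. The check below is a sketch under stated assumptions: `hit_token_budget` is a hypothetical helper, it inspects a single unpadded sequence, and the caller supplies the set of end-of-sequence token ids for the tokenizer in use.

```python
def hit_token_budget(new_token_ids, eos_token_ids, max_new_tokens):
    """True if generation consumed the full budget without emitting an end token."""
    ended = any(t in eos_token_ids for t in new_token_ids)
    return len(new_token_ids) >= max_new_tokens and not ended


# Toy ids: 0 stands in for an end-of-sequence token.
print(hit_token_budget([4, 8, 15], eos_token_ids={0}, max_new_tokens=3))  # budget exhausted
print(hit_token_budget([4, 8, 0], eos_token_ids={0}, max_new_tokens=3))   # ended cleanly
```

A caller could retry a flagged math prompt with a larger budget rather than scoring a truncated answer.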

7. Reproducibility & license

Eval harness, data-build pipeline, and per-checkpoint scoreboard are released alongside the model.
License: governed by the Gemma Terms of Use and the Gemma Prohibited Use Policy. Built from google/gemma-3-4b-it; "Gemma" is a trademark of Google LLC. Redistribution complies with the Gemma Terms; downstream users are bound by them.

8. Usage

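from transformers import AutoTokenizer, Gemma3ForCausalLM
from peft import PeftModel
import torch

# Load the base tokenizer and text decoder, then attach the HelaLM LoRA adapter.
tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
base = Gemma3ForCausalLM.from_pretrained("google/gemma-3-4b-it", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, "Jainamshahhh/helalm-4b").eval()   # or load the merged model directly

# Prompt: "Explain the significance of Vesak Poya day."
msg = [{"role": "user", "content": "වෙසක් පොහොය දිනයේ වැදගත්කම පැහැදිලි කරන්න."}]
# return_dict=True yields input_ids plus attention_mask, ready to unpack into generate().
inputs = tok.apply_chat_template(msg, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=512)  # use >=768 for math prompts
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))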

Author: Jainam Shah. Built with Adaptive Data (Adaption Labs) and the Gemma open model.
