chamgei-kal2sw-nllb600m — r001 (epoch 32)

LoRA adapter that fine-tunes facebook/nllb-200-distilled-600M for Kalenjin → Swahili translation. It is the companion to the forward-direction adapter used in chamgei.com's translate demo: it adds the reverse direction for the bidirectional translate UX, and it enables Kalenjin → English by pivoting through Helsinki SW↔EN.

Trained by Tony Kipkemboi at chamgei.labs, a research effort focused on machine translation for underserved Kenyan languages.

Highlights

chrF++ 68.55 / BLEU 57.58 on the inverted 250-row thinkKenya kln_swa/test holdout (seed=42)
+9.76 chrF++ over the equivalent SW → KAL recipe (Phase 1d, which replicated mutaician's 58.79 chrF++ on the same data)
Direction-asymmetry win: generating high-resource Swahili from low-resource Kalenjin is easier than the reverse, even with identical paired data — and this run quantifies the gap

Tag scheme

This repository hosts the chamgei-kal2sw-nllb600m family. Specific runs are pinned via revision tags:

| Tag | Description | chrF++ |
|---|---|---|
| `r001-ep32` ⭐ | Epoch-32 checkpoint of run r001 (peak quality) | 68.55 |

The main branch always points at the latest recommended adapter.

Use

from transformers import AutoModelForSeq2SeqLM, NllbTokenizerFast
from peft import PeftModel
import torch

base = "facebook/nllb-200-distilled-600M"
adapter = "Tonykip/chamgei-kal2sw-nllb600m"  # main = latest, or revision="r001-ep32"

tokenizer = NllbTokenizerFast.from_pretrained(base)
tokenizer.src_lang = "luo_Latn"  # Kalenjin via the trained hijack
model = AutoModelForSeq2SeqLM.from_pretrained(base, torch_dtype=torch.float32)
model = PeftModel.from_pretrained(model, adapter)
model.eval()

text = "kere inee oleloo"  # "anatazama mbali" (he is looking far)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("swh_Latn"),
    max_length=256,
    num_beams=5,
    length_penalty=1.1,
    early_stopping=True,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
# → "Yeye ni kuangalia mbali."
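
For more than one sentence, the same generation settings can be wrapped in a small batching helper. This is a minimal sketch, not part of this repository: `translate_batch` and its `batch_size` parameter are hypothetical names, and it assumes `model` and `tokenizer` were loaded as in the snippet above.

```python
def translate_batch(texts, model, tokenizer, batch_size=16):
    """Translate Kalenjin sentences to Swahili in fixed-size batches.

    Hypothetical helper: `model` is the PEFT-wrapped NLLB model and
    `tokenizer` the NLLB tokenizer from the usage snippet.
    """
    tokenizer.src_lang = "luo_Latn"  # Kalenjin rides on the luo_Latn tag
    target_id = tokenizer.convert_tokens_to_ids("swh_Latn")
    translations = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # Pad to the longest sentence in the chunk, truncate at the
        # training max sequence length.
        inputs = tokenizer(
            chunk,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=256,
        ).to(model.device)
        generated = model.generate(
            **inputs,
            forced_bos_token_id=target_id,
            max_length=256,
            num_beams=5,
            length_penalty=1.1,
            early_stopping=True,
        )
        translations.extend(
            tokenizer.batch_decode(generated, skip_special_tokens=True)
        )
    return translations

# Usage, once model and tokenizer are loaded:
# print(translate_batch(["kere inee oleloo"], model, tokenizer))
```

The default batch size of 16 is an arbitrary starting point; lower it if beam search with 5 beams runs out of memory.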

Training recipe

| Setting | Value |
|---|---|
| Base model | `facebook/nllb-200-distilled-600M` |
| Adapter | LoRA r=64, alpha=128, dropout=0.05 |
| LoRA target modules | `q_proj, k_proj, v_proj, out_proj, fc1, fc2` (attention + FFN) |
| Trainable params | 34,603,008 (5.33% of all parameters, base + adapter) |
| Training direction | KAL → SW (`luo_Latn` → `swh_Latn`) |
| Warm-start | Phase 1d adapter (SW → KAL on same data, chrF++ 58.79) |
| Training data | `thinkKenya/kenyan-low-resource-language-data`, `kln_swa` split, ~28k pairs |
| Eval data | Same source, `test` split, 250-row sample with `random_state=42` |
| Learning rate / scheduler | 1e-4 / cosine, warmup_ratio=0.05 |
| Per-device batch size / grad accum | 8 / 1 |
| Max seq length | 256 |
| Epochs trained | 42 (peak at epoch 32) |
| Optimizer / precision | AdamW (default) / fp16 |
| Seed | 42 |
| Compute | A10G, 6h 9min wall-clock, ~$6.80 |
| Inference | `num_beams=5`, `length_penalty=1.1` |
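
The trainable-parameter count in the table can be reproduced from the LoRA settings alone. The sketch below rests on assumptions this card does not state: the layer dimensions are taken from the base model's published configuration (hidden size 1024, feed-forward size 4096, 12 encoder and 12 decoder layers), each adapted linear layer adds rank × (in_features + out_features) weights with no bias, and the base model size is rounded to about 615M parameters.

```python
# Assumed dimensions of facebook/nllb-200-distilled-600M (from its config).
D_MODEL = 1024
FFN_DIM = 4096
ENCODER_LAYERS = 12
DECODER_LAYERS = 12
RANK = 64  # LoRA r from the recipe table

# Assumption: approximate parameter count of the frozen base model.
BASE_PARAMS_APPROX = 615_000_000


def lora_params(in_features, out_features, rank=RANK):
    """Weights added by one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


# q_proj, k_proj, v_proj, out_proj are all D_MODEL x D_MODEL.
attention_block = 4 * lora_params(D_MODEL, D_MODEL)
# fc1 expands to FFN_DIM, fc2 projects back to D_MODEL.
ffn_block = lora_params(D_MODEL, FFN_DIM) + lora_params(FFN_DIM, D_MODEL)

# Encoder layers have self-attention only; decoder layers add cross-attention.
encoder = ENCODER_LAYERS * (attention_block + ffn_block)
decoder = DECODER_LAYERS * (2 * attention_block + ffn_block)
trainable = encoder + decoder

share_of_total = trainable / (BASE_PARAMS_APPROX + trainable)
share_of_base = trainable / BASE_PARAMS_APPROX

print(f"trainable LoRA params: {trainable:,}")
print(f"share of base + adapter: {share_of_total:.2%}")
print(f"share of base alone:     {share_of_base:.2%}")
```

Under these assumptions the total comes to 34,603,008, and the 5.33% figure corresponds to the adapter's share of all parameters (base plus adapter) rather than of the base model alone.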

Recipe inherits from Phase 1d's mutaician-equivalent setup. The novel pieces here are: direction inversion (KAL → SW instead of SW → KAL), warm-starting from Phase 1d's adapter (reuses the trained Kalenjin BPE embeddings), and tuning the inference length penalty.
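
To make the learning-rate row concrete, here is a minimal sketch of a linear-warmup, cosine-decay schedule with the recipe's values. It assumes the schedule was sized for the full 42 epochs (147,546 optimizer steps) and that warmup steps are rounded up from `warmup_ratio`; whether the training script used exactly this formulation is an assumption, and `lr_at` is a hypothetical name.

```python
import math

PEAK_LR = 1e-4          # learning rate from the recipe table
WARMUP_RATIO = 0.05     # warmup_ratio from the recipe table
TOTAL_STEPS = 147_546   # assumption: schedule spans all 42 epochs

WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step):
    """Learning rate at a given optimizer step (illustrative only)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Learning rate at the evaluated checkpoints (steps from the training curve).
for epoch, step in [(11, 38_643), (21, 73_773), (32, 112_416), (42, 147_546)]:
    print(f"epoch {epoch:>2}: lr = {lr_at(step):.2e}")
```

If the schedule did follow this shape, the learning rate had already decayed well below its peak by epoch 32, which would fit the plateau seen in the later epochs.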

Training curve

Evaluations at roughly 25 / 50 / 75 / 100 % of training, plus the epoch just before the end (same 250-row holdout, KAL → SW direction):

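| Epoch | Step | chrF++ | BLEU | Δ chrF++ vs Phase 1d (58.79) |
|---|---|---|---|---|
| 11 | 38,643 | 53.65 | 33.90 | -5.14 |
| 21 | 73,773 | 64.69 | 52.11 | +5.90 |
| 32 ⭐ | 112,416 | 68.55 | 57.58 | +9.76 (peak) |
| 41 | 144,033 | 68.51 | 57.69 | +9.72 |
| 42 | 147,546 | 68.49 | 57.66 | +9.70 |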

The published checkpoint (r001-ep32) is the epoch-32 peak. Epochs 32, 41, and 42 are all within 0.06 chrF++ of each other, so the model had plateaued by epoch 32. Future runs in this family should train to ~32 epochs for the same quality at about 24% less compute (32 of 42 epochs).
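
The numbers in this section can be cross-checked with a few lines of arithmetic. The sketch below copies the rows from the training curve and assumes a single GPU, so that one optimizer step consumes one batch of 8 pairs; the variable names are illustrative.

```python
PHASE_1D_CHRF = 58.79  # SW → KAL baseline quoted in this card
BATCH_SIZE = 8         # per-device batch size, gradient accumulation 1
TOTAL_EPOCHS = 42
PEAK_EPOCH = 32

# (epoch, optimizer step, chrF++) copied from the training curve.
EVALS = [
    (11, 38_643, 53.65),
    (21, 73_773, 64.69),
    (32, 112_416, 68.55),
    (41, 144_033, 68.51),
    (42, 147_546, 68.49),
]

# Every row should imply the same number of optimizer steps per epoch.
assert all(step % epoch == 0 for epoch, step, _ in EVALS)
(steps_per_epoch,) = {step // epoch for epoch, step, _ in EVALS}

# Upper bound on training pairs per epoch (the last batch may be partial).
max_pairs = steps_per_epoch * BATCH_SIZE

# chrF++ gain over the Phase 1d baseline at each evaluation.
deltas = {epoch: round(chrf - PHASE_1D_CHRF, 2) for epoch, _, chrf in EVALS}

# Spread between the best and worst score from the peak epoch onward.
late = [chrf for epoch, _, chrf in EVALS if epoch >= PEAK_EPOCH]
plateau_spread = round(max(late) - min(late), 2)

# Fraction of compute saved by stopping at the peak epoch.
compute_saved = 1 - PEAK_EPOCH / TOTAL_EPOCHS

print(f"steps per epoch: {steps_per_epoch}, pairs per epoch <= {max_pairs:,}")
print(f"deltas vs Phase 1d: {deltas}")
print(f"plateau spread: {plateau_spread} chrF++")
print(f"compute saved by stopping at epoch {PEAK_EPOCH}: {compute_saved:.0%}")
```

The step counts work out to 3,513 steps per epoch, or at most 28,104 pairs, which is consistent with the ~28k training pairs listed in the recipe.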

Limitations

Stylistic shifts in 30% of outputs: the model sometimes embellishes (adding Yeye, sana, etc.) or shifts mood (declarative → interrogative). A length-penalty sweep confirmed this is learned behaviour, not a decoding artifact; future runs will explore RL with chrF++ reward or filtered training data.
Rare-vocabulary misses — words like chigoni (kitchen) and number/multiplier compounds like taman ... taman occasionally produce off-meaning translations.
Dialect coverage — training data is mainstream Kalenjin (Nandi + Kipsigis tagged); other sub-tribes (Tugen, Marakwet, Sabaot, Keiyo, Pokot, Sengwer, Ogiek, Terik) have 0 rows in the corpus.
Domain coverage — thinkKenya leans toward everyday + religious + procedural text; legal, technical, or scientific Kalenjin is out-of-distribution.

License

CC-BY-NC-4.0, inheriting from the facebook/nllb-200-distilled-600M base model. Non-commercial use only. For commercial inquiries, contact iamtonykipkemboi@gmail.com.

Acknowledgments

thinkKenya for the kenyan-low-resource-language-data corpus
mutaician for publishing the NLLB+LoRA Western-Nilotic-hijack recipe that this run inverts and extends
Meta AI for the NLLB-200 base model
The Modal team for the training infrastructure

Citation

If you use this adapter in research, please cite:

@misc{chamgei_kal2sw_2026,
  author = {Kipkemboi, Tony},
  title = {chamgei-kal2sw-nllb600m: NLLB+LoRA for Kalenjin to Swahili Translation},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Tonykip/chamgei-kal2sw-nllb600m}}
}
