chamgei-kal2sw-nllb600m: r001 (epoch 32)

LoRA adapter that fine-tunes facebook/nllb-200-distilled-600M for Kalenjin → Swahili translation. It is the companion to the forward-direction (Swahili → Kalenjin) adapter used in chamgei.com's translate demo: it unlocks the reverse direction for the bidirectional translate UX and enables Kalenjin → English transitively by pivoting through Helsinki SW↔EN.

Trained by Tony Kipkemboi at chamgei.labs, a research effort focused on machine translation for underserved Kenyan languages.

Highlights

  • chrF++ 68.55 / BLEU 57.58 on the inverted 250-row thinkKenya kln_swa/test holdout (seed=42)
  • +9.76 chrF++ over the equivalent SW → KAL recipe (Phase 1d replicated mutaician's 58.79 on the same data)
  • Direction-asymmetry win: generating high-resource Swahili from low-resource Kalenjin is easier than the reverse, even with identical paired data, and this run quantifies the gap

Tag scheme

This repository hosts the chamgei-kal2sw-nllb600m family. Specific runs are pinned via revision tags:

| Tag | What | chrF++ |
| --- | --- | --- |
| r001-ep32 ⭐ | Epoch-32 checkpoint of run r001 (peak quality) | 68.55 |

The main branch always points at the latest recommended adapter.

Use

from transformers import AutoModelForSeq2SeqLM, NllbTokenizerFast
from peft import PeftModel
import torch

base = "facebook/nllb-200-distilled-600M"
adapter = "Tonykip/chamgei-kal2sw-nllb600m"  # main = latest; to pin a run, pass revision="r001-ep32" to PeftModel.from_pretrained

tokenizer = NllbTokenizerFast.from_pretrained(base)
tokenizer.src_lang = "luo_Latn"  # Kalenjin via the trained hijack
model = AutoModelForSeq2SeqLM.from_pretrained(base, torch_dtype=torch.float32)
model = PeftModel.from_pretrained(model, adapter)
model.eval()

text = "kere inee oleloo"  # "anatazama mbali" (he is looking far)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("swh_Latn"),
    max_length=256,
    num_beams=5,
    length_penalty=1.1,
    early_stopping=True,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
# → "Yeye ni kuangalia mbali."

Training recipe

| Setting | Value |
| --- | --- |
| Base model | facebook/nllb-200-distilled-600M |
| Adapter | LoRA r=64, alpha=128, dropout=0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, out_proj, fc1, fc2 (attention + FFN) |
| Trainable params | 34,603,008 (5.33% of all parameters, adapter included) |
| Training direction | KAL → SW (luo_Latn → swh_Latn) |
| Warm-start | Phase 1d adapter (SW → KAL on same data, chrF++ 58.79) |
| Training data | thinkKenya/kenyan-low-resource-language-data, kln_swa split, ~28k pairs |
| Eval data | Same source, test split, 250-row sample with random_state=42 |
| Learning rate / scheduler | 1e-4 / cosine, warmup_ratio=0.05 |
| Per-device batch size / grad accum | 8 / 1 |
| Max seq length | 256 |
| Epochs trained | 42 (peak at epoch 32) |
| Optimizer | AdamW (default), fp16 |
| Seed | 42 |
| Compute | A10G, 6h 9min wall-clock, ~$6.80 |
| Inference | num_beams=5, length_penalty=1.1 |

Recipe inherits from Phase 1d's mutaician-equivalent setup. The novel pieces here are: direction inversion (KAL → SW instead of SW → KAL), warm-starting from Phase 1d's adapter (reuses the trained Kalenjin BPE embeddings), and tuning the inference length penalty.
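
The trainable-parameter count in the recipe can be reproduced by hand. LoRA adds two matrices to each adapted linear layer, A (r × d_in) and B (d_out × r), so each layer contributes r · (d_in + d_out) parameters. The sketch below assumes the usual shape of the 600M NLLB checkpoint (hidden size 1024, feed-forward width 4096, 12 encoder and 12 decoder layers); those dimensions are not stated in this card.

```python
R = 64              # LoRA rank, from the recipe
D_MODEL = 1024      # assumed hidden size of nllb-200-distilled-600M
D_FFN = 4096        # assumed feed-forward width
ENC_LAYERS = 12     # assumed encoder depth
DEC_LAYERS = 12     # assumed decoder depth


def lora_params(d_in: int, d_out: int, r: int = R) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# q_proj, k_proj, v_proj, out_proj are all d_model x d_model.
attn_block = 4 * lora_params(D_MODEL, D_MODEL)
# fc1 maps d_model -> d_ffn, fc2 maps d_ffn -> d_model.
ffn_block = lora_params(D_MODEL, D_FFN) + lora_params(D_FFN, D_MODEL)

# Encoder layers have self-attention only; decoder layers add cross-attention.
encoder = ENC_LAYERS * (attn_block + ffn_block)
decoder = DEC_LAYERS * (2 * attn_block + ffn_block)
total = encoder + decoder

# Denominator implied by the reported 5.33% share.
implied_denominator = total / 0.0533

print(f"LoRA trainable parameters: {total:,}")
print(f"implied denominator for 5.33%: {implied_denominator:,.0f}")
```

The total matches the 34,603,008 reported in the recipe. The implied denominator is roughly 649M, which would be consistent with a base of about 615M parameters plus the adapter itself.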

Training curve

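Evaluations at roughly 25 / 50 / 75 / 100 % of training, plus the penultimate epoch (same 250-row holdout, KAL → SW direction):

| Epoch | Step | chrF++ | BLEU | Δ chrF++ vs Phase 1d (58.79) |
| --- | --- | --- | --- | --- |
| 11 | 38,643 | 53.65 | 33.90 | -5.14 |
| 21 | 73,773 | 64.69 | 52.11 | +5.90 |
| 32 ⭐ | 112,416 | 68.55 | 57.58 | +9.76 (chrF++ peak) |
| 41 | 144,033 | 68.51 | 57.69 | +9.72 |
| 42 | 147,546 | 68.49 | 57.66 | +9.70 |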

The published checkpoint (r001-ep32) is the epoch-32 peak. Epochs 32, 41, and 42 are all within 0.06 chrF++ of each other, so the model has plateaued. Future runs in this family should train to ~32 epochs for the same quality at about 24% less compute (32 of 42 epochs).
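
The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below copies the step counts and chrF++ values from the training curve and assumes a single GPU with the per-device batch size of 8 and no gradient accumulation, as listed in the recipe.

```python
PHASE_1D_CHRF = 58.79  # SW -> KAL baseline quoted in this card
PER_DEVICE_BATCH = 8   # from the recipe; single GPU, grad accum 1 assumed

# (epoch, optimizer step, chrF++) copied from the training curve
rows = [
    (11, 38_643, 53.65),
    (21, 73_773, 64.69),
    (32, 112_416, 68.55),
    (41, 144_033, 68.51),
    (42, 147_546, 68.49),
]

# Every row should imply the same number of optimizer steps per epoch.
steps_per_epoch = {step // epoch for epoch, step, _ in rows}

# Gain over the Phase 1d baseline at each evaluation point.
deltas = {epoch: round(chrf - PHASE_1D_CHRF, 2) for epoch, _, chrf in rows}

# Spread of chrF++ across the plateau (epoch 32 onward).
plateau = [chrf for epoch, _, chrf in rows if epoch >= 32]
spread = round(max(plateau) - min(plateau), 2)

# Fraction of the 42-epoch budget saved by stopping at epoch 32.
saved = 1 - 32 / 42

print("steps per epoch:", steps_per_epoch)
print("pairs per epoch (upper bound):", max(steps_per_epoch) * PER_DEVICE_BATCH)
print("delta vs Phase 1d:", deltas)
print("plateau spread:", spread)
print(f"compute saved by stopping at epoch 32: {saved:.1%}")
```

All five rows imply 3,513 steps per epoch, or at most 28,104 pairs per epoch at batch size 8, which lines up with the ~28k training pairs in the recipe.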

Limitations

  • Stylistic shifts in 30% of outputs: the model occasionally embellishes (adding Yeye, sana, etc.) or shifts mood (declarative → interrogative). A length-penalty sweep confirmed this is learned behaviour, not a decoding artifact; future runs will explore RL with a chrF++ reward or filtered training data.
  • Rare-vocabulary misses: words like chigoni (kitchen) and number/multiplier compounds like taman ... taman occasionally produce off-meaning translations.
  • Dialect coverage: training data is mainstream Kalenjin (Nandi + Kipsigis tagged); other sub-tribes (Tugen, Marakwet, Sabaot, Keiyo, Pokot, Sengwer, Ogiek, Terik) have 0 rows in the corpus.
  • Domain coverage: thinkKenya leans toward everyday + religious + procedural text; legal, technical, or scientific Kalenjin is out-of-distribution.

License

CC-BY-NC-4.0, inheriting from the facebook/nllb-200-distilled-600M base model. Non-commercial use only. For commercial inquiries, contact iamtonykipkemboi@gmail.com.

Acknowledgments

  • thinkKenya for the kenyan-low-resource-language-data corpus
  • mutaician for publishing the NLLB+LoRA Western-Nilotic-hijack recipe that this run inverts and extends
  • Meta AI for the NLLB-200 base model
  • The Modal team for the training infrastructure

Citation

If you use this adapter in research, please cite:

@misc{chamgei_kal2sw_2026,
  author = {Kipkemboi, Tony},
  title = {chamgei-kal2sw-nllb600m: NLLB+LoRA for Kalenjin to Swahili Translation},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Tonykip/chamgei-kal2sw-nllb600m}}
}