mBART-50 fine-tuned for English ⇄ Krio translation

A bidirectional machine-translation model for Krio (the English-lexified creole spoken in Sierra Leone), fine-tuned from facebook/mbart-large-50-many-to-many-mmt. A single model translates English → Krio and Krio → English.

Krio is not one of mBART-50's 50 supported languages, so a dedicated language token kri_SL was added to the tokenizer (warm-started from the English embedding en_XX) before fine-tuning.

Model Details

Model Description

  • Developed by: Moses Joshua Coker
  • Model type: Sequence-to-sequence Transformer (mBART-50) for translation
  • Language(s) (NLP): English (en), Krio (kri)
  • License: MIT (inherited from the base model)
  • Finetuned from model: facebook/mbart-large-50-many-to-many-mmt

Uses

Direct Use

Translating short, everyday text between English and Krio — greetings, common phrases, basic conversational and informational sentences.

Downstream Use

A starting checkpoint for further fine-tuning on larger or domain-specific English–Krio parallel data, or for back-translation pipelines that generate synthetic data to expand Krio resources.
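
As a rough illustration of the back-translation use case, the sketch below pairs authentic Krio sentences with machine-generated English. The `translate_fn` argument is a hypothetical callable taking `(text, src_lang, tgt_lang)`; a stub stands in for the model here so the snippet runs on its own, and the `synthetic` flag is an illustrative convention rather than part of the dataset schema.

```python
from typing import Callable, Dict, Iterable, List

EN, KRI = "en_XX", "kri_SL"


def back_translate(
    krio_sentences: Iterable[str],
    translate_fn: Callable[[str, str, str], str],
) -> List[Dict[str, object]]:
    """Build synthetic (English, Krio) pairs from monolingual Krio text.

    The Krio side stays authentic; only the English side is machine-generated.
    """
    pairs: List[Dict[str, object]] = []
    seen = set()
    for krio in krio_sentences:
        krio = krio.strip()
        if not krio or krio in seen:
            continue  # skip blank lines and exact duplicates
        seen.add(krio)
        english = translate_fn(krio, KRI, EN).strip()
        if english:  # drop empty model outputs
            pairs.append({"English": english, "Krio": krio, "synthetic": True})
    return pairs


# Stub translator for demonstration only (assumption); swap in the real model call.
def fake_translate(text: str, src_lang: str, tgt_lang: str) -> str:
    return f"<{tgt_lang}> {text}"


synthetic_pairs = back_translate(["Aw yu de do?", "  ", "Aw yu de do?"], fake_translate)
print(synthetic_pairs)
```

Synthetic pairs produced this way inherit the model's errors, so filtering or spot-checking them before training is advisable.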

Out-of-Scope Use

Not suitable for high-stakes settings (medical, legal, safety-critical) without human review. Quality degrades on long, technical, or out-of-domain text, and on code-switched input. It does not translate languages other than English and Krio.

Bias, Risks, and Limitations

  • Small dataset (1,943 pairs, of which about 90% were used for training) of mostly short phrases and everyday vocabulary, so coverage is narrow and the model may be fluent-but-wrong on unfamiliar inputs.
  • Krio has no fully standardized orthography; the model reflects the spelling conventions of this dataset (including characters such as ɛ, ɔ) and may not match other written conventions.
  • Like all NMT models, it can hallucinate, omit content, or carry over social biases present in the training data.

Recommendations

Use human review for anything consequential, prefer short/simple inputs, and report chrF alongside BLEU since chrF is more reliable for low-resource, morphologically varied text.
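
One way to keep inputs short is to split longer text into sentences and translate them one at a time. The sketch below assumes a naive punctuation-based splitter and a hypothetical `translate_fn(text, src_lang, tgt_lang)` callable; neither is part of the released model, and the stub translator exists only so the snippet runs on its own.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_long(
    text: str,
    src_lang: str,
    tgt_lang: str,
    translate_fn: Callable[[str, str, str], str],
) -> str:
    """Translate sentence by sentence and rejoin the outputs with spaces."""
    return " ".join(
        translate_fn(sentence, src_lang, tgt_lang)
        for sentence in split_sentences(text)
    )


# Stub translator for demonstration only (assumption); it just upper-cases.
def stub_translate(text: str, src_lang: str, tgt_lang: str) -> str:
    return text.upper()


print(split_sentences("Good morning. How are you? I am fine!"))
print(translate_long("Hi there. Bye now.", "en_XX", "kri_SL", stub_translate))
```

Sentence-level translation loses cross-sentence context, so this trades some coherence for staying closer to the short inputs the model was trained on.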

How to Get Started with the Model

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

REPO = "MosesJoshuaCoker/mbart-large-50-krio"   # update to your repo id
EN, KRI = "en_XX", "kri_SL"

tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSeq2SeqLM.from_pretrained(REPO)

# kri_SL is a custom language token, so it is not in lang_code_to_id after a
# fresh load — register it (and pass the id directly to forced_bos_token_id).
if KRI not in tok.lang_code_to_id:
    tok.lang_code_to_id[KRI] = tok.convert_tokens_to_ids(KRI)

def translate(text, src_lang, tgt_lang, num_beams=5, max_new_tokens=96):
    tok.src_lang = src_lang
    enc = tok(text, return_tensors="pt")
    out = model.generate(
        **enc,
        forced_bos_token_id=tok.convert_tokens_to_ids(tgt_lang),
        num_beams=num_beams,
        max_new_tokens=max_new_tokens,
    )
    return tok.batch_decode(out, skip_special_tokens=True)[0]

print(translate("Good morning, how are you?", EN, KRI))   # English -> Krio
print(translate("Aw yu de do?", KRI, EN))                  # Krio -> English

Training Details

Training Data

MosesJoshuaCoker/krio_dataset_novax — 1,943 parallel English↔Krio pairs (columns English, Krio), mostly short phrases and everyday vocabulary.

Training Procedure

Preprocessing

  • Empty/whitespace-only rows filtered out.
  • Split at the pair level into train / validation / test (≈90% / 5% / 5%) so no pair leaks across splits.
  • Each training pair was used in both directions (en→kri and kri→en), so a single model is bidirectional. Direction is controlled at inference via forced_bos_token_id.
  • Added the kri_SL language token, resized embeddings, and warm-started it from en_XX. Any orthographic characters not representable by the SentencePiece vocab were added as tokens.
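
The filtering, pair-level split, and two-direction expansion described above can be sketched in plain Python. The seed, the rounding rule, and the output field names (`src`, `tgt`, `src_lang`, `tgt_lang`) are assumptions for illustration and are not taken from the training notebook; the toy rows are placeholders, not real Krio.

```python
import random
from typing import Dict, List, Tuple

Pair = Dict[str, str]


def split_pairs(
    pairs: List[Pair],
    seed: int = 42,          # assumed seed
    val_frac: float = 0.05,
    test_frac: float = 0.05,
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Drop empty rows, shuffle, and split at the pair level."""
    clean = [p for p in pairs if p["English"].strip() and p["Krio"].strip()]
    random.Random(seed).shuffle(clean)
    n_val = round(len(clean) * val_frac)
    n_test = round(len(clean) * test_frac)
    val = clean[:n_val]
    test = clean[n_val:n_val + n_test]
    train = clean[n_val + n_test:]
    return train, val, test


def to_bidirectional(pairs: List[Pair]) -> List[Dict[str, str]]:
    """Emit one en->kri and one kri->en example per pair."""
    examples = []
    for p in pairs:
        examples.append({"src_lang": "en_XX", "tgt_lang": "kri_SL",
                         "src": p["English"], "tgt": p["Krio"]})
        examples.append({"src_lang": "kri_SL", "tgt_lang": "en_XX",
                         "src": p["Krio"], "tgt": p["English"]})
    return examples


# Toy data: 100 placeholder pairs plus one whitespace-only row to be filtered.
toy = [{"English": f"english {i}", "Krio": f"krio {i}"} for i in range(100)]
toy.append({"English": "   ", "Krio": "x"})

train, val, test = split_pairs(toy)
train_examples = to_bidirectional(train)
print(len(train), len(val), len(test), len(train_examples))  # 90 5 5 180
```

Splitting before the expansion is what keeps both directions of a pair in the same split; expanding first and then shuffling could place the en→kri copy in train and the kri→en copy in test.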

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Base model: facebook/mbart-large-50-many-to-many-mmt (~610M params)
  • Optimizer: AdamW (fused), learning rate 3e-5, weight decay 0.01, warmup ratio 0.1
  • Epochs: 15, best checkpoint selected by validation loss
  • Batch: per-device 8 × gradient accumulation 4 = effective batch size 32
  • Max sequence length: 96 tokens
  • Other: gradient checkpointing enabled, label smoothing 0.0
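
For orientation, the hyperparameters above imply roughly the following number of optimizer steps. This is a back-of-the-envelope estimate: it assumes exactly 90% of the 1,943 pairs land in the training split and that each pair contributes two examples per epoch, so the real count from the training run may differ slightly depending on the actual split size and how partial batches are handled.

```python
import math

total_pairs = 1943
train_frac = 0.90            # approximate share of pairs used for training
per_device_batch = 8
grad_accum = 4
epochs = 15
warmup_ratio = 0.1

train_pairs = round(total_pairs * train_frac)                   # 1749
train_examples = train_pairs * 2                                # both directions: 3498
effective_batch = per_device_batch * grad_accum                 # 32 (single GPU)
steps_per_epoch = math.ceil(train_examples / effective_batch)   # 110
total_steps = steps_per_epoch * epochs                          # 1650
warmup_steps = round(total_steps * warmup_ratio)                # 165

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```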

Speeds, Sizes, Times

  • Hardware: single NVIDIA Tesla T4 (16 GB) on a Kaggle notebook.
  • Fine-tuning runs in roughly an hour for this dataset size.

Evaluation

Testing Data, Factors & Metrics

Testing Data

The held-out ~5% test split of MosesJoshuaCoker/krio_dataset_novax, evaluated in both translation directions.

Metrics

  • chrF (sacreBLEU) — primary metric; character-level F-score, well suited to low-resource and morphologically varied text.
  • BLEU (sacreBLEU) — reported for comparability.

Decoding: beam search (num_beams=5).
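
To show why a character-level score suits text with variable spelling, here is a simplified, sentence-level chrF-style function in plain Python. It is an illustrative sketch only: it is not sacreBLEU's implementation and will not reproduce its numbers exactly; the defaults (character n-grams up to order 6, β = 2, whitespace ignored) are chosen to resemble common chrF settings.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale: mean F-beta over n-gram orders."""
    f_scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # a string is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / hyp_total
        recall = overlap / ref_total
        if precision + recall == 0:
            f_scores.append(0.0)
        else:
            b2 = beta ** 2
            f_scores.append((1 + b2) * precision * recall / (b2 * precision + recall))
    return 100.0 * sum(f_scores) / len(f_scores) if f_scores else 0.0


# A one-character spelling difference still earns substantial partial credit,
# whereas a word-level metric would count the whole word as a miss.
print(simple_chrf("how ar you", "how are you"))
print(simple_chrf("how are you", "how are you"))
```

For reported results, the sacreBLEU implementation named above should remain the reference.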

Results

TODO: paste the numbers printed by the notebook's evaluation cell.

Direction        chrF   BLEU
English → Krio   TODO   TODO
Krio → English   TODO   TODO

Summary

A compact bidirectional EN⇄Krio model. Given the small training set, treat chrF as the headline metric and expect best results on short, in-domain inputs.

Environmental Impact

Carbon emissions can be estimated with the Machine Learning Impact calculator (Lacoste et al., 2019).

  • Hardware Type: NVIDIA Tesla T4 (16 GB)
  • Hours used: ~1
  • Cloud Provider: Kaggle (Google Cloud Platform)
  • Compute Region: Unknown

Technical Specifications

Model Architecture and Objective

mBART-50, a multilingual sequence-to-sequence Transformer (12-layer encoder / 12-layer decoder), trained with a token-level cross-entropy translation objective. Source and target languages are specified with language-code tokens (en_XX, kri_SL); the target language is forced at decode time via forced_bos_token_id.

Compute Infrastructure

  • Hardware: 1× NVIDIA Tesla T4 (16 GB)
  • Software: PyTorch, 🤗 Transformers, Datasets, and Evaluate (sacreBLEU)

Citation

If you use this model, please cite the base model and the training dataset (MosesJoshuaCoker/krio_dataset_novax).

mBART-50 (base model):

@article{tang2020multilingual,
  title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
  author={Tang, Yuqing and Tran, Chau and Li, Xian and Chen, Peng-Jen and Goyal, Naman and Chaudhary, Vishrav and Gu, Jiatao and Fan, Angela},
  journal={arXiv preprint arXiv:2008.00401},
  year={2020}
}

Model Card Authors

Moses Joshua Coker

Model Card Contact

Via the Hugging Face repository: https://huggingface.co/MosesJoshuaCoker
