mBART-50 fine-tuned for English ⇄ Krio translation

A bidirectional machine-translation model for Krio (the English-lexified creole spoken in Sierra Leone), fine-tuned from facebook/mbart-large-50-many-to-many-mmt. A single model translates English → Krio and Krio → English.

Krio is not one of mBART-50's 50 supported languages, so a dedicated language token kri_SL was added to the tokenizer (warm-started from the English embedding en_XX) before fine-tuning.
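The warm start described above amounts to appending one embedding row and copying an existing row into it. The following is a toy sketch in plain Python, assuming a hypothetical three-entry vocabulary and 4-dimensional embeddings with made-up values; the helper name add_warm_started_token is illustrative and is not taken from the training notebook. With Transformers, the same idea is typically carried out by adding the token to the tokenizer, calling model.resize_token_embeddings(len(tok)), and copying the en_XX row of the embedding matrix into the new row.

```python
def add_warm_started_token(vocab, embeddings, new_token, init_from):
    """Append new_token to vocab and initialise its embedding from init_from.

    Assumes token ids are contiguous, so the next free id is len(embeddings).
    """
    if new_token in vocab:
        return vocab[new_token]
    new_id = len(embeddings)
    vocab[new_token] = new_id
    # Copy the row so later updates to the new token do not alter the source row.
    embeddings.append(list(embeddings[vocab[init_from]]))
    return new_id


# Hypothetical toy vocabulary and embedding table (values are made up).
vocab = {"<s>": 0, "</s>": 1, "en_XX": 2}
embeddings = [
    [0.0, 0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6, 0.7],
    [0.8, 0.9, 1.0, 1.1],
]

kri_id = add_warm_started_token(vocab, embeddings, "kri_SL", init_from="en_XX")
print(kri_id, embeddings[kri_id])
```

Starting kri_SL from the English row, rather than from random values, gives the new language token a sensible initial position given that Krio is English-lexified.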

Model Details

Model Description

Developed by: Moses Joshua Coker
Model type: Sequence-to-sequence Transformer (mBART-50) for translation
Language(s) (NLP): English (en), Krio (kri)
License: MIT (inherited from the base model)
Finetuned from model: facebook/mbart-large-50-many-to-many-mmt

Model Sources

Repository: https://huggingface.co/MosesJoshuaCoker/mbart-large-50-krio
Base model: https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt
Training data: https://huggingface.co/datasets/MosesJoshuaCoker/krio_dataset_novax

Uses

Direct Use

Translating short, everyday text between English and Krio — greetings, common phrases, basic conversational and informational sentences.

Downstream Use

A starting checkpoint for further fine-tuning on larger or domain-specific English–Krio parallel data, or for back-translation pipelines that generate synthetic data to expand Krio resources.

Out-of-Scope Use

Not suitable for high-stakes settings (medical, legal, safety-critical) without human review. Quality degrades on long, technical, or out-of-domain text, and on code-switched input. It does not translate languages other than English and Krio.

Bias, Risks, and Limitations

Small training set (1,943 pairs) of mostly short phrases and everyday vocabulary, so coverage is narrow and the model may produce fluent but incorrect output on unfamiliar inputs.
Krio has no fully standardized orthography; the model reflects the spelling conventions of this dataset (including characters such as ɛ, ɔ) and may not match other written conventions.
Like all NMT models, it can hallucinate, omit content, or carry over social biases present in the training data.

Recommendations

Use human review for anything consequential, prefer short/simple inputs, and report chrF alongside BLEU since chrF is more reliable for low-resource, morphologically varied text.

How to Get Started with the Model

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

REPO = "MosesJoshuaCoker/mbart-large-50-krio"   # update to your repo id
EN, KRI = "en_XX", "kri_SL"

tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSeq2SeqLM.from_pretrained(REPO)

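# kri_SL is a custom language token, so it may be missing from lang_code_to_id
# after a fresh load. Register it when that mapping exists; getattr guards
# against tokenizer versions that may not expose it. The id is also passed
# directly to forced_bos_token_id below.
lang_map = getattr(tok, "lang_code_to_id", None)
if lang_map is not None and KRI not in lang_map:
    lang_map[KRI] = tok.convert_tokens_to_ids(KRI)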

def translate(text, src_lang, tgt_lang, num_beams=5, max_new_tokens=96):
    tok.src_lang = src_lang
    enc = tok(text, return_tensors="pt")
    out = model.generate(
        **enc,
        forced_bos_token_id=tok.convert_tokens_to_ids(tgt_lang),
        num_beams=num_beams,
        max_new_tokens=max_new_tokens,
    )
    return tok.batch_decode(out, skip_special_tokens=True)[0]

print(translate("Good morning, how are you?", EN, KRI))   # English -> Krio
print(translate("Aw yu de do?", KRI, EN))                  # Krio -> English

Training Details

Training Data

MosesJoshuaCoker/krio_dataset_novax — 1,943 parallel English↔Krio pairs (columns English, Krio), mostly short phrases and everyday vocabulary.

Training Procedure

Preprocessing

Empty/whitespace-only rows filtered out.
Split at the pair level into train / validation / test (≈90% / 5% / 5%) so no pair leaks across splits.
Each training pair was used in both directions (en→kri and kri→en), so a single model is bidirectional. Direction is controlled at inference via forced_bos_token_id.
Added the kri_SL language token, resized embeddings, and warm-started it from en_XX. Any orthographic characters not representable by the SentencePiece vocab were added as tokens.
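The filtering, pair-level split, and bidirectional expansion above can be sketched as follows. This is a minimal illustration in plain Python, not the notebook's actual code: the function names, the seed, and the placeholder rows are assumptions. The point it shows is the ordering: pairs are split first and only then duplicated into both directions, so the reverse of a test pair can never end up in the training set.

```python
import random


def clean_pairs(rows):
    """Drop rows where either side is missing, empty, or whitespace-only."""
    return [
        (en.strip(), kri.strip())
        for en, kri in rows
        if en and kri and en.strip() and kri.strip()
    ]


def split_pairs(pairs, val_frac=0.05, test_frac=0.05, seed=42):
    """Shuffle whole pairs, then cut into train / validation / test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_val = int(len(pairs) * val_frac)
    n_test = int(len(pairs) * test_frac)
    val = pairs[:n_val]
    test = pairs[n_val:n_val + n_test]
    train = pairs[n_val + n_test:]
    return train, val, test


def to_bidirectional(pairs):
    """Emit every pair twice: once as en -> kri and once as kri -> en."""
    examples = []
    for en, kri in pairs:
        examples.append({"src_lang": "en_XX", "tgt_lang": "kri_SL", "src": en, "tgt": kri})
        examples.append({"src_lang": "kri_SL", "tgt_lang": "en_XX", "src": kri, "tgt": en})
    return examples


# Placeholder rows standing in for the (English, Krio) columns of the dataset.
rows = [(f"english sentence {i}", f"krio sentence {i}") for i in range(100)]
rows += [("   ", "orphan"), ("orphan", "")]  # filtered out by clean_pairs

train, val, test = split_pairs(clean_pairs(rows))
train_examples = to_bidirectional(train)
print(len(train), len(val), len(test), len(train_examples))
```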

Training Hyperparameters

Training regime: fp16 mixed precision
Base model: facebook/mbart-large-50-many-to-many-mmt (~610M params)
Optimizer: AdamW (fused), learning rate 3e-5, weight decay 0.01, warmup ratio 0.1
Epochs: 15, best checkpoint selected by validation loss
Batch: per-device 8 × gradient accumulation 4 = effective batch size 32
Max sequence length: 96 tokens
Other: gradient checkpointing enabled, label smoothing 0.0
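For a sense of scale, the hyperparameters above imply a fairly short optimisation schedule. The sketch below is back-of-the-envelope arithmetic only, assuming exactly 90% of the 1,943 pairs land in the training split, both directions are used, and training runs on a single device; the function name schedule_estimate is illustrative, and the step counts reported by the actual trainer may differ slightly because of rounding.

```python
import math


def schedule_estimate(n_pairs, train_pct, per_device_batch, grad_accum,
                      epochs, warmup_pct, directions=2):
    """Estimate optimizer steps from dataset size and batch settings."""
    train_pairs = n_pairs * train_pct // 100          # pairs in the train split
    examples = train_pairs * directions               # each pair used both ways
    effective_batch = per_device_batch * grad_accum   # single-device assumption
    steps_per_epoch = math.ceil(examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = total_steps * warmup_pct // 100
    return {
        "train_examples": examples,
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


est = schedule_estimate(
    n_pairs=1943, train_pct=90, per_device_batch=8, grad_accum=4,
    epochs=15, warmup_pct=10,
)
print(est)
```

Under these assumptions the run is about 110 optimizer steps per epoch and about 1,650 steps in total, of which roughly 165 are warmup.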

Speeds, Sizes, Times

Hardware: single NVIDIA Tesla T4 (16 GB) on a Kaggle notebook.
Fine-tuning runs in roughly an hour for this dataset size.

Evaluation

Testing Data, Factors & Metrics

Testing Data

The held-out ~5% test split of MosesJoshuaCoker/krio_dataset_novax, evaluated in both translation directions.

Metrics

chrF (sacreBLEU) — primary metric; character-level F-score, well suited to low-resource and morphologically varied text.
BLEU (sacreBLEU) — reported for comparability.
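To make the character-level F-score concrete, here is a simplified sketch of the chrF idea: character n-gram precision and recall (n = 1 to 6, whitespace ignored) are averaged over n and combined into an F-score with beta = 2, which weights recall more heavily. This is an illustration only, and the name simple_chrf is an assumption; sacreBLEU's implementation handles further details and options (such as word n-grams for chrF++), so reported numbers should come from sacreBLEU itself.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        hyp_total = sum(hyp.values())
        ref_total = sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # this order does not exist for one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf("Aw yu de do?", "Aw yu de do?"))  # identical strings
print(simple_chrf("Aw yu de", "Aw yu de do?"))      # truncated hypothesis
```

Because matching happens on characters rather than whole words, a spelling variant still earns partial credit, which matters for a language without a fully standardized orthography.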

Decoding: beam search (num_beams=5).

Results

TODO: paste the numbers printed by the notebook's evaluation cell.

Direction	chrF	BLEU
English → Krio	TODO	TODO
Krio → English	TODO	TODO

Summary

A compact bidirectional EN⇄Krio model. Given the small training set, treat chrF as the headline metric and expect best results on short, in-domain inputs.

Environmental Impact

Carbon emissions can be estimated with the Machine Learning Impact calculator (Lacoste et al., 2019).

Hardware Type: NVIDIA Tesla T4 (16 GB)
Hours used: ~1
Cloud Provider: Kaggle (Google Cloud Platform)
Compute Region: Unknown
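A rough estimate follows the usual form of energy multiplied by grid carbon intensity. The sketch below assumes the T4's 70 W board power as an upper bound on draw and ignores datacenter overhead; because the compute region is unknown, the carbon intensity used here (0.4 kg CO2eq per kWh) is a purely hypothetical placeholder and the result should not be read as a measured figure. The function name estimate_emissions is illustrative.

```python
def estimate_emissions(gpu_watts, hours, kg_co2_per_kwh):
    """Return (energy in kWh, emissions in kg CO2eq) for one device."""
    energy_kwh = gpu_watts / 1000 * hours
    return energy_kwh, energy_kwh * kg_co2_per_kwh


# 70 W is the T4's rated board power; ~1 hour is taken from the card above.
# The intensity value is a hypothetical placeholder (region unknown).
energy_kwh, kg_co2 = estimate_emissions(gpu_watts=70, hours=1.0, kg_co2_per_kwh=0.4)
print(f"{energy_kwh:.3f} kWh, {kg_co2:.3f} kg CO2eq")
```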

Technical Specifications

Model Architecture and Objective

mBART-50, a multilingual sequence-to-sequence Transformer (12-layer encoder / 12-layer decoder), trained with a token-level cross-entropy translation objective. Source and target languages are specified with language-code tokens (en_XX, kri_SL); the target language is forced at decode time via forced_bos_token_id.
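The role of the language-code tokens can be illustrated with the sequence layout mBART-50 uses, in which the language code is a prefix and the end-of-sequence token a suffix for both source and target text. The sketch below is a toy: whitespace-split words stand in for SentencePiece pieces, and the helper name mbart50_sequence is an assumption.

```python
def mbart50_sequence(tokens, lang_code, eos="</s>"):
    """Lay out tokens in mBART-50 format: [lang_code] tokens [eos]."""
    return [lang_code, *tokens, eos]


# Krio source sentence from the usage example, split on whitespace for illustration.
source = mbart50_sequence("Aw yu de do?".split(), "kri_SL")
print(source)

# At decode time, forced_bos_token_id makes the first generated token the
# target language code, so an English output sequence begins with en_XX.
target_prefix = mbart50_sequence([], "en_XX")[:1]
print(target_prefix)
```

Since the same weights serve both directions, this forced first token is what selects whether the decoder writes English or Krio.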

Compute Infrastructure

Hardware: 1× NVIDIA Tesla T4 (16 GB)
Software: PyTorch, 🤗 Transformers, Datasets, and Evaluate (sacreBLEU)

Citation

If you use this model, please cite the base model and dataset.

mBART-50 (base model):

@article{tang2020multilingual,
  title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
  author={Tang, Yuqing and Tran, Chau and Li, Xian and Chen, Peng-Jen and Goyal, Naman and Chaudhary, Vishrav and Gu, Jiatao and Fan, Angela},
  journal={arXiv preprint arXiv:2008.00401},
  year={2020}
}

Model Card Authors

Moses Joshua Coker

Model Card Contact

Via the Hugging Face repository: https://huggingface.co/MosesJoshuaCoker
