cameroon-int8

int8 CTranslate2 serving bundle for ~60 Cameroonian languages (French-pivot MarianMT).

This repository is the quantized (int8) CTranslate2 serving bundle that powers translation for roughly 60 Cameroonian languages. Every model is a MarianMT translation model converted to the CTranslate2 format and quantized to int8.

All language pairs are French-pivot: each model translates either French -> local language (francais-<lang>) or local language -> French (<lang>-francais). To translate between two local languages, chain two models through French: <source>-francais first, then francais-<target>.
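The pivot rule can be sketched as a small helper that works out which pair subfolders a translation needs. This is only an illustration: pivot_pairs and translate_via_pivot are hypothetical names, not part of this repository, and the sketch assumes every language follows the lowercase <lang>-francais / francais-<lang> folder naming shown in this README.

```python
def pivot_pairs(src_lang: str, tgt_lang: str, pivot: str = "francais") -> list[str]:
    """Return the pair subfolder names needed to go from src_lang to tgt_lang.

    Assumes folder names follow the "<source>-<target>" convention with
    "francais" as the pivot language (hypothetical helper, not repo code).
    """
    if src_lang == tgt_lang:
        return []  # nothing to translate
    if src_lang == pivot or tgt_lang == pivot:
        return [f"{src_lang}-{tgt_lang}"]  # a single direct model is enough
    # local -> French, then French -> local
    return [f"{src_lang}-{pivot}", f"{pivot}-{tgt_lang}"]


def translate_via_pivot(text: str, steps) -> str:
    """Apply each single-direction translate function in order."""
    for step in steps:
        text = step(text)
    return text
```

For example, pivot_pairs("ewondo", "bulu") gives ["ewondo-francais", "francais-bulu"], assuming both folders exist in the bundle. Errors from the first hop carry into the second, so pivoted output is likely to be noisier than a single direct translation.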

Compared to the original fp32 PyTorch checkpoints, this int8 bundle is roughly 3.8x smaller on disk and runs about 6x faster at inference, which makes it practical to serve many languages from modest hardware.

Repository layout

Each subfolder holds exactly one translation direction (one pair) and contains the full CTranslate2 model plus its tokenizer files:

cameroon-int8/
├── aghem-francais/
│   ├── model.bin
│   ├── config.json
│   └── (tokenizer files)
├── francais-aghem/
│   ├── model.bin
│   ├── config.json
│   └── (tokenizer files)
├── ...
└── yemba-francais/

There are 119 such pair subfolders.
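Because each pair lives in its own top-level folder, the available directions can be discovered from the repository file listing. The sketch below is one possible way to do that. pairs_from_files and list_remote_pairs are hypothetical helper names, and the code assumes every pair folder name contains "francais" as one of its hyphen-separated parts.

```python
def pairs_from_files(files):
    """Extract pair folder names (e.g. "aghem-francais") from repo file paths."""
    pairs = set()
    for path in files:
        folder, sep, _ = path.partition("/")
        # Keep only top-level folders that look like "<lang>-francais"
        # or "francais-<lang>"; root files such as README.md have no "/".
        if sep and "francais" in folder.split("-"):
            pairs.add(folder)
    return sorted(pairs)


def list_remote_pairs(repo_id="flagship-ai/cameroon-int8"):
    """List pair folders on the Hub (needs network access and huggingface_hub)."""
    from huggingface_hub import list_repo_files

    return pairs_from_files(list_repo_files(repo_id))
```

Calling list_remote_pairs() before downloading lets a client check that a direction exists rather than guessing the folder name.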

Usage

Install dependencies:

pip install ctranslate2 transformers huggingface_hub sentencepiece

Download a single pair and translate with ctranslate2.Translator + transformers.MarianTokenizer:

from huggingface_hub import snapshot_download
import ctranslate2
from transformers import MarianTokenizer

pair = "francais-ewondo"  # French -> Ewondo

# Download just the one pair subfolder
local_dir = snapshot_download(
    repo_id="flagship-ai/cameroon-int8",
    allow_patterns=[f"{pair}/*"],
)
model_path = f"{local_dir}/{pair}"

tokenizer = MarianTokenizer.from_pretrained(model_path)
translator = ctranslate2.Translator(model_path, device="cpu")  # or device="cuda"

text = "Bonjour, comment allez-vous ?"
source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
results = translator.translate_batch([source])
target = results[0].hypotheses[0]

output = tokenizer.decode(
    tokenizer.convert_tokens_to_ids(target),
    skip_special_tokens=True,
)
print(output)
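
To translate several sentences in one call, the same steps can be wrapped in a helper. This is a minimal sketch, not an official API of this repository: translate_sentences is a hypothetical name, the tokenizer and translator arguments are the objects created above, and beam_size=4 is an arbitrary illustrative choice.

```python
def translate_sentences(texts, tokenizer, translator, beam_size=4):
    """Translate a list of strings with one CTranslate2 batch call."""
    # CTranslate2 expects token strings, not ids, so encode each sentence
    # and map the ids back to SentencePiece tokens.
    batch = [tokenizer.convert_ids_to_tokens(tokenizer.encode(t)) for t in texts]
    results = translator.translate_batch(batch, beam_size=beam_size)
    # Keep the best hypothesis for each input and detokenize it.
    return [
        tokenizer.decode(
            tokenizer.convert_tokens_to_ids(r.hypotheses[0]),
            skip_special_tokens=True,
        )
        for r in results
    ]
```

Usage with the objects from the example above: translate_sentences(["Bonjour.", "Merci beaucoup."], tokenizer, translator).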

Languages

The bundle covers directions to and from French for languages including: Aghem, Awing, Babanki, Bafia, Bakoko, Bakweri, Bidwee, Bulu, Bum, Cuvok, Denya, Dii, Doyayo, Ejagham, English, Esimbi, Ewondo, Fufulde, Gbaya, Ghomala, Guidar, Guiziga, Isu, Kapsiki, Kenyang, Koonzime, Lamnso, Limbum, Mankon, Massana, Mbembe, Medumba, Meta, Mmen, Mofa, Mofu, Moghamo, Mpumpong, Mundani, Ngi, Ngienboum, Ngomba, Ngombale, Ngwo, Nomaande, Nugunu, Oku, Pana, Peere, Pinyin, Punu, Samba, Tunen, Tupuri, Vute, Weh, Yambeta, Yemba, and more.

License

Released under CC BY-NC 4.0. Intended for research and non-commercial use supporting Cameroonian language technology.
