MooreFR-SaChi Translation Model
⚠️ WARNING: This is only a template for researchers and developers interested in advancing work on languages that are not well represented in large language models. For more comprehensive approaches, please consult the work of David Dale or explore my GitHub repository for further insights and methodologies.
Model Details
This model is a fine-tuned version of nllb-200-distilled-600M, specialized for French-Moore (Mossi) translation. It has been trained to handle translations between French (fr_Latn) and Moore (moor_Latn), with particularly strong performance in the French-to-Moore direction.
- Base Model: facebook/nllb-200-distilled-600M
- Languages: French (fr_Latn) ↔ Moore (moor_Latn)
- Training Dataset: MooreFRCollections
- Performance:
  - BLEU Score: 39.1 (direction: fra_Latn → moor_Latn) (needs improvement)
  - Training Loss: 1.01 (validation set of 1,000 examples) (needs improvement)
- Training Time: 2 hours on a T4 GPU (3 epochs)
Usage
Here's how to use this model for translation:
from transformers import NllbTokenizer, AutoModelForSeq2SeqLM
# Load model and tokenizer
MODEL_URL = "sawadogosalif/MooreFR-SaChi-translationv0"
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL)
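# Optional: move the model to a GPU if one is available (assumes a PyTorch install;
# the translate() helpers below send their inputs to model.device automatically)
import torch
model = model.to("cuda" if torch.cuda.is_available() else "cpu")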
# Fix tokenizer for Moore language
def fix_tokenizer(tokenizer, new_lang):
    """
    Adds a new language token to the tokenizer and updates ID mappings.
    - Adds the special token if it doesn't already exist
    - Initializes or updates `lang_code_to_id` and `id_to_lang_code` using `getattr` to avoid repeated checks
    """
    if new_lang not in tokenizer.additional_special_tokens:
        tokenizer.add_special_tokens({'additional_special_tokens': [new_lang]})
    tokenizer.lang_code_to_id = getattr(tokenizer, 'lang_code_to_id', {})
    tokenizer.id_to_lang_code = getattr(tokenizer, 'id_to_lang_code', {})
    if new_lang not in tokenizer.lang_code_to_id:
        new_lang_id = tokenizer.convert_tokens_to_ids(new_lang)
        tokenizer.lang_code_to_id[new_lang] = new_lang_id
        tokenizer.id_to_lang_code[new_lang_id] = new_lang
    return tokenizer
# Initialize tokenizer with Moore language
fix_tokenizer(tokenizer, 'moor_Latn')
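# Optional sanity check (assumption: the published checkpoint already contains the extended
# vocabulary, so 'moor_Latn' should resolve to a real token id rather than the unknown token)
print(tokenizer.convert_tokens_to_ids('moor_Latn'), tokenizer.lang_code_to_id['moor_Latn'])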
# Translation function
def translate(text, src_lang='fr_Latn', tgt_lang='moor_Latn', a=32, b=3, max_input_length=1024, num_beams=4, **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # length budget: `a` tokens plus `b` tokens per source token
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)
# Example usage
french_text = "Je suis né à Ouagadougou. J'ai demenagé à Banfora pour mes etudes"
moore_translation = translate(french_text, 'fr_Latn', 'moor_Latn')
print(moore_translation)
# Expected output: ['Mam doga Ouadagoou. Mam kẽnga Banfora m sẽn na yɩl n tɩ karem be.']
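Because the tokenizer is called with padding=True, the same translate function can also take a list of sentences and translate them in one batch. A minimal sketch (the French sentences below are only illustrative, not taken from the training data):

french_sentences = [
    "Bonjour, comment allez-vous ?",
    "Le marché ouvre demain matin.",
]
moore_batch = translate(french_sentences, 'fr_Latn', 'moor_Latn')
for src, tgt in zip(french_sentences, moore_batch):
    print(src, '->', tgt)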
Alternative Translation Function
For more flexibility, you can use this enhanced translation function:
def translate_v2(text, model, tokenizer, src_lang='fr_Latn', tgt_lang='moor_Latn',
                 max_length='auto', num_beams=4, no_repeat_ngram_size=4, n_out=None, **kwargs):
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    if max_length == 'auto':
        max_length = int(32 + 2.0 * encoded.input_ids.shape[1])
    model.eval()
    generated_tokens = model.generate(
        **encoded.to(model.device),
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
        max_length=max_length,
        num_beams=num_beams,
        no_repeat_ngram_size=no_repeat_ngram_size,
        num_return_sequences=n_out or 1,
        **kwargs
    )
    out = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    if isinstance(text, str) and n_out is None:
        return out[0]
    return out
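For example, translate_v2 can return several candidate translations for one sentence via n_out (which must not exceed num_beams). The input sentence below is only illustrative:

candidates = translate_v2("Merci beaucoup pour votre aide.", model, tokenizer, n_out=3)
for i, candidate in enumerate(candidates, 1):
    print(i, candidate)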
Training
This model was trained using the SaChi training framework with the following parameters:
model:
  name: "facebook/nllb-200-distilled-600M"
  save_path: "./models/nllb-moore-finetuned"
  new_lang_code: "moore_open"

training:
  batch_size: 16
  num_epochs: 3
  learning_rate: 1e-4
  warmup_steps: 1000
  max_length: 128
  accumulation_steps: 1
  eval_steps: 1000
  save_steps: 5000
  early_stopping_patience: 5
  fp16: true
  resume_from: null
  max_grad_norm: 1.0

data:
  dataset_name: "sawadogosalif/MooreFRCollections"
  train_size: 0.8
  test_size: 0.1
  val_size: 0.1
  random_seed: 2025
  src_col: "source"
  tgt_col: "target"
  src_lang_col: "french"
  tgt_lang_col: "moore"

evaluation:
  num_samples: 10
  num_beams: 5
  no_repeat_ngram_size: 3
The training was completed in approximately 2 hours on a T4 GPU for 3 epochs.
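The data section above describes an 80/10/10 split with a fixed seed. A minimal sketch of how such a split could be reproduced with the Hugging Face datasets library (this is not the SaChi training code itself, and it assumes the dataset exposes a single train split):

from datasets import load_dataset

dataset = load_dataset("sawadogosalif/MooreFRCollections", split="train")
# Carve off 20% as held-out data, then split it half/half into validation and test
splits = dataset.train_test_split(test_size=0.2, seed=2025)
holdout = splits["test"].train_test_split(test_size=0.5, seed=2025)
train_ds, val_ds, test_ds = splits["train"], holdout["train"], holdout["test"]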
Dataset
This model was trained on the MooreFRCollections dataset, which contains parallel texts in French and Moore languages.
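The BLEU score reported above can in principle be checked against a held-out slice of this dataset. A hedged sketch using sacrebleu and the translate function from the Usage section (the sample size and the column names are assumptions based on the training config, not documented evaluation settings):

import sacrebleu
from datasets import load_dataset

ds = load_dataset("sawadogosalif/MooreFRCollections", split="train")
sample = ds.shuffle(seed=2025).select(range(100))  # small illustrative sample

hypotheses = [translate(row["source"], 'fr_Latn', 'moor_Latn')[0] for row in sample]
references = [row["target"] for row in sample]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")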
Limitations
- The model performs best with standard French input text.
- Performance may vary with highly technical, specialized, or colloquial language.
- The model may not handle certain Moore dialectal variations perfectly.
Source Code
The training code is available in the SaChi repository.
Citation
@misc{MooreFR-SaChi-translationv0,
  author = {Sawadogo, Salif},
  title = {MooreFR-SaChi-translationv0},
  year = {202},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/sawadogosalif/MooreFR-SaChi-translationv0}}
}