Model Card for MooreFR-SaChi-translationv0

MooreFR-SaChi Translation Model

⚠️ WARNING: This is only a template for researchers and developers interested in advancing work on languages that are under-represented in large language models. For more comprehensive approaches, please consult the work of David Dale or explore my GitHub repository for further insights and methodologies.

Model Details

This model is a fine-tuned version of nllb-200-distilled-600M specialized for French-Moore (Mossi) translation. It has been trained to handle translations between French (fr_Latn) and Moore (moor_Latn), with particularly strong performance in the French-to-Moore direction.

  • Base Model: facebook/nllb-200-distilled-600M
  • Languages: French (fr_Latn) ↔ Moore (moor_Latn)
  • Training Dataset: MooreFRCollections
  • Performance:
    • BLEU Score: 39.1 (direction: fra_Latn → moor_Latn); needs improvement (a scoring sketch follows this list)
    • Training Loss: 1.01 on a validation set of 1,000 examples; needs improvement
  • Training Time: approximately 2 hours on a T4 GPU (3 epochs)
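
The sketch below shows how a corpus-level BLEU score of this kind could be computed with sacrebleu. It is only an illustration: the hypotheses and references are placeholders, not the held-out data behind the reported 39.1.

import sacrebleu

# Placeholders: replace with model outputs and gold Moore translations for your held-out set.
hypotheses = ["Mam doga Ouagadougou."]
references = [["Mam doga Ouagadougou."]]  # one list of reference sentences per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")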

Usage

Here's how to use this model for translation:

from transformers import NllbTokenizer, AutoModelForSeq2SeqLM
# Load model and tokenizer
MODEL_URL = "sawadogosalif/MooreFR-SaChi-translationv0"
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL)
# Fix tokenizer for Moore language
def fix_tokenizer(tokenizer, new_lang):
    """
    Adds a new language token to the tokenizer and updates ID mappings.
    
    - Adds the special token if it doesn't already exist
    - Initializes or updates `lang_code_to_id` and `id_to_lang_code` using `getattr` to avoid repeated checks
    """
    if new_lang not in tokenizer.additional_special_tokens:
        tokenizer.add_special_tokens({'additional_special_tokens': [new_lang]})
    
    tokenizer.lang_code_to_id = getattr(tokenizer, 'lang_code_to_id', {})
    tokenizer.id_to_lang_code = getattr(tokenizer, 'id_to_lang_code', {})
    
    if new_lang not in tokenizer.lang_code_to_id:
        new_lang_id = tokenizer.convert_tokens_to_ids(new_lang)
        tokenizer.lang_code_to_id[new_lang] = new_lang_id
        tokenizer.id_to_lang_code[new_lang_id] = new_lang
    
    return tokenizer
# Initialize tokenizer with Moore language
fix_tokenizer(tokenizer, 'moor_Latn')
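# Sanity check (optional): after fix_tokenizer, the new language code should map
# to a real token id rather than the unknown-token id.
print(tokenizer.convert_tokens_to_ids('moor_Latn'))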
# Translation function
def translate(text, src_lang='fr_Latn', tgt_lang='moor_Latn', a=32, b=3, max_input_length=1024, num_beams=4, **kwargs):
    # `a` and `b` bound the output length: max_new_tokens = a + b * (number of input tokens)
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)
# Example usage
french_text = "Je suis né à Ouagadougou. J'ai déménagé à Banfora pour mes études"
moore_translation = translate(french_text, 'fr_Latn', 'moor_Latn')
print(moore_translation)
# Expected output: ['Mam doga Ouadagoou. Mam kẽnga Banfora m sẽn na yɩl n tɩ karem be.']
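
The same function works in the opposite direction. The snippet below is a sketch: the Moore input is illustrative and the resulting French output is not a verified model output.

# Reverse direction: Moore → French (illustrative input)
moore_text = "Mam doga Ouagadougou."
french_back = translate(moore_text, src_lang='moor_Latn', tgt_lang='fr_Latn')
print(french_back)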

Alternative Translation Function

For more flexibility, you can use this enhanced translation function:

def translate_v2(text, model, tokenizer, src_lang='fr_Latn', tgt_lang='moor_Latn',
               max_length='auto', num_beams=4, no_repeat_ngram_size=4, n_out=None, **kwargs):
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    if max_length == 'auto':
        max_length = int(32 + 2.0 * encoded.input_ids.shape[1])
    model.eval()
    generated_tokens = model.generate(
        **encoded.to(model.device),
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
        max_length=max_length,
        num_beams=num_beams,
        no_repeat_ngram_size=no_repeat_ngram_size,
        num_return_sequences=n_out or 1,
        **kwargs
    )
    out = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    if isinstance(text, str) and n_out is None:
        return out[0]
    return out
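
A minimal call, assuming the model and tokenizer loaded above (with fix_tokenizer already applied), could look like this; the n_out option returns several beam candidates as a list:

# Single string in, single string out
print(translate_v2("Bonjour, comment allez-vous ?", model, tokenizer))

# Several beam candidates for the same input (returned as a list)
for candidate in translate_v2("Bonjour, comment allez-vous ?", model, tokenizer, n_out=3):
    print(candidate)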

Training

This model was trained using the SaChi training framework with the following parameters:

model:
  name: "facebook/nllb-200-distilled-600M"
  save_path: "./models/nllb-moore-finetuned"
  new_lang_code: "moore_open"
training:
  batch_size: 16
  num_epochs: 3
  learning_rate: 1e-4
  warmup_steps: 1000
  max_length: 128
  accumulation_steps: 1
  eval_steps: 1000
  save_steps: 5000
  early_stopping_patience: 5
  fp16: true
  resume_from: null
  max_grad_norm: 1.0
data:
  dataset_name: "sawadogosalif/MooreFRCollections"
  train_size: 0.8
  test_size: 0.1
  val_size: 0.1
  random_seed: 2025
  src_col: "source"
  tgt_col: "target"
  src_lang_col: "french"
  tgt_lang_col: "moore"
evaluation:
  num_samples: 10
  num_beams: 5
  no_repeat_ngram_size: 3

The training was completed in approximately 2 hours on a T4 GPU for 3 epochs.
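
The training loop itself lives in the SaChi framework and is not reproduced here. Purely as a hedged sketch, the hyperparameters above could be mapped onto standard Hugging Face Seq2SeqTrainingArguments roughly as follows; this mapping is an assumption, not the actual SaChi code:

from transformers import Seq2SeqTrainingArguments

# Rough mapping of the YAML values above onto Hugging Face training arguments (assumption).
training_args = Seq2SeqTrainingArguments(
    output_dir="./models/nllb-moore-finetuned",  # save_path
    per_device_train_batch_size=16,              # batch_size
    num_train_epochs=3,                          # num_epochs
    learning_rate=1e-4,                          # learning_rate
    warmup_steps=1000,                           # warmup_steps
    gradient_accumulation_steps=1,               # accumulation_steps
    eval_steps=1000,                             # eval_steps
    save_steps=5000,                             # save_steps
    fp16=True,                                   # fp16
    max_grad_norm=1.0,                           # max_grad_norm
)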

Dataset

This model was trained on the MooreFRCollections dataset, which contains parallel texts in French and Moore languages.
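
To inspect the data yourself, a minimal sketch with the datasets library is shown below. The split name and the column names ("source" for French, "target" for Moore) follow the training config above; verify them against the dataset card.

from datasets import load_dataset

# Assumes a "train" split and the column names from the training config.
ds = load_dataset("sawadogosalif/MooreFRCollections", split="train")
print(ds[0]["source"])  # French sentence
print(ds[0]["target"])  # Moore sentence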

Limitations

  • The model performs best with standard French input text.
  • Performance may vary with highly technical, specialized, or colloquial language.
  • The model may not handle certain Moore dialectal variations perfectly.

Source Code

The training code is available in the SaChi repository.

Citation

@misc{sawadogo_moorefr_sachi,
  author = {Sawadogo, Salif},
  title = {MooreFR-SaChi-translationv0},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/sawadogosalif/MooreFR-SaChi-translationv0}}
}