nllb-200-600M-En-Ar

This model is a fine-tuned version of the NLLB-200-600M model, specifically adapted for translating from English to Egyptian Arabic. Fine-tuned on a custom dataset of 12,000 samples, it aims to provide high-quality translations that capture the nuances and colloquial expressions of Egyptian Arabic.

The dataset used for fine-tuning was collected from high-quality transcriptions of videos, ensuring the language data is rich and contextually accurate.

Model Details

Usage

To use this model for translation, you can load it with the transformers library:

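from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Mhassanen/nllb-200-600M-En-Ar"
# NLLB language codes: English (Latin script) -> Egyptian Arabic (Arabic script).
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn", tgt_lang="arz_Arab")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(text):
    # Accepts one string or a list of strings; always returns a list of translations.
    inputs = tokenizer(text, return_tensors="pt", padding=True)
    # Force the decoder to begin with the Egyptian Arabic language token.
    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("arz_Arab"),
    )
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)

text = "Hello, how are you?"
print(translate(text))  # prints a one-element list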
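
For longer inputs, it may help to translate in fixed-size batches rather than passing every sentence to the model at once. The sketch below is one possible way to do that; `batched`, `translate_many`, and the default `batch_size` of 8 are hypothetical names and values introduced for illustration, not part of this model or the transformers library. It assumes `translate_fn` takes a list of strings and returns a list of the same length.

```python
from typing import Callable, Iterable, List


def batched(items: List[str], batch_size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_many(
    sentences: List[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # illustrative default, tune for your hardware
) -> List[str]:
    """Translate `sentences` batch by batch, preserving the input order."""
    results: List[str] = []
    for batch in batched(sentences, batch_size):
        results.extend(translate_fn(batch))
    return results
```

Since the `translate` function above passes its input straight to the tokenizer with `padding=True`, it should accept a list of sentences, so a call such as `translate_many(["Hello.", "Good morning."], translate)` would be the intended usage.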

Performance

The model was evaluated on a validation set to check translation quality. It is tuned to capture colloquial Egyptian Arabic, and additional training data is expected to improve its performance further.
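
The card does not say which metric was used for validation. As a rough illustration of how translations could be scored against references, here is a minimal sketch of a whitespace-token overlap F1. It is not the author's evaluation method and it is not BLEU or chrF; `token_f1` and `mean_token_f1` are hypothetical helpers, and for reported results an established metric implementation would be the usual choice.

```python
from collections import Counter
from typing import List


def token_f1(hypothesis: str, reference: str) -> float:
    """F1 over whitespace tokens shared by a hypothesis and its reference."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not hyp_tokens or not ref_tokens:
        # Two empty strings count as a perfect match; one empty side scores 0.
        return float(hyp_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_token_f1(hypotheses: List[str], references: List[str]) -> float:
    """Average token F1 over aligned hypothesis/reference pairs."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    if not hypotheses:
        return 0.0
    scores = [token_f1(h, r) for h, r in zip(hypotheses, references)]
    return sum(scores) / len(scores)
```

Whitespace tokenization is a simplification here; Arabic text with attached clitics or punctuation would likely need a proper tokenizer before scores are meaningful.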

Limitations

  • Dataset Size: The custom dataset consists of 12,000 samples, which may limit coverage of diverse expressions and rare terms.
  • Colloquial Variations: Egyptian Arabic has many dialectal variations, which might not all be covered equally.

Acknowledgements

This model builds upon the NLLB-200-600M model developed by Meta AI (formerly Facebook AI), fine-tuned specifically for the Egyptian Arabic dialect.

Feel free to contribute or provide feedback to help improve this model!

Safetensors: model size 615M params, tensor type F32