Terjman-Ultra (1.3B)

Our model is built upon the Transformer architecture. It is a fine-tuned version of facebook/nllb-200-1.3B, trained on the darija_english dataset enhanced with curated corpora, with the aim of producing high-quality, accurate translations.

It achieves the following results on the evaluation set:

  • Loss: 2.7070
  • Bleu: 4.6998
  • Gen Len: 35.6088

The fine-tuning was conducted on an A100-40GB GPU and took 32 hours.

Try it out on our dedicated Terjman-Ultra Space 🤗

Usage

Using our model for translation is simple and straightforward. You can integrate it into your projects or workflows via the Hugging Face Transformers library. Here's a basic example of how to use the model in Python:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Ultra")
model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Ultra")

# Define the English text to translate into Moroccan Darija
input_text = "Your English text goes here."

# Tokenize the input text
input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Perform translation
output_tokens = model.generate(**input_tokens)

# Decode the output tokens
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Translation:", output_text)
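To translate more than one sentence, the same calls can be wrapped in a small batching helper. The following is a minimal sketch rather than an official API: the names load_terjman and translate_batch, the batch_size default, and the generation settings (max_new_tokens=128, num_beams=4) are illustrative assumptions, not values recommended by the authors.

```python
def load_terjman(model_id="atlasia/Terjman-Ultra"):
    """Load the tokenizer and model (downloads the weights on first use)."""
    # Imported lazily so translate_batch can be reused without this dependency.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def translate_batch(texts, tokenizer, model, batch_size=8,
                    max_new_tokens=128, num_beams=4):
    """Translate a list of English sentences, batch_size sentences at a time."""
    translations = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Pad to the longest sentence in the batch; truncate overly long inputs.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        # Generation settings here are assumptions, tune them for your use case.
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                 num_beams=num_beams)
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

A typical call would first run tokenizer, model = load_terjman() and then translate_batch(["Good morning!", "Where is the train station?"], tokenizer, model). Smaller batches reduce peak memory at the cost of more generate calls.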

Example

Let's see an example of translating English to Moroccan Darija:

Input: "Hi my friend, can you tell me a joke in moroccan darija? I'd be happy to hear that from you!"

Output: "أهلا صاحبي، تقدر تقولي مزحة بالدارجة المغربية؟ غادي نكون فرحان باش نسمعها منك!"

Limitations

This version has some limitations, mainly due to the tokenizer. We're currently collecting more data with the aim of continuous improvement.

Feedback

We're continuously striving to improve our model's performance and usability, and we will be improving it incrementally. If you have any feedback or suggestions, or if you encounter any issues, please don't hesitate to reach out to us.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.03
  • num_epochs: 25
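
Some values in this list are derived from the others. A rough sketch of the arithmetic, assuming a single GPU (consistent with the one A100 mentioned earlier), the 56050 total optimizer steps reported in the training results, a warmup count that is rounded up, and the usual linear-warmup-then-linear-decay shape for the "linear" scheduler. The variable and function names are illustrative.

```python
import math

train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1            # assumption: training ran on a single A100
learning_rate = 3e-05
warmup_ratio = 0.03
total_steps = 56050        # final step listed in the training results

# Effective batch size per optimizer step.
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices

# Number of warmup steps implied by the warmup ratio (assumed rounded up).
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_schedule_lr(step):
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    return learning_rate * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(total_train_batch_size)  # 16
print(warmup_steps)            # 1682
```

Under these assumptions the learning rate climbs to 3e-05 over roughly the first 1.7k steps and then decays linearly to zero at step 56050.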

Training results

| Training Loss | Epoch | Step | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|---|
| 3.203 | 0.9999 | 2242 | 2.9015 | 4.3057 | 36.7548 |
| 2.9175 | 1.9998 | 4484 | 2.7602 | 4.4286 | 35.708 |
| 2.8558 | 2.9997 | 6726 | 2.7303 | 4.629 | 35.562 |
| 2.8696 | 4.0 | 8969 | 2.7195 | 4.6537 | 35.562 |
| 2.8604 | 4.9999 | 11211 | 2.7144 | 4.6905 | 35.5702 |
| 2.8509 | 5.9998 | 13453 | 2.7112 | 4.599 | 35.5427 |
| 2.853 | 6.9997 | 15695 | 2.7098 | 4.6625 | 35.5317 |
| 2.8475 | 8.0 | 17938 | 2.7081 | 4.6901 | 35.6419 |
| 2.8192 | 8.9999 | 20180 | 2.7082 | 4.5474 | 35.6391 |
| 2.8395 | 9.9998 | 22422 | 2.7077 | 4.722 | 35.6088 |
| 2.8395 | 10.9997 | 24664 | 2.7076 | 4.752 | 35.5868 |
| 2.8362 | 12.0 | 26907 | 2.7074 | 4.6664 | 35.562 |
| 2.8673 | 12.9999 | 29149 | 2.7072 | 4.7004 | 35.6639 |
| 2.8465 | 13.9998 | 31391 | 2.7076 | 4.6715 | 35.5923 |
| 2.8281 | 14.9997 | 33633 | 2.7075 | 4.7045 | 35.5647 |
| 2.8191 | 16.0 | 35876 | 2.7068 | 4.7487 | 35.6253 |
| 2.874 | 16.9999 | 38118 | 2.7076 | 4.71 | 35.6006 |
| 2.8666 | 17.9998 | 40360 | 2.7069 | 4.6047 | 35.6281 |
| 2.8645 | 18.9997 | 42602 | 2.7063 | 4.6664 | 35.6088 |
| 2.8458 | 20.0 | 44845 | 2.7070 | 4.6552 | 35.5813 |
| 2.8501 | 20.9999 | 47087 | 2.7074 | 4.6919 | 35.5647 |
| 2.8309 | 21.9998 | 49329 | 2.7074 | 4.623 | 35.6226 |
| 2.854 | 22.9997 | 51571 | 2.7072 | 4.6495 | 35.5978 |
| 2.8407 | 24.0 | 53814 | 2.7070 | 4.6879 | 35.5482 |
| 2.8129 | 24.9972 | 56050 | 2.7070 | 4.6998 | 35.6088 |
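
The evaluation numbers reported at the top come from the final step (56050), but validation loss and BLEU move within a narrow band after the first few epochs, so another checkpoint may score slightly better on one metric. A small sketch for scanning the results, with the step, validation loss, and BLEU columns copied from the rows above. It only illustrates how one might compare checkpoints and says nothing about how the released weights were actually selected.

```python
# (step, validation_loss, bleu) copied from the training results
rows = [
    (2242, 2.9015, 4.3057), (4484, 2.7602, 4.4286), (6726, 2.7303, 4.629),
    (8969, 2.7195, 4.6537), (11211, 2.7144, 4.6905), (13453, 2.7112, 4.599),
    (15695, 2.7098, 4.6625), (17938, 2.7081, 4.6901), (20180, 2.7082, 4.5474),
    (22422, 2.7077, 4.722), (24664, 2.7076, 4.752), (26907, 2.7074, 4.6664),
    (29149, 2.7072, 4.7004), (31391, 2.7076, 4.6715), (33633, 2.7075, 4.7045),
    (35876, 2.7068, 4.7487), (38118, 2.7076, 4.71), (40360, 2.7069, 4.6047),
    (42602, 2.7063, 4.6664), (44845, 2.7070, 4.6552), (47087, 2.7074, 4.6919),
    (49329, 2.7074, 4.623), (51571, 2.7072, 4.6495), (53814, 2.7070, 4.6879),
    (56050, 2.7070, 4.6998),
]

# Checkpoint with the highest BLEU and the one with the lowest validation loss.
best_bleu = max(rows, key=lambda r: r[2])
lowest_loss = min(rows, key=lambda r: r[1])

print("Highest BLEU:", best_bleu)
print("Lowest validation loss:", lowest_loss)
```

On these numbers, BLEU peaks at step 24664 (4.752) and validation loss is lowest at step 42602 (2.7063). Both differ only slightly from the final checkpoint.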

Framework versions

  • Transformers 4.40.2
  • Pytorch 2.2.1+cu121
  • Datasets 2.19.1
  • Tokenizers 0.19.1