
AraT5-MSAizer

This model is a fine-tuned version of UBC-NLP/AraT5v2-base-1024 for translating five regional Arabic dialects into Modern Standard Arabic (MSA).

Intended uses & limitations

This model was developed for Task 2 (Dialect to MSA Machine Translation) of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT). It has only been evaluated on the development and test sets provided by the task organizers.

Training and evaluation data

The model was fine-tuned on a blend of four datasets. Three of them comprise 'gold' parallel MSA-dialect sentence pairs; the fourth, a 'silver' dataset, was generated by back-translating MSA sentences into dialect.

Gold parallel corpora

  • The Multi-Arabic Dialect Applications and Resources (MADAR) corpus
  • The North Levantine Corpus
  • The Parallel Arabic DIalect Corpus (PADIC)

Synthetic data

  • A back-translated subset of the Arabic sentences in OPUS
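
The back-translation step can be sketched as below. This is illustrative only: the MSA-to-dialect model identifier is hypothetical, and the actual pipeline is documented in the repository linked further down.

```python
# Illustrative sketch of building 'silver' pairs by back-translation:
# MSA sentences are translated into dialect, and each (generated dialect,
# original MSA) pair is kept as a training example.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

bt_model_id = "example-org/msa-to-dialect"  # hypothetical back-translation model
tokenizer = AutoTokenizer.from_pretrained(bt_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(bt_model_id)

msa_sentences = ["..."]  # MSA sentences drawn from OPUS (placeholder)
silver_pairs = []
for msa in msa_sentences:
    inputs = tokenizer(msa, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=512)[0]
    dialect = tokenizer.decode(output_ids, skip_special_tokens=True)
    silver_pairs.append({"source": dialect, "target": msa})  # dialect -> MSA
```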

Evaluation results

BLEU score on the development split of Task 2 (Dialect to MSA Machine Translation) at the 6th Workshop on Open-Source Arabic Corpora and Processing Tools:

Model          BLEU
AraT5-MSAizer  0.2302

Official evaluation results on the held-out test split:

Model          BLEU     COMET-DA
AraT5-MSAizer  0.2179   0.0016
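
For reference, BLEU on a 0-1 scale (matching the numbers above) can be computed with the Hugging Face evaluate library. This is a minimal sketch and not necessarily the task's official scorer:

```python
# Minimal BLEU scoring sketch using the `evaluate` library (0-1 scale).
# The official OSACT Task 2 scorer may use different tokenization/settings.
import evaluate

bleu = evaluate.load("bleu")
predictions = ["..."]   # model outputs (MSA hypotheses), placeholders
references = [["..."]]  # gold MSA references, one list per hypothesis
result = bleu.compute(predictions=predictions, references=references)
print(result["bleu"])
```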

Training procedure

The model was trained by fully fine-tuning UBC-NLP/AraT5v2-base-1024 for a single epoch. The maximum input length was set to 1024 tokens (as in the original pre-trained model), while the maximum generation length was set to 512 tokens.
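
A minimal inference sketch with these lengths, assuming the model is published under the Hugging Face ID Murhaf/AraT5-MSAizer (matching the GitHub repository name):

```python
# Minimal inference sketch; the model ID below is assumed from the
# GitHub repository name and may differ.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "Murhaf/AraT5-MSAizer"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

dialect_sentence = "..."  # a dialectal Arabic input sentence (placeholder)
inputs = tokenizer(dialect_sentence, return_tensors="pt",
                   truncation=True, max_length=1024)
output_ids = model.generate(**inputs, max_length=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```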

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • warmup_ratio: 0.05
  • num_epochs: 1
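
These settings map onto transformers.Seq2SeqTrainingArguments roughly as follows. This is a sketch of the configuration only; the authoritative script is in the repository linked below.

```python
# Sketch of the listed hyperparameters as Seq2SeqTrainingArguments.
# output_dir is a placeholder; see the linked repository for the real script.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="arat5-msaizer",        # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    adam_beta1=0.9,                    # Adam with betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    num_train_epochs=1,
)
```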

The full training script and configuration can be found at https://github.com/Murhaf/AraT5-MSAizer.


Framework versions

  • Transformers 4.38.1
  • Pytorch 2.0.1
  • Datasets 2.17.1
  • Tokenizers 0.15.2