
AraT5-MSAizer

This model is a fine-tuned version of UBC-NLP/AraT5v2-base-1024 for translating five regional Arabic dialects into Modern Standard Arabic (MSA).

Intended uses & limitations

This model was developed for Task 2 (Dialect to MSA Machine Translation) of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT). It has only been evaluated on the development and test sets provided by the task organizers.

Training and evaluation data

The model was fine-tuned on a blend of four datasets. Three of them comprise 'gold' parallel MSA-dialect sentence pairs; the fourth, a 'silver' dataset, was generated by back-translating MSA sentences into dialect.

Gold parallel corpora

  • The Multi-Arabic Dialect Applications and Resources (MADAR) corpus
  • The North Levantine Corpus
  • The Parallel Arabic DIalect Corpus (PADIC)

Synthetic data

  • A back-translated subset of the Arabic sentences in OPUS
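
The back-translation step can be sketched as below. This is illustrative only: the MSA-to-dialect model identifier is hypothetical, and the actual pipeline is documented in the repository linked further down.

```python
# Illustrative sketch of building 'silver' pairs by back-translation:
# MSA sentences are translated into dialect, and each (generated dialect,
# original MSA) pair is kept as a training example.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

bt_model_id = "example-org/msa-to-dialect"  # hypothetical back-translation model
tokenizer = AutoTokenizer.from_pretrained(bt_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(bt_model_id)

msa_sentences = ["..."]  # MSA sentences drawn from OPUS (placeholder)
silver_pairs = []
for msa in msa_sentences:
    inputs = tokenizer(msa, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=512)[0]
    dialect = tokenizer.decode(output_ids, skip_special_tokens=True)
    silver_pairs.append({"source": dialect, "target": msa})  # dialect -> MSA
```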

Evaluation results

BLEU score on the development split of Task 2 (Dialect to MSA Machine Translation) at the 6th Workshop on Open-Source Arabic Corpora and Processing Tools:

Model          BLEU
AraT5-MSAizer  0.2302

Official evaluation results on the held-out test split:

Model          BLEU     COMET-DA
AraT5-MSAizer  0.2179   0.0016
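
For reference, BLEU on a 0-1 scale (matching the numbers above) can be computed with the Hugging Face evaluate library. This is a minimal sketch and not necessarily the task's official scorer:

```python
# Minimal BLEU scoring sketch using the `evaluate` library (0-1 scale).
# The official OSACT Task 2 scorer may use different tokenization/settings.
import evaluate

bleu = evaluate.load("bleu")
predictions = ["..."]   # model outputs (MSA hypotheses), placeholders
references = [["..."]]  # gold MSA references, one list per hypothesis
result = bleu.compute(predictions=predictions, references=references)
print(result["bleu"])
```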

Training procedure

The model was trained by fully fine-tuning UBC-NLP/AraT5v2-base-1024 for a single epoch. The maximum input length was set to 1024 tokens (as in the original pre-trained model), while the maximum generation length was set to 512 tokens.
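
A minimal inference sketch with these lengths, assuming the model is published under the Hugging Face ID Murhaf/AraT5-MSAizer (matching the GitHub repository name):

```python
# Minimal inference sketch; the model ID below is assumed from the
# GitHub repository name and may differ.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "Murhaf/AraT5-MSAizer"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

dialect_sentence = "..."  # a dialectal Arabic input sentence (placeholder)
inputs = tokenizer(dialect_sentence, return_tensors="pt",
                   truncation=True, max_length=1024)
output_ids = model.generate(**inputs, max_length=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```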

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • warmup_ratio: 0.05
  • num_epochs: 1
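
These settings map onto transformers.Seq2SeqTrainingArguments roughly as follows. This is a sketch of the configuration only; the authoritative script is in the repository linked below.

```python
# Sketch of the listed hyperparameters as Seq2SeqTrainingArguments.
# output_dir is a placeholder; see the linked repository for the real script.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="arat5-msaizer",        # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    adam_beta1=0.9,                    # Adam with betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    num_train_epochs=1,
)
```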

The full training script and configuration can be found at https://github.com/Murhaf/AraT5-MSAizer.


Framework versions

  • Transformers 4.38.1
  • Pytorch 2.0.1
  • Datasets 2.17.1
  • Tokenizers 0.15.2