AraT5-msa-base

AraT5-msa-base is one of three models described in our paper, AraT5: Text-to-Text Transformers for Arabic Language Understanding and Generation. In this paper, we introduce three powerful Arabic-specific text-to-text transformer models trained on large Modern Standard Arabic (MSA) and/or Dialectal Arabic (DA) data. AraT5 is trained on 248GB of text (29B tokens) of MSA and DA, AraT5-msa is trained on 70GB of text (7.1B tokens) of MSA data, and AraT5-tweet is trained on 178GB of text (21.9B tokens) from 1.5B Arabic tweets, which contain multiple varieties of dialectal Arabic.

In addition, we provide the three models in two architectures, small and base. For all models, we use a learning rate of 0.01, a batch size of 128 sequences, and a maximum sequence length of 512, except for AraT5-tweet, where the maximum sequence length is 128. We use the original implementation of T5 in the TensorFlow framework to train the models. We train the models for 1M steps. Training took ∼80 days on one Google Cloud TPU with 8 cores (v3-8) from TensorFlow Research Cloud (TFRC).
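As a rough sense of scale, the step count, batch size, and maximum sequence length quoted above give an upper bound on the number of input token positions processed during pre-training. This is only a back-of-the-envelope sketch: it assumes every sequence is filled to its maximum length, which the text does not state, and the function name is made up for illustration.

```python
# Upper bound on input token positions processed during pre-training.
# Assumption (not stated in the README): every sequence is filled
# to the maximum sequence length.
STEPS = 1_000_000   # training steps, from the text
BATCH_SIZE = 128    # sequences per batch, from the text


def max_token_positions(max_seq_len: int) -> int:
    """Return steps x batch size x maximum sequence length."""
    return STEPS * BATCH_SIZE * max_seq_len


msa_positions = max_token_positions(512)    # AraT5 and AraT5-msa
tweet_positions = max_token_positions(128)  # AraT5-tweet

print(f"{msa_positions:,}")    # 65,536,000,000
print(f"{tweet_positions:,}")  # 16,384,000,000
```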

How to use AraT5 models

Below is an example of fine-tuning AraT5-base for News Title Generation on the Aranews dataset:

!python run_trainier_seq2seq_huggingface.py \
        --learning_rate 5e-5 \
        --max_target_length 128 --max_source_length 128 \
        --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
        --model_name_or_path "UBC-NLP/AraT5-base" \
        --output_dir "/content/AraT5_FT_title_generation" --overwrite_output_dir \
        --num_train_epochs 3 \
        --train_file "/content/ARGEn_title_genration_sample_train.tsv" \
        --validation_file "/content/ARGEn_title_genration_sample_valid.tsv" \
        --task "title_generation" --text_column "document" --summary_column "title" \
        --load_best_model_at_end --metric_for_best_model "eval_bleu" --greater_is_better True \
        --evaluation_strategy epoch --logging_strategy epoch --predict_with_generate \
        --do_train --do_eval
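The command above points `--train_file` and `--validation_file` at TSV files and names the columns `document` and `title`. A minimal sketch of preparing a file in that shape is shown below. It assumes the script expects a tab-separated file with a header row carrying those two column names; the sample row and the helper names `write_tsv` and `read_tsv` are purely illustrative.

```python
import csv
import os
import tempfile

# Column names taken from the --text_column / --summary_column flags.
TEXT_COLUMN = "document"
SUMMARY_COLUMN = "title"


def write_tsv(path, rows):
    """Write rows (dicts with both columns) as a tab-separated file with a header."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=[TEXT_COLUMN, SUMMARY_COLUMN], delimiter="\t"
        )
        writer.writeheader()
        writer.writerows(rows)


def read_tsv(path):
    """Read the file back as a list of dicts keyed by the header row."""
    with open(path, encoding="utf-8", newline="") as f:
        return [dict(row) for row in csv.DictReader(f, delimiter="\t")]


# Placeholder example row, not taken from any real dataset.
sample_rows = [
    {TEXT_COLUMN: "نص الخبر الكامل هنا", SUMMARY_COLUMN: "عنوان الخبر"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = os.path.join(tmp_dir, "sample_train.tsv")
    write_tsv(tsv_path, sample_rows)
    rows_back = read_tsv(tsv_path)

print(rows_back[0][SUMMARY_COLUMN])
```

Reading the file back before launching a training run is a cheap way to confirm that the header and delimiter match what the column flags expect.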

For more details about the fine-tuning example, please see the accompanying Colab notebook.

In addition, we release the fine-tuned checkpoint for News Title Generation (NTG), which is described in the paper. The model is available on Hugging Face (UBC-NLP/AraT5-base-title-generation).
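A minimal sketch of generating a title with the released checkpoint through the Hugging Face `transformers` library might look like the following. It assumes `transformers`, `torch`, and `sentencepiece` are installed; the helper name `generate_title`, the beam size, and the length limits are illustrative choices, not settings reported in the paper.

```python
MODEL_NAME = "UBC-NLP/AraT5-base-title-generation"  # checkpoint named in the text


def generate_title(document: str,
                   max_source_length: int = 128,
                   max_target_length: int = 128) -> str:
    """Generate a news title for one Arabic document (illustrative helper)."""
    # Imported lazily so this file can be loaded without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    # Truncate long inputs to the chosen source length.
    inputs = tokenizer(
        document,
        max_length=max_source_length,
        truncation=True,
        return_tensors="pt",
    )
    # Beam search with 4 beams is an assumed, commonly used setting.
    output_ids = model.generate(
        **inputs, max_length=max_target_length, num_beams=4
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the checkpoint on first use):
# print(generate_title("نص الخبر الكامل هنا"))
```

The helper reloads the tokenizer and model on every call to keep the sketch short; for more than a handful of documents, load them once and reuse them.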

For more details, please visit our GitHub repository.

AraT5 Models Checkpoints

AraT5 PyTorch and TensorFlow checkpoints are available on the Hugging Face website for direct download and are for research use only. For commercial use, please contact the authors via email (muhammad.mageed[at]ubc[dot]ca).

BibTex

If you use our models (AraT5-base, AraT5-msa-base, AraT5-tweet-base, AraT5-msa-small, or AraT5-tweet-small) in a scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated):

@inproceedings{araT5-2021,
    title = "{AraT5: Text-to-Text Transformers for Arabic Language Understanding and Generation}",
    author = "Nagoudi, El Moatez Billah  and
      Elmadany, AbdelRahim  and
      Abdul-Mageed, Muhammad",
    booktitle = "https://arxiv.org/abs/2109.12068",
    month = aug,
    year = "2021"}

Acknowledgments

We gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Canada Foundation for Innovation, Compute Canada, and UBC ARC-Sockeye. We also thank the Google TensorFlow Research Cloud (TFRC) program for providing us with free TPU access.