This is a T5v1.1 (small) model trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets.
Due to time limitations, the model could only be trained on about 10% of the whole dataset. This is equivalent to 22,000 steps or about 4.3 billion tokens.
| Hyperparameter        | Value |
|:----------------------|:------|
| Training batch size   |       |
| Evaluation batch size |       |
## Preprocessing and the tokenizer
We tried to keep the preprocessing to a bare minimum: we only replaced URLs, emails, and social media user mentions with fixed tokens.
Contrary to other pretrained Arabic LMs, we decided not to strip the Arabic diacritics and kept them as part of the vocabulary.
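As a rough illustration of this step, here is a minimal sketch of such a replacement pass. The regex patterns, the placeholder tokens (`<url>`, `<email>`, `<user>`), and the `preprocess` function name are our assumptions for illustration, not the authors' actual choices:

```python
import re

# Illustrative patterns; the exact patterns and placeholder tokens used
# for pretraining are not shown in this card and are assumed here.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
MENTION_RE = re.compile(r"@\w+")

def preprocess(text: str) -> str:
    text = URL_RE.sub("<url>", text)
    text = EMAIL_RE.sub("<email>", text)
    text = MENTION_RE.sub("<user>", text)
    return text  # Arabic diacritics are deliberately left untouched
```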
The tokenizer was trained on 5% of the training set, with a vocabulary size of
For more details about preprocessing, check the tokenizer code.
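For orientation, the sketch below trains a SentencePiece unigram model (the tokenizer type T5 uses) on a sample of the corpus. The input file name, vocabulary size, and other settings are placeholders, not the card's actual configuration:

```python
import sentencepiece as spm

# Hypothetical training call; file name and vocab size are assumptions.
spm.SentencePieceTrainer.train(
    input="train_sample_5pct.txt",  # hypothetical 5% sample of the training set
    model_prefix="arabic_t5_sp",
    model_type="unigram",
    vocab_size=32000,               # placeholder; the actual value is not shown above
    character_coverage=1.0,         # full coverage so diacritics stay in the vocabulary
)
```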
## Training data

The model was trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets.
A 0.1% subset of the data was reserved for evaluation and the rest for training.
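A minimal sketch of how such a split can be produced with the `datasets` library, shown here for only one of the three source corpora; the splitting method and seed are assumptions, not the authors' actual code:

```python
from datasets import load_dataset

# Load the Arabic subset of Oscar and carve out a 0.1% evaluation split.
dataset = load_dataset("oscar", "unshuffled_deduplicated_ar", split="train")
split = dataset.train_test_split(test_size=0.001, seed=42)  # seed is an assumption
train_ds, eval_ds = split["train"], split["test"]
```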
## Note for finetuning
This model was pretrained with dropout turned off, so the default `dropout_rate` in the model config is `0`.
To finetune the model, dropout should be turned back on, like this:
```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("flax-community/arabic-t5-small", dropout_rate=0.1)
```

or, using the auto class:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("flax-community/arabic-t5-small", dropout_rate=0.1)
```
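After loading, you can confirm the setting via `model.config.dropout_rate`, which should now read `0.1`.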