salti committed
Commit e4555b1
1 parent: fd87949

Update README.md

Files changed (1)
  1. README.md +23 -6
README.md CHANGED
@@ -9,26 +9,43 @@ datasets:
 
 # arabic-t5-small
 
- This is a T5v1.1 (small) trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets. The model could only be trained for about `10%` of the whole dataset due to time limitations.
+ This is a T5v1.1 (small) model trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets.
+
+ The model could only be trained on about `10%` of the whole dataset due to time limitations. This is equivalent to `22'000` steps, or about `4.3` billion tokens.
 
 ## Training parameters
 
 | | |
 | :-------------------: | :-----------: |
- | steps | `22'000` |
 | Training batch size | `384` |
 | Evaluation batch size | `768` |
 | learning rate | `1e-2` |
 | dtype | `jnp.float32` |
 
+ ## Preprocessing and the tokenizer
+
+ We tried to keep the preprocessing to the bare minimum. We only replaced URLs, emails, and social media user mentions with fixed tokens.
+
+ Unlike other pretrained Arabic LMs, we decided not to strip the Arabic diacritics and to keep them in the vocabulary.
+
+ The tokenizer was trained on `5%` of the training set, with a vocabulary size of `64'000`.
+
+ For more details about the preprocessing, check the [tokenizer code](https://huggingface.co/flax-community/arabic-t5-small/blob/main/t5_tokenizer_model.py).
+
+ ## Data
+
+ The model was trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets.
+
+ A random `0.1%` subset of the data was reserved for evaluation and the rest for training.
+
 ## Results
 
 | | |
 | :-----------------: | :-----------: |
- | evaluation accuracy | `56.84%` |
- | evaluation loss | `2.423` |
- | training loss | `2.392` |
- | training time | `22h 23m 51s` |
+ | Evaluation accuracy | `56.84%` |
+ | Evaluation loss | `2.423` |
+ | Training loss | `2.392` |
+ | Training time | `22h 23m 51s` |
 
 ## Note for finetuning
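
A quick sanity check of the `4.3` billion token figure in the updated description, assuming the usual T5 pretraining input length of 512 tokens per sequence (the sequence length is not stated in the README, so treat it as an assumption):

```python
# Back-of-the-envelope token count for 22'000 steps at batch size 384.
# The 512-token sequence length is an assumption (typical for T5
# span-corruption pretraining); it is not given in the README.
steps = 22_000
train_batch_size = 384      # from the "Training parameters" table
seq_len = 512               # assumed encoder input length
tokens_seen = steps * train_batch_size * seq_len
print(f"{tokens_seen:,}")   # 4,325,376,000 -> roughly 4.3 billion tokens
```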
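
The tokenizer link in the diff points at the `flax-community/arabic-t5-small` repository; a minimal sketch of loading that checkpoint in Flax with the `jnp.float32` dtype listed in the training parameters (the classes used are the standard `transformers` Flax classes, not anything defined by this README):

```python
import jax.numpy as jnp
from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration

# Load the checkpoint in the same dtype it was trained in (jnp.float32,
# per the "Training parameters" table). Repo id taken from the tokenizer
# link in the README.
repo_id = "flax-community/arabic-t5-small"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = FlaxT5ForConditionalGeneration.from_pretrained(repo_id, dtype=jnp.float32)
```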
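
The "Preprocessing and the tokenizer" section says URLs, emails, and social media user mentions were replaced with fixed tokens while diacritics were left alone. A minimal sketch of that kind of preprocessing is below; the regex patterns and the placeholder strings `<url>`, `<email>`, and `<user>` are illustrative, and the actual rules are the ones in the linked `t5_tokenizer_model.py`.

```python
import re

# Illustrative preprocessing: map URLs, emails, and @mentions to fixed
# placeholder tokens, and otherwise leave the text (including Arabic
# diacritics) untouched. Patterns and token strings are assumptions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MENTION_RE = re.compile(r"@\w+")

def preprocess(text: str) -> str:
    text = URL_RE.sub("<url>", text)
    text = EMAIL_RE.sub("<email>", text)   # before mentions, so emails get their own token
    text = MENTION_RE.sub("<user>", text)
    return text

print(preprocess("راسلنا على info@example.com أو تابع @user1 عبر https://example.com"))
```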
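
For the tokenizer described in the diff (a unigram vocabulary of `64'000` pieces fit on a `5%` sample of the training set, diacritics kept), a comparable recipe with the `sentencepiece` package could look like the sketch below. The repository's `t5_tokenizer_model.py` is the authoritative implementation; the file path and the extra symbols here are placeholders.

```python
import sentencepiece as spm

# Sketch of training a 64'000-piece unigram tokenizer on a ~5% text sample.
# "corpus_sample.txt" is a placeholder path; the placeholder tokens match the
# preprocessing sketch above and are assumptions, not the model's real tokens.
spm.SentencePieceTrainer.train(
    input="corpus_sample.txt",
    model_prefix="arabic_t5_unigram",
    model_type="unigram",
    vocab_size=64_000,
    character_coverage=1.0,   # cover all characters, so diacritics stay representable
    user_defined_symbols=["<url>", "<email>", "<user>"],
)
```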
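
The `0.1%` evaluation split described under "Data" maps onto a one-liner in the `datasets` library. The sketch below uses only the Arabic mC4 subset as a stand-in for the full concatenated corpus, and the seed is arbitrary:

```python
from datasets import load_dataset

# Reserve a random 0.1% of the data for evaluation, as described in the README.
# Only Arabic mC4 is loaded here as a stand-in; the real corpus also includes
# the Arabic Billion Words corpus and the Arabic subset of OSCAR.
raw = load_dataset("mc4", "ar", split="train")
splits = raw.train_test_split(test_size=0.001, seed=0)
train_ds, eval_ds = splits["train"], splits["test"]
print(len(train_ds), len(eval_ds))
```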