Update README.md

# arabic-t5-small

This is a T5v1.1 (small) model trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets.
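
The checkpoint lives on the Hugging Face Hub at [`flax-community/arabic-t5-small`](https://huggingface.co/flax-community/arabic-t5-small). Below is a minimal loading sketch using the standard Hugging Face Transformers Flax classes; since the checkpoint is pretrained only (see the finetuning note at the end of this card), the example just runs the encoder rather than generating text.

```python
# Minimal loading sketch (requires `transformers` with the Flax/JAX extras installed).
from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration

model_id = "flax-community/arabic-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = FlaxT5ForConditionalGeneration.from_pretrained(model_id)

# Encode some Arabic text; the pretrained-only checkpoint is meant as a
# starting point for finetuning, not for direct generation.
inputs = tokenizer("مرحبا بالعالم", return_tensors="np")
encoder_outputs = model.encode(**inputs)
print(encoder_outputs.last_hidden_state.shape)
```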

The model could only be trained for about `10%` of the whole dataset due to time limitations. This is equivalent to `22'000` steps, or about `4.3` billion tokens.
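
As a rough sanity check on these numbers: `22'000` steps × `384` sequences per step × `512` tokens per sequence ≈ `4.3` billion tokens (the `512` input length is the usual T5 pretraining setting and is assumed here rather than taken from the card).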

## Training parameters

| Parameter | Value |
| :-------------------: | :-----------: |
| Training batch size | `384` |
| Evaluation batch size | `768` |
| Learning rate | `1e-2` |
| dtype | `jnp.float32` |
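
The training script itself is not reproduced on this card. As a rough illustration of how the hyperparameters above would typically be wired together in a JAX/Flax setup, here is a sketch using Optax; the optimizer choice (Adafactor, as in the original T5 recipe) and the dummy parameters are assumptions of the sketch, and only the numbers come from the table.

```python
import jax
import jax.numpy as jnp
import optax

# Sketch only: the numbers come from the table above; Adafactor is an
# assumed optimizer choice, not a record of the actual training run.
train_batch_size = 384
eval_batch_size = 768
learning_rate = 1e-2
num_train_steps = 22_000

optimizer = optax.adafactor(learning_rate=learning_rate)

# Dummy float32 parameters, just to show the optimizer state being set up
# in the dtype listed above.
params = {"dense": {"kernel": jnp.zeros((8, 8), dtype=jnp.float32)}}
opt_state = optimizer.init(params)

# One illustrative update step with fake gradients.
grads = jax.tree_util.tree_map(jnp.ones_like, params)
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```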

## Preprocessing and the tokenizer

We tried to keep the preprocessing to the bare minimum. We only replaced URLs, emails and social media user mentions with fixed tokens.
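
For illustration, a sketch of this kind of replacement in Python; the placeholder tokens (`[URL]`, `[EMAIL]`, `[USER]`) and the regular expressions are assumptions, not the exact ones used for this model (those live in the repository's preprocessing code).

```python
import re

# Illustrative only: placeholder strings and regexes are assumptions,
# not the exact ones used for this model.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
USER_RE = re.compile(r"@\w+")

def preprocess(text: str) -> str:
    """Replace URLs, emails and user mentions with fixed tokens."""
    text = URL_RE.sub("[URL]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = USER_RE.sub("[USER]", text)
    return text

print(preprocess("تواصل معنا عبر example@mail.com أو @user أو https://example.com"))
```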

Contrary to other pretrained Arabic LMs, we decided not to strip the Arabic diacritics and to keep them in the vocabulary.

The tokenizer was trained on `5%` of the training set, with a vocabulary size of `64'000`.

For more details about preprocessing, check the [tokenizer code](https://huggingface.co/flax-community/arabic-t5-small/blob/main/t5_tokenizer_model.py).
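
As a rough sketch of that step, assuming the SentencePiece-unigram implementation from the Hugging Face `tokenizers` library and a hypothetical `corpus_sample.txt` file holding the `5%` sample (the linked script above is the authoritative version):

```python
from tokenizers import SentencePieceUnigramTokenizer

# Hypothetical sketch: "corpus_sample.txt" stands in for a 5% sample of the
# training set, and the special tokens listed are the standard T5 ones.
tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train(
    files=["corpus_sample.txt"],
    vocab_size=64_000,
    special_tokens=["<pad>", "</s>", "<unk>"],
)
tokenizer.save("tokenizer.json")
```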

## Data

The model was trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets.

A random `0.1%` subset of the data was reserved for evaluation and the rest for training.
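
A sketch of how such a mixture and split could be assembled with the Hugging Face `datasets` library; the dataset identifiers and configuration names below are illustrative assumptions, not necessarily the exact ones used for this run.

```python
from datasets import load_dataset, concatenate_datasets

# Identifiers are assumptions for illustration; the actual run concatenated
# the Arabic Billion Words corpus with the Arabic subsets of mC4 and Oscar.
oscar_ar = load_dataset("oscar", "unshuffled_deduplicated_ar", split="train")
mc4_ar = load_dataset("mc4", "ar", split="train")
abw = load_dataset("arabic_billion_words", "Alittihad", split="train")

# Keep only the text column so the three datasets share the same schema.
corpus = concatenate_datasets([
    d.remove_columns([c for c in d.column_names if c != "text"])
    for d in (oscar_ar, mc4_ar, abw)
])

# Hold out a random 0.1% of the data for evaluation.
splits = corpus.train_test_split(test_size=0.001, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```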

## Results

| Metric | Value |
| :-----------------: | :-----------: |
| Evaluation accuracy | `56.84%` |
| Evaluation loss | `2.423` |
| Training loss | `2.392` |
| Training time | `22h 23m 51s` |

## Note for finetuning