salti committed
Commit e4555b1
1 parent: fd87949

Update README.md

Files changed (1)
  1. README.md +23 -6
README.md CHANGED
@@ -9,26 +9,43 @@ datasets:
 
 # arabic-t5-small
 
- This is a T5v1.1 (small) trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets. The model could only be trained for about `10%` of the whole dataset due to time limitations.
+ This is a T5v1.1 (small) model trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets.
+
+ The model could only be trained on about `10%` of the whole dataset due to time limitations. This is equivalent to `22'000` steps, or about `4.3` billion tokens.
 
 ## Training parameters
 
 | | |
 | :-------------------: | :-----------: |
- | steps | `22'000` |
 | Training batch size | `384` |
 | Evaluation batch size | `768` |
 | learning rate | `1e-2` |
 | dtype | `jnp.float32` |
 
+ ## Preprocessing and the tokenizer
+
+ We tried to keep the preprocessing to the bare minimum. We only replaced URLs, emails, and social media user mentions with fixed tokens.
+
+ Unlike other pretrained Arabic LMs, we decided not to strip the Arabic diacritics and to keep them in the vocabulary.
+
+ The tokenizer was trained on `5%` of the training set, with a vocabulary size of `64'000`.
+
+ For more details about the preprocessing, check the [tokenizer code](https://huggingface.co/flax-community/arabic-t5-small/blob/main/t5_tokenizer_model.py).
+
+ ## Data
+
+ The model was trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets.
+
+ A random `0.1%` subset of the data was reserved for evaluation and the rest for training.
+
 ## Results
 
 | | |
 | :-----------------: | :-----------: |
- | evaluation accuracy | `56.84%` |
- | evaluation loss | `2.423` |
- | training loss | `2.392` |
- | training time | `22h 23m 51s` |
+ | Evaluation accuracy | `56.84%` |
+ | Evaluation loss | `2.423` |
+ | Training loss | `2.392` |
+ | Training time | `22h 23m 51s` |
 
 ## Note for finetuning
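
A quick sanity check of the `4.3` billion token figure in the updated description, assuming the usual T5 pretraining input length of 512 tokens per sequence (the sequence length is not stated in the README, so treat it as an assumption):

```python
# Back-of-the-envelope token count for 22'000 steps at batch size 384.
# The 512-token sequence length is an assumption (typical for T5
# span-corruption pretraining); it is not given in the README.
steps = 22_000
train_batch_size = 384      # from the "Training parameters" table
seq_len = 512               # assumed encoder input length
tokens_seen = steps * train_batch_size * seq_len
print(f"{tokens_seen:,}")   # 4,325,376,000 -> roughly 4.3 billion tokens
```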
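
The tokenizer link in the diff points at the `flax-community/arabic-t5-small` repository; a minimal sketch of loading that checkpoint in Flax with the `jnp.float32` dtype listed in the training parameters (the classes used are the standard `transformers` Flax classes, not anything defined by this README):

```python
import jax.numpy as jnp
from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration

# Load the checkpoint in the same dtype it was trained in (jnp.float32,
# per the "Training parameters" table). Repo id taken from the tokenizer
# link in the README.
repo_id = "flax-community/arabic-t5-small"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = FlaxT5ForConditionalGeneration.from_pretrained(repo_id, dtype=jnp.float32)
```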
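
The "Preprocessing and the tokenizer" section says URLs, emails, and social media user mentions were replaced with fixed tokens while diacritics were left alone. A minimal sketch of that kind of preprocessing is below; the regex patterns and the placeholder strings `<url>`, `<email>`, and `<user>` are illustrative, and the actual rules are the ones in the linked `t5_tokenizer_model.py`.

```python
import re

# Illustrative preprocessing: map URLs, emails, and @mentions to fixed
# placeholder tokens, and otherwise leave the text (including Arabic
# diacritics) untouched. Patterns and token strings are assumptions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MENTION_RE = re.compile(r"@\w+")

def preprocess(text: str) -> str:
    text = URL_RE.sub("<url>", text)
    text = EMAIL_RE.sub("<email>", text)   # before mentions, so emails get their own token
    text = MENTION_RE.sub("<user>", text)
    return text

print(preprocess("راسلنا على info@example.com أو تابع @user1 عبر https://example.com"))
```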
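
For the tokenizer described in the diff (a unigram vocabulary of `64'000` pieces fit on a `5%` sample of the training set, diacritics kept), a comparable recipe with the `sentencepiece` package could look like the sketch below. The repository's `t5_tokenizer_model.py` is the authoritative implementation; the file path and the extra symbols here are placeholders.

```python
import sentencepiece as spm

# Sketch of training a 64'000-piece unigram tokenizer on a ~5% text sample.
# "corpus_sample.txt" is a placeholder path; the placeholder tokens match the
# preprocessing sketch above and are assumptions, not the model's real tokens.
spm.SentencePieceTrainer.train(
    input="corpus_sample.txt",
    model_prefix="arabic_t5_unigram",
    model_type="unigram",
    vocab_size=64_000,
    character_coverage=1.0,   # cover all characters, so diacritics stay representable
    user_defined_symbols=["<url>", "<email>", "<user>"],
)
```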
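
The `0.1%` evaluation split described under "Data" maps onto a one-liner in the `datasets` library. The sketch below uses only the Arabic mC4 subset as a stand-in for the full concatenated corpus, and the seed is arbitrary:

```python
from datasets import load_dataset

# Reserve a random 0.1% of the data for evaluation, as described in the README.
# Only Arabic mC4 is loaded here as a stand-in; the real corpus also includes
# the Arabic Billion Words corpus and the Arabic subset of OSCAR.
raw = load_dataset("mc4", "ar", split="train")
splits = raw.train_test_split(test_size=0.001, seed=0)
train_ds, eval_ds = splits["train"], splits["test"]
print(len(train_ds), len(eval_ds))
```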