---
language:
  - ar
datasets:
  - mc4
  - oscar
  - arabic_billion_words
---

# arabic-t5-small

This is a T5v1.1 (small) model trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and OSCAR datasets.

Due to time constraints, the model could only be trained on about `10%` of the whole dataset. This is equivalent to `22'000` steps, or about `4.3` billion tokens.
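
As a rough sanity check, the token count follows from the step count and batch size, assuming the standard T5 input length of 512 tokens per example (the sequence length is not stated in this card):

```python
# Back-of-the-envelope token count; seq_len = 512 is an assumption
# (the T5 default), not a value reported in this card.
steps = 22_000
batch_size = 384
seq_len = 512
tokens_seen = steps * batch_size * seq_len
print(f"{tokens_seen:,}")  # 4,325,376,000 ≈ 4.3 billion
```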

## Training parameters

|       Parameter       |     Value     |
| :-------------------: | :-----------: |
|  Training batch size  |     `384`     |
| Evaluation batch size |     `768`     |
|     Learning rate     |    `1e-2`     |
|         dtype         | `jnp.float32` |
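
The optimizer and learning-rate schedule are not listed above. As a hypothetical reference only, a Flax setup consistent with these values (Adafactor is a common choice for T5 pretraining at a `1e-2` learning rate) could look like this:

```python
import jax.numpy as jnp
import optax
from transformers import FlaxT5ForConditionalGeneration, T5Config

# Hypothetical sketch: the actual optimizer and schedule are not stated in this card.
config = T5Config.from_pretrained("flax-community/arabic-t5-small")
model = FlaxT5ForConditionalGeneration(config, seed=0, dtype=jnp.float32)
optimizer = optax.adafactor(learning_rate=1e-2)
```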

## Preprocessing and the tokenizer

We kept the preprocessing to a bare minimum: we only replaced URLs, email addresses, and social media user mentions with fixed tokens.
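
A minimal sketch of this kind of replacement is shown below; the patterns and placeholder tokens are illustrative, not the exact ones used for this model (see the linked tokenizer code for the real preprocessing):

```python
import re

# Illustrative patterns and placeholders only; the actual preprocessing lives in
# t5_tokenizer_model.py on the model repo.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MENTION_RE = re.compile(r"@\w+")

def preprocess(text: str) -> str:
    text = URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAIL>", text)    # replace emails before bare @mentions
    text = MENTION_RE.sub("<USER>", text)
    return text
```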

Unlike other pretrained Arabic LMs, we decided not to strip the Arabic diacritics and kept them as part of the vocabulary.

The tokenizer was trained on `5%` of the training set, with a vocabulary size of `64'000`.

For more details about the preprocessing, see the [tokenizer code](https://huggingface.co/flax-community/arabic-t5-small/blob/main/t5_tokenizer_model.py).
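
For reference, a tokenizer with these settings could be trained along these lines (a hedged sketch using SentencePiece; the file name and options are illustrative, and the actual training script is the linked `t5_tokenizer_model.py`):

```python
import sentencepiece as spm

# Hedged sketch: trains a unigram SentencePiece model with the vocabulary size
# reported above. The input file name is hypothetical.
spm.SentencePieceTrainer.train(
    input="train_subset_5pct.txt",   # hypothetical 5% sample of the training text
    model_prefix="arabic_t5_spm",
    vocab_size=64_000,
    model_type="unigram",
    character_coverage=1.0,          # keep diacritics and other Arabic characters
)
```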

## Data

The model was trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and OSCAR datasets.

A random `0.1%` subset of the data was reserved for evaluation and the rest for training.
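
With the `datasets` library, such a split can be produced along these lines (an illustrative sketch: the dataset shown here stands in for the full concatenated corpus, and the seed is arbitrary):

```python
from datasets import load_dataset

# Illustrative split; in practice the three corpora above are concatenated first.
dataset = load_dataset("oscar", "unshuffled_deduplicated_ar", split="train")
split = dataset.train_test_split(test_size=0.001, seed=42)
train_ds, eval_ds = split["train"], split["test"]
```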

## Results

|       Metric        |     Value     |
| :-----------------: | :-----------: |
| Evaluation accuracy |   `56.84%`    |
|   Evaluation loss   |    `2.423`    |
|    Training loss    |    `2.392`    |
|    Training time    | `22h 23m 51s` |

## Note for finetuning

This model was pretrained with dropout turned off, so the default `dropout_rate` in the model config is `0`.
To finetune the model, dropout should be turned back on, for example:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("flax-community/arabic-t5-small", dropout_rate=0.1)
```

or, equivalently, using the Auto class:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("flax-community/arabic-t5-small", dropout_rate=0.1)
```