---
language:
- he 
tags:
- language model
license: apache-2.0
datasets:
- oscar
- wikipedia
- twitter 
---

# AlephBERT

## Hebrew Language Model

A state-of-the-art language model for Hebrew, based on the BERT architecture.

### How to use

```python
from transformers import BertModel, BertTokenizerFast

alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# if not fine-tuning, disable dropout by putting the model in evaluation mode
alephbert.eval()
```
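A minimal usage sketch (the Hebrew sentence and variable names are illustrative, not from the original card): encode a sentence and inspect the contextual embeddings returned by the encoder.

```python
import torch

# Illustrative example: encode a short Hebrew sentence ("hello world")
# and look at the contextual embeddings produced by the model.
inputs = alephbert_tokenizer('שלום עולם', return_tensors='pt')
with torch.no_grad():
    outputs = alephbert(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```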

## Training data

- OSCAR (10G text, 20M sentences)
- Wikipedia dump (0.6G text, 3M sentences)
- Tweets (7G text, 70M sentences)

## Training procedure

Trained on a DGX machine (8 V100 GPUs) using the standard Hugging Face training procedure.

To optimize training time, we split the data into 4 sections based on the maximum number of tokens per sentence (a sketch of this bucketing appears below):

1. num tokens < 32 (70M sentences)
2. 32 <= num tokens < 64 (12M sentences)
3. 64 <= num tokens < 128 (10M sentences)
4. 128 <= num tokens < 512 (70M sentences)

Each section was trained for 5 epochs with an initial learning rate set to 1e-4.

Total training time was 5 days.
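
Grouping sentences by length keeps padding short within each bucket, which is presumably where the time savings come from. The following is an illustrative sketch of such a bucketing step (not the original training script); it assumes sentences are assigned to buckets by their tokenized length:

```python
def bucket_by_length(sentences, tokenizer):
    """Assign sentences to the four length buckets described above.

    Illustrative sketch only; the actual training script is not part of this card.
    """
    buckets = {'<32': [], '32-63': [], '64-127': [], '128-511': []}
    for sent in sentences:
        n = len(tokenizer(sent)['input_ids'])
        if n < 32:
            buckets['<32'].append(sent)
        elif n < 64:
            buckets['32-63'].append(sent)
        elif n < 128:
            buckets['64-127'].append(sent)
        elif n < 512:
            buckets['128-511'].append(sent)
        # sentences of 512+ tokens are skipped in this sketch
    return buckets
```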

## Eval