Elron committed on
Commit
0643e83
1 Parent(s): c46ed17

Create README.md

Files changed (1)
1. README.md +53 -0
README.md ADDED

---
language:
- he
tags:
- language model
license: apache-2.0
datasets:
- oscar
- wikipedia
- twitter
---

# AlephBERT

## Hebrew Language Model

State-of-the-art language model for Hebrew.
Based on Google's BERT architecture [(Devlin et al. 2018)](https://arxiv.org/abs/1810.04805).

#### How to use

```python
from transformers import BertModel, BertTokenizerFast

alephbert_tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
alephbert = BertModel.from_pretrained('onlplab/alephbert-base')

# when not fine-tuning, put the model in eval mode to disable dropout
alephbert.eval()
```
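
For example, contextual embeddings for a Hebrew sentence can be extracted as follows (a minimal sketch; the example sentence and variable names are illustrative):

```python
import torch

# encode an arbitrary Hebrew sentence ("Hello world")
inputs = alephbert_tokenizer('שלום עולם', return_tensors='pt')

# run the model without gradient tracking (inference only)
with torch.no_grad():
    outputs = alephbert(**inputs)

# one contextual vector per token; hidden size is 768 for a base-sized BERT
token_embeddings = outputs.last_hidden_state
# a common single-vector sentence representation: the pooled [CLS] output
sentence_embedding = outputs.pooler_output
```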

## Training data
1. OSCAR [(Ortiz, 2019)](https://oscar-corpus.com/), Hebrew section (10 GB of text, 20 million sentences).
2. Hebrew dump of [Wikipedia](https://dumps.wikimedia.org/hewiki/latest/) (650 MB of text, 3 million sentences).
3. Hebrew tweets collected from the Twitter sample stream (7 GB of text, 70 million sentences).

## Training procedure

Trained on a DGX machine (8 V100 GPUs) using the standard Hugging Face training procedure.

Since the larger part of our training data consists of tweets, we decided to start by optimizing with the Masked Language Modeling (MLM) loss only.

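In Hugging Face terms, MLM-only optimization means masking part of the input tokens and minimizing the cross-entropy of the model's predictions at the masked positions. The following is a minimal sketch of that loss computation (the sentence, the 15% masking probability, and the use of the released checkpoint are illustrative, not the exact pretraining setup):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')
model = BertForMaskedLM.from_pretrained('onlplab/alephbert-base')

# dynamically mask 15% of the tokens in each batch (standard BERT-style MLM)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
# "this is an example of training a Hebrew language model"
batch = collator([tokenizer('זוהי דוגמה לאימון מודל שפה בעברית')])

# with masked labels in the batch, the model returns the MLM cross-entropy loss
loss = model(**batch).loss
loss.backward()  # in real pretraining, an optimizer step would follow
```
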
To reduce training time, we split the data into 4 sections based on the maximum number of tokens per sentence (see the sketch below):

1. num tokens < 32 (70M sentences)
2. 32 <= num tokens < 64 (12M sentences)
3. 64 <= num tokens < 128 (10M sentences)
4. 128 <= num tokens < 512 (1.5M sentences)

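As a rough illustration of such bucketing (the thresholds mirror the list above; the helper function and example sentences are assumptions, not our actual preprocessing code):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')

# upper bounds (exclusive) of the four length sections described above
BOUNDARIES = (32, 64, 128, 512)

def section_index(sentence: str) -> int:
    """Return the index (0-3) of the length section a sentence falls into."""
    num_tokens = len(tokenizer(sentence, truncation=True, max_length=512)['input_ids'])
    for i, upper in enumerate(BOUNDARIES):
        if num_tokens < upper:
            return i
    return len(BOUNDARIES) - 1  # sentences truncated to 512 tokens stay in the last section

sections = {i: [] for i in range(len(BOUNDARIES))}
for sentence in ['שלום עולם', 'זוהי דוגמה ארוכה יותר']:
    sections[section_index(sentence)].append(sentence)
```
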
Each section was first trained for 5 epochs with an initial learning rate of 1e-4, and then for another 5 epochs with an initial learning rate of 1e-5, for a total of 10 epochs per section.

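With the standard Hugging Face Trainer, this two-phase schedule corresponds roughly to the sketch below; the from-scratch model, the tiny stand-in dataset, the batch size, and the output paths are illustrative assumptions, while the epoch counts and learning rates follow the schedule above.

```python
from datasets import Dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('onlplab/alephbert-base')

# a base-sized BERT initialized from scratch, reusing the released tokenizer's vocabulary
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# tiny stand-in for one length section of the real OSCAR / Wikipedia / Twitter data
section = Dataset.from_dict(tokenizer(['שלום עולם', 'זוהי דוגמה'], truncation=True, max_length=32))
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# phase 1: 5 epochs at lr 1e-4; phase 2: 5 more epochs at lr 1e-5
for phase, lr in [(1, 1e-4), (2, 1e-5)]:
    args = TrainingArguments(
        output_dir=f'alephbert-section-phase{phase}',  # illustrative path
        num_train_epochs=5,
        learning_rate=lr,
        per_device_train_batch_size=8,                 # illustrative batch size
    )
    Trainer(model=model, args=args, data_collator=collator, train_dataset=section).train()
```
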
Total training time was 8 days.