aseker00 committed on
Commit f103098
1 Parent(s): 8b36f5b

Update readme.

Files changed (1)
  1. README.md +5 -7
README.md CHANGED
@@ -14,7 +14,8 @@ datasets:
 
 ## Hebrew Language Model
 
-State-of-the-art language model for Hebrew. Based on BERT.
+State-of-the-art language model for Hebrew.
+Based on Google's BERT architecture [(Devlin et al. 2018)](https://arxiv.org/abs/1810.04805).
 
 #### How to use
 
@@ -29,10 +30,9 @@ alephbert.eval()
 ```
 
 ## Training data
-
-- OSCAR (10G text, 20M sentences)
-- Wikipedia dump (0.6G text, 3M sentences)
-- Tweets (7G text, 70M sentences)
+1. OSCAR [(Ortiz, 2019)](https://oscar-corpus.com/) Hebrew section (10GB text, 20M sentences).
+2. Hebrew dump of [Wikipedia](https://dumps.wikimedia.org/hewiki/latest/) (650 MB text, 3.8M sentences).
+3. Hebrew Tweets collected from the Twitter sample stream (7G text, 70M sentences).
 
 ## Training procedure
 
@@ -49,6 +49,4 @@ Each section was trained for 5 epochs with an initial learning rate set to 1e-4.
 
 Total training time was 5 days.
 
-## Eval
-
 