avichr committed
Commit 67c05ff
1 Parent(s): 00db184

heBERT oscar based v0.1

Files changed (5)
  1. README.md +1 -1
  2. config.json +5 -5
  3. pytorch_model.bin +2 -2
  4. training_args.bin +1 -1
  5. vocab.txt +0 -0
README.md CHANGED
@@ -6,7 +6,7 @@ HeBERT is a Hebrew pretrained language model. It is based on Google's BERT archi
 
 ### HeBert was trained on three dataset:
 1. A Hebrew version of OSCAR [(Ortiz, 2019)](https://oscar-corpus.com/): ~9.8 GB of data, including 1 billion words and over 20.8 millions sentences.
-2. A Hebrew dump of Wikipedia: ~650 MB of data, including over 63 millions words and 3.8 millions sentences
+2. A Hebrew dump of [Wikipedia](https://dumps.wikimedia.org/hewiki/latest/): ~650 MB of data, including over 63 millions words and 3.8 millions sentences
 3. Emotion UGC data that was collected for the purpose of this study. (described below)
 We evaluated the model on emotion recognition and sentiment analysis, for a downstream tasks.
 
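As a quick arithmetic check of the corpus statistics quoted in the README hunk above, the stated word and sentence counts imply the following average sentence lengths (the counts are the README's; the derived figures are purely illustrative):

```python
# Corpus statistics as quoted in the README diff.
oscar_words, oscar_sents = 1_000_000_000, 20_800_000  # ~9.8 GB Hebrew OSCAR
wiki_words, wiki_sents = 63_000_000, 3_800_000        # ~650 MB Hebrew Wikipedia

# Implied average sentence length (words per sentence) for each corpus.
print(round(oscar_words / oscar_sents, 1))  # 48.1
print(round(wiki_words / wiki_sents, 1))    # 16.6
```

The much shorter average sentence in the Wikipedia dump is plausible given encyclopedic style, so the two sets of counts are at least internally consistent.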
config.json CHANGED
@@ -10,12 +10,12 @@
   "initializer_range": 0.02,
   "intermediate_size": 3072,
   "layer_norm_eps": 1e-12,
-  "max_position_embeddings": 514,
+  "max_position_embeddings": 512,
   "model_type": "bert",
   "num_attention_heads": 12,
-  "num_hidden_layers": 6,
+  "num_hidden_layers": 12,
   "pad_token_id": 0,
-  "total_flos": 5334805906902687744,
-  "type_vocab_size": 1,
-  "vocab_size": 52000
+  "total_flos": 6997313242916978688,
+  "type_vocab_size": 2,
+  "vocab_size": 30522
 }
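The post-commit values in this hunk can be sanity-checked programmatically. The sketch below wraps just the keys visible in the hunk into a standalone JSON object (the surrounding braces are added for illustration; the real config.json has more fields) and confirms the new shape matches original-BERT conventions: 12 layers, 12 heads, 512 positions, 2 token types, and the 30522-entry vocabulary of Google's original BERT, replacing the previous 6-layer, 514-position, 52000-vocab values:

```python
import json

# Post-commit config.json values, taken from the diff hunk above.
new_cfg = json.loads("""
{
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
""")

# BERT-base-style shape: 12 transformer layers with 12 attention heads each,
# and the standard 512-token position-embedding limit.
assert new_cfg["num_hidden_layers"] == 12
assert new_cfg["num_attention_heads"] == 12
assert new_cfg["max_position_embeddings"] == 512
assert new_cfg["vocab_size"] == 30522
```

The doubling of `num_hidden_layers` from 6 to 12 is consistent with the jump in `pytorch_model.bin` size recorded in the next file's diff.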
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:676f783f129da7e4c0b4c41714699c71d8b4c4ccb32cbfe6144cdc9320a10849
-size 334066887
+oid sha256:b219b9d76997d1933f01c362ae4fdc838600a5dcc5323869af1466959b74e6ed
+size 438146887
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:be8697550604f5396b096077b4e4ec135b7882886f91b64922c9b251a713f60d
+oid sha256:d028009bf2029df744e09caee6ef5c5fe830318d5e02b381011099c51f69c0d2
 size 1775
vocab.txt CHANGED
The diff for this file is too large to render.