avichr committed
Commit
3011131
1 Parent(s): 8d0eae4
Files changed (6)
  1. README.md +32 -0
  2. config.json +21 -0
  3. log_history.json +1 -0
  4. pytorch_model.bin +3 -0
  5. training_args.bin +3 -0
  6. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,32 @@
+ # THIS IS A BETA REPO
+ We will release a better one soon :)
+
+ <br><br>
+
+ ## HeBERT: Pre-trained BERT for Polarity Analysis and Emotion Recognition
+ HeBERT is a Hebrew pretrained language model. It is based on Google's BERT architecture and uses the BERT-Base configuration [(Devlin et al. 2018)](https://arxiv.org/abs/1810.04805). <br>
+
+ HeBERT was trained on three datasets:
+ 1. A Hebrew version of OSCAR [(Ortiz, 2019)](https://oscar-corpus.com/): ~9.8 GB of data, including 1 billion words and over 20.8 million sentences.
+ 2. A Hebrew dump of Wikipedia: ~650 MB of data, including over 63 million words and 3.8 million sentences.
+ 3. Emotion UGC data collected for the purpose of this study (described below).
+ We evaluated the model on emotion recognition and sentiment analysis as downstream tasks.
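A minimal sketch of loading the pretrained model for masked-token prediction with the Hugging Face `transformers` library is shown below. The model ID `avichr/heBERT` is an assumption inferred from the repository owner; substitute the actual Hub ID or a local path to this repository.

```python
# Minimal sketch: masked-token prediction with the pretrained HeBERT checkpoint.
# "avichr/heBERT" is an assumed model ID; replace it with the real Hub ID or a
# local path to this repository.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
model = AutoModelForMaskedLM.from_pretrained("avichr/heBERT")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# "the weather today is very [MASK]" -- an arbitrary Hebrew example sentence
masked = f"מזג האוויר היום {tokenizer.mask_token} מאוד."
print(fill_mask(masked))
```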
+
+ ### Emotion UGC Data Description
+ Our User Generated Content (UGC) consists of comments on articles collected from 3 major news sites between January 2020 and August 2020. The total data size is ~150 MB, including over 7 million words and 350K sentences.
+ 4,000 sentences were annotated by crowd members (3-10 annotators per sentence) for 8 emotions (anger, disgust, expectation, fear, happiness, sadness, surprise and trust) and for overall sentiment/polarity.<br>
+ To validate the annotation, we measured agreement between raters on the emotions in each sentence using Krippendorff's alpha [(Krippendorff, 1970)](https://journals.sagepub.com/doi/pdf/10.1177/001316447003000105) and kept only sentences with alpha > 0.7. Note that while we found general agreement between raters on emotions such as happiness, trust and disgust, a few emotions showed general disagreement, apparently due to the difficulty of identifying them in text (e.g. expectation and surprise).
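The agreement filter described above can be sketched with the third-party `krippendorff` Python package. The data layout (one binary rater-by-emotion matrix per sentence) and the helper function are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch of the alpha > 0.7 filter. Assumes one reliability matrix
# per sentence with shape (num_raters, num_emotions), binary labels, and np.nan
# for missing ratings. Requires: pip install krippendorff numpy
import numpy as np
import krippendorff

def keep_reliable_sentences(sentences, rating_matrices, threshold=0.7):
    """Return the sentences whose inter-rater agreement exceeds the threshold."""
    kept = []
    for sentence, matrix in zip(sentences, rating_matrices):
        alpha = krippendorff.alpha(reliability_data=matrix,
                                   level_of_measurement="nominal")
        if alpha > threshold:
            kept.append(sentence)
    return kept

# Example: 3 raters x 8 emotions for a single hypothetical sentence
ratings = np.array([[1, 0, 0, 0, 1, 0, 0, 1],
                    [1, 0, 0, 0, 1, 0, 0, 1],
                    [1, 0, np.nan, 0, 1, 0, 0, 1]], dtype=float)
print(keep_reliable_sentences(["example sentence"], [ratings]))
```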
+
+ ### Performance
+ #### Sentiment analysis
+ | | precision | recall | f1-score |
+ |--------------|-----------|--------|----------|
+ | 0 | 0.95 | 0.95 | 0.95 |
+ | 1 | 0.90 | 0.90 | 0.90 |
+ | accuracy | | | 0.93 |
+ | macro avg | 0.92 | 0.93 | 0.92 |
+ | weighted avg | 0.93 | 0.93 | 0.93 |
+
+ Trained on the [Amram, Ben-David and Tsarfaty (2018) dataset](https://github.com/omilab/Neural-Sentiment-Analyzer-for-Modern-Hebrew) and on our own dataset (to be published soon);
+ evaluated on the test set of Amram et al.
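For the downstream sentiment task, a minimal inference sketch with `transformers` is given below. The model ID `avichr/heBERT_sentiment_analysis` is an assumption (this beta repository's actual Hub ID may differ); point `from_pretrained` at the correct ID or at a local clone of this repository.

```python
# Minimal sketch: run the fine-tuned sentiment classifier.
# "avichr/heBERT_sentiment_analysis" is an assumed model ID; replace it with the
# real Hub ID or a local path containing config.json, pytorch_model.bin, vocab.txt.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_id = "avichr/heBERT_sentiment_analysis"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(classifier("אני אוהב את המסעדה הזאת"))  # -> [{'label': ..., 'score': ...}]
```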
config.json ADDED
@@ -0,0 +1,21 @@
+ {
+ "architectures": [
+ "BertForSequenceClassification"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "gradient_checkpointing": false,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 768,
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "layer_norm_eps": 1e-12,
+ "max_position_embeddings": 514,
+ "model_type": "bert",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 6,
+ "pad_token_id": 0,
+ "total_flos": 9596594117426400,
+ "type_vocab_size": 1,
+ "vocab_size": 52000
+ }
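For reference, the configuration above can be loaded directly with `transformers`; the sketch below assumes a local clone of this repository (the file path is a placeholder). Note that the config declares 6 hidden layers and a 52,000-token vocabulary.

```python
# Sketch: build the model skeleton from this config.json using transformers.
# "config.json" is assumed to be a local copy from this repository; the model
# created here has randomly initialised weights until a checkpoint is loaded.
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig.from_json_file("config.json")
print(config.num_hidden_layers, config.hidden_size, config.vocab_size)  # 6 768 52000
model = BertForSequenceClassification(config)
```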
log_history.json ADDED
@@ -0,0 +1 @@
+ []
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e6af80595db107ad7e9b4d7bdfc78cd27cff303e51924ea7ea40de8668a31ac2
+ size 333858111
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5bb1827cfb9c0ebf5cea54488dbf7f4e2d63ce2d8245646997d7fed0daea5a49
+ size 1839
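Both `pytorch_model.bin` and `training_args.bin` are committed as Git LFS pointer files (the `version`/`oid`/`size` stubs above), so the actual binaries must be fetched with `git lfs pull` or downloaded from the Hub before use. A hedged sketch of inspecting them locally, assuming the real files are in the current directory:

```python
# Sketch: inspect the fetched binaries (not the LFS pointer stubs).
# Assumes transformers is installed, since training_args.bin is typically a
# pickled TrainingArguments object. weights_only (PyTorch >= 1.13) set to False
# unpickles arbitrary objects, so only do this for files you trust.
import torch

state_dict = torch.load("pytorch_model.bin", map_location="cpu")
print(f"{len(state_dict)} tensors in the checkpoint")

training_args = torch.load("training_args.bin", map_location="cpu", weights_only=False)
print(training_args)
```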
vocab.txt ADDED
The diff for this file is too large to render. See raw diff