codewithkyrian/heBERT

codewithkyrian committed on Aug 21

Commit b549793 • 1 Parent(s): e07db4c

Upload README.md with huggingface_hub

README.md +104 -0

	@@ -0,0 +1,104 @@

+---
+library_name: Transformers PHP
+tags:
+- onnx
+---
+This repository provides [avichr/heBERT](https://huggingface.co/avichr/heBERT) with ONNX weights, for compatibility with Transformers PHP.
+## HeBERT: Pre-trained BERT for Polarity Analysis and Emotion Recognition
+HeBERT is a Hebrew pretrained language model. It is based on Google's BERT architecture and uses the BERT-Base configuration [(Devlin et al. 2018)](https://arxiv.org/abs/1810.04805). <br>
+### HeBERT was trained on three datasets:
+1. A Hebrew version of OSCAR [(Ortiz, 2019)](https://oscar-corpus.com/): ~9.8 GB of data, including 1 billion words and over 20.8 million sentences.
+2. A Hebrew dump of [Wikipedia](https://dumps.wikimedia.org/hewiki/latest/): ~650 MB of data, including over 63 million words and 3.8 million sentences.
+3. Emotion UGC data that was collected for the purpose of this study (described below).
+We evaluated the model on two downstream tasks: emotion recognition and sentiment analysis.
+### Emotion UGC Data Description
+Our User Generated Content (UGC) consists of comments written on articles collected from 3 major news sites between January 2020 and August 2020. The total data size is ~150 MB, including over 7 million words and 350K sentences.
+4,000 sentences were annotated by crowd members (3-10 annotators per sentence) for 8 emotions (anger, disgust, expectation, fear, happy, sadness, surprise and trust) and for overall sentiment / polarity.<br>
+To validate the annotation, we measured the agreement between raters on the emotions in each sentence using Krippendorff's alpha [(Krippendorff, 1970)](https://journals.sagepub.com/doi/pdf/10.1177/001316447003000105). We kept sentences with alpha > 0.7. Note that while we found general agreement between raters on emotions such as happy, trust and disgust, a few emotions showed general disagreement, apparently because of the difficulty of identifying them in text (e.g. expectation and surprise).
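The agreement filter described above can be illustrated with a minimal sketch of Krippendorff's alpha for nominal labels. The function name, the layout of the ratings (one row per emotion, one 0/1 vote per annotator) and the sample votes are assumptions made for illustration; the text does not say how the units were organized or which implementation was used.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists; each inner list holds the labels that
    different raters gave to one unit (missing ratings are simply omitted).
    """
    coincidences = Counter()  # weighted counts of ordered label pairs (c, k)
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit rated only once carries no agreement information
        for c, k in permutations(ratings, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)

    totals = Counter()  # marginal total per label
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more ratings")

    # Observed and expected disagreement (any two different labels disagree).
    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        # Only one label was ever used; alpha is formally undefined here.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return 1.0 - observed / expected


# Made-up votes for one sentence: one row per emotion, one 0/1 vote per annotator.
sentence_ratings = [
    [1, 1, 1],  # anger
    [0, 0, 0],  # disgust
    [0, 0, 1],  # surprise
    [0, 0, 0],  # trust
]
alpha = krippendorff_alpha_nominal(sentence_ratings)
keep_sentence = alpha > 0.7  # threshold stated in the text above
print(round(alpha, 3), keep_sentence)
```

A single dissenting vote on one emotion is enough to pull this toy sentence below the 0.7 threshold, which shows how strict the filter can be when there are few annotators.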
+## How to use
+### For masked-LM model (can be fine-tuned to any downstream task)
+```
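+from transformers import AutoTokenizer, AutoModel, pipeline
+# Load the tokenizer and base model directly (e.g. as a starting point for fine-tuning)
+tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT")
+model = AutoModel.from_pretrained("avichr/heBERT")
+# Or predict the masked token with the fill-mask pipeline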
+fill_mask = pipeline(
+    "fill-mask",
+    model="avichr/heBERT",
+    tokenizer="avichr/heBERT"
+)
+fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")
+```
+### For sentiment classification model (polarity ONLY):
+```
+from transformers import AutoTokenizer, AutoModel, pipeline
+tokenizer = AutoTokenizer.from_pretrained("avichr/heBERT_sentiment_analysis") #same as 'avichr/heBERT' tokenizer
+model = AutoModel.from_pretrained("avichr/heBERT_sentiment_analysis")
+# how to use?
+sentiment_analysis = pipeline(
+    "sentiment-analysis",
+    model="avichr/heBERT_sentiment_analysis",
+    tokenizer="avichr/heBERT_sentiment_analysis",
+    return_all_scores = True
+)
+>>>  sentiment_analysis('אני מתלבט מה לאכול לארוחת צהריים')
+[[{'label': 'natural', 'score': 0.9978172183036804},
+{'label': 'positive', 'score': 0.0014792329166084528},
+{'label': 'negative', 'score': 0.0007035882445052266}]]
+>>>  sentiment_analysis('קפה זה טעים')
+[[{'label': 'natural', 'score': 0.00047328314394690096},
+{'label': 'possitive', 'score': 0.9994067549705505},
+{'label': 'negetive', 'score': 0.00011996887042187154}]]
+>>>  sentiment_analysis('אני לא אוהב את העולם')
+[[{'label': 'natural', 'score': 9.214012970915064e-05},
+{'label': 'possitive', 'score': 8.876807987689972e-05},
+{'label': 'negetive', 'score': 0.9998190999031067}]]
+```
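The nested list returned above can be reduced to one label per input text. The following is a minimal sketch that needs no model download; the helper name `top_sentiment` and the spelling map are hypothetical, and reading `natural` as `neutral` is an assumption based on the three-class polarity setup.

```python
# Hypothetical map from the label spellings seen in the sample outputs above
# to canonical names. Treating 'natural' as 'neutral' is an assumption.
CANONICAL_LABELS = {
    "natural": "neutral",
    "positive": "positive",
    "possitive": "positive",
    "negative": "negative",
    "negetive": "negative",
}


def top_sentiment(pipeline_output):
    """Return (label, score) of the highest-scoring class for each input text."""
    results = []
    for scores in pipeline_output:
        best = max(scores, key=lambda item: item["score"])
        # Unknown spellings are passed through unchanged.
        label = CANONICAL_LABELS.get(best["label"], best["label"])
        results.append((label, best["score"]))
    return results


# Output copied from the second example above.
sample = [[{'label': 'natural', 'score': 0.00047328314394690096},
           {'label': 'possitive', 'score': 0.9994067549705505},
           {'label': 'negetive', 'score': 0.00011996887042187154}]]
print(top_sentiment(sample))
```

Passing unknown labels through unchanged keeps the helper usable if the model's label names differ from the ones shown here.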
+Our model is also available on AWS! For more information, visit [AWS' git](https://github.com/aws-samples/aws-lambda-docker-serverless-inference/tree/main/hebert-sentiment-analysis-inference-docker-lambda).
+### For NER model:
+```
+from transformers import pipeline
+# how to use?
+NER = pipeline(
+    "token-classification",
+    model="avichr/heBERT_NER",
+    tokenizer="avichr/heBERT_NER",
+)
+NER('דויד לומד באוניברסיטה העברית שבירושלים')
+```
+## Stay tuned!
+We are still working on our model and will edit this page as we progress.<br>
+Note that we have released only sentiment analysis (polarity) at this point; emotion detection will be released later on.<br>
+Our git: https://github.com/avichaychriqui/HeBERT
+## If you use this model, please cite us as:
+Chriqui, A., & Yahav, I. (2022). HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition. INFORMS Journal on Data Science, forthcoming.
+```
+@article{chriqui2021hebert,
+  title={HeBERT \& HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition},
+  author={Chriqui, Avihay and Yahav, Inbal},
+  journal={INFORMS Journal on Data Science},
+  year={2022}
+}
+```
+---
+Note: Having a separate repo for ONNX weights is intended to be a temporary solution until WebML gains more traction. If you would like to make your models web-ready, we recommend converting to ONNX using [🤗 Optimum](https://huggingface.co/docs/optimum/index) and structuring your repo like this one (with ONNX weights located in a subfolder named `onnx`).