Initialize

Files changed (8)

README.md +90 -0
config.json +70 -0
pytorch_model.bin +3 -0
special_tokens_map.json +7 -0
tf_model.h5 +3 -0
tokenizer.json +0 -0
tokenizer_config.json +13 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,90 @@

+---
+language: fa
+---
+# DistilbertNER
+This model is fine-tuned for the Named Entity Recognition (NER) task on a mixed NER dataset collected from [ARMAN](https://github.com/HaniehP/PersianNER), [PEYMA](http://nsurl.org/2019-2/tasks/task-7-named-entity-recognition-ner-for-farsi/), and [WikiANN](https://elisa-ie.github.io/wikiann/), covering ten entity types:
+- Date (DAT)
+- Event (EVE)
+- Facility (FAC)
+- Location (LOC)
+- Money (MON)
+- Organization (ORG)
+- Percent (PCT)
+- Person (PER)
+- Product (PRO)
+- Time (TIM)
+## Dataset Information
+|       |   Records |   B-DAT |   B-EVE |   B-FAC |   B-LOC |   B-MON |   B-ORG |   B-PCT |   B-PER |   B-PRO |   B-TIM |   I-DAT |   I-EVE |   I-FAC |   I-LOC |   I-MON |   I-ORG |   I-PCT |   I-PER |   I-PRO |   I-TIM |
+|:------|----------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
+| Train |     29133 |    1423 |    1487 |    1400 |   13919 |     417 |   15926 |     355 |   12347 |    1855 |     150 |    1947 |    5018 |    2421 |    4118 |    1059 |   19579 |     573 |    7699 |    1914 |     332 |
+| Valid |      5142 |     267 |     253 |     250 |    2362 |     100 |    2651 |      64 |    2173 |     317 |      19 |     373 |     799 |     387 |     717 |     270 |    3260 |     101 |    1382 |     303 |      35 |
+| Test  |      6049 |     407 |     256 |     248 |    2886 |      98 |    3216 |      94 |    2646 |     318 |      43 |     568 |     888 |     408 |     858 |     263 |    3967 |     141 |    1707 |     296 |      78 |
+## Evaluation
+The following tables summarize the scores obtained by the model, overall and per class.
+**Overall**
+|    Model   | accuracy | precision |  recall  |    f1    |
+|:----------:|:--------:|:---------:|:--------:|:--------:|
+| Distilbert | 0.994534 |  0.946326 |  0.95504 | 0.950663 |
+**Per entity**
+|     	| number 	| precision 	|  recall  	|    f1    	|
+|:---:	|:------:	|:---------:	|:--------:	|:--------:	|
+| DAT 	|   407  	|  0.812048 	| 0.828010 	| 0.819951 	|
+| EVE 	|   256  	|  0.955056 	| 0.996094 	| 0.975143 	|
+| FAC 	|   248  	|  0.972549 	| 1.000000 	| 0.986083 	|
+| LOC 	|  2884  	|  0.968403 	| 0.967060 	| 0.967731 	|
+| MON 	|   98   	|  0.925532 	| 0.887755 	| 0.906250 	|
+| ORG 	|  3216  	|  0.932095 	| 0.951803 	| 0.941846 	|
+| PCT 	|   94   	|  0.936842 	| 0.946809 	| 0.941799 	|
+| PER 	|  2645  	|  0.959818 	| 0.957278 	| 0.958546 	|
+| PRO 	|   318  	|  0.963526 	| 0.996855 	| 0.979907 	|
+| TIM 	|   43   	|  0.760870 	| 0.813953 	| 0.786517 	|
+## How To Use
+You can use this model with the Transformers pipeline for NER.
+### Installing requirements
+```bash
+pip install transformers
+```
+### How to predict using pipeline
+```python
+from transformers import AutoTokenizer
+from transformers import AutoModelForTokenClassification  # for PyTorch
+from transformers import TFAutoModelForTokenClassification  # for TensorFlow
+from transformers import pipeline
+model_name_or_path = "HooshvareLab/distilbert-fa-zwnj-base-ner"
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+model = AutoModelForTokenClassification.from_pretrained(model_name_or_path)  # PyTorch
+# model = TFAutoModelForTokenClassification.from_pretrained(model_name_or_path)  # TensorFlow
+nlp = pipeline("ner", model=model, tokenizer=tokenizer)
+example = "در سال ۲۰۱۳ درگذشت و آندرتیکر و کین برای او مراسم یادبود گرفتند."
+ner_results = nlp(example)
+print(ner_results)
+```
+## Questions?
+Post a GitHub issue on the [ParsNER Issues](https://github.com/hooshvare/parsner/issues) page.
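Reviewer note on the README usage example: the `ner` pipeline returns one prediction per token, tagged with the `B-`/`I-` labels listed in `config.json`. The sketch below shows one way to merge such tokens into entity spans. It assumes each prediction is a dict with `word` and `entity` keys, as the pipeline produces, and it uses a hand-written sample rather than real model output. The helper name `group_entities` is hypothetical, and subword pieces (`##...`) are not handled. Recent Transformers releases also offer built-in grouping through the pipeline's `aggregation_strategy` argument, which is likely the better choice in practice.

```python
from typing import Dict, List, Tuple


def group_entities(tokens: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Merge per-token BIO predictions into (entity_type, text) spans."""
    spans: List[Tuple[str, List[str]]] = []
    current_type = None  # entity type of the span being built, if any
    for tok in tokens:
        label = tok["entity"]
        if label == "O":
            # Outside any entity: close the current span.
            current_type = None
            continue
        prefix, _, ent_type = label.partition("-")
        if prefix == "B" or ent_type != current_type:
            # A B- tag, or an I- tag of a different type, starts a new span.
            spans.append((ent_type, [tok["word"]]))
        else:
            # An I- tag of the same type continues the current span.
            spans[-1][1].append(tok["word"])
        current_type = ent_type
    return [(ent_type, " ".join(words)) for ent_type, words in spans]


# Hand-written sample for illustration only, not real model output.
sample = [
    {"word": "سال", "entity": "B-DAT"},
    {"word": "۲۰۱۳", "entity": "I-DAT"},
    {"word": "و", "entity": "O"},
    {"word": "کین", "entity": "B-PER"},
]
print(group_entities(sample))
```

Note that the pipeline drops `O` tokens by default, so two adjacent entities of the same type can only be told apart by their `B-` tags or by token offsets.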

config.json ADDED

	@@ -0,0 +1,70 @@

+{
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertForTokenClassification"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "finetuning_task": "ner",
+  "hidden_dim": 3072,
+  "id2label": {
+    "0": "O",
+    "1": "B-DAT",
+    "2": "B-EVE",
+    "3": "B-FAC",
+    "4": "B-LOC",
+    "5": "B-MON",
+    "6": "B-ORG",
+    "7": "B-PCT",
+    "8": "B-PER",
+    "9": "B-PRO",
+    "10": "B-TIM",
+    "11": "I-DAT",
+    "12": "I-EVE",
+    "13": "I-FAC",
+    "14": "I-LOC",
+    "15": "I-MON",
+    "16": "I-ORG",
+    "17": "I-PCT",
+    "18": "I-PER",
+    "19": "I-PRO",
+    "20": "I-TIM"
+  },
+  "initializer_range": 0.02,
+  "label2id": {
+    "B-DAT": 1,
+    "B-EVE": 2,
+    "B-FAC": 3,
+    "B-LOC": 4,
+    "B-MON": 5,
+    "B-ORG": 6,
+    "B-PCT": 7,
+    "B-PER": 8,
+    "B-PRO": 9,
+    "B-TIM": 10,
+    "I-DAT": 11,
+    "I-EVE": 12,
+    "I-FAC": 13,
+    "I-LOC": 14,
+    "I-MON": 15,
+    "I-ORG": 16,
+    "I-PCT": 17,
+    "I-PER": 18,
+    "I-PRO": 19,
+    "I-TIM": 20,
+    "O": 0
+  },
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "output_past": true,
+  "pad_token_id": 0,
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "transformers_version": "4.5.0.dev0",
+  "vocab_size": 42000
+}
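Reviewer note on the label mappings: `id2label` and `label2id` should be exact inverses, and JSON stores the `id2label` keys as strings. The sketch below checks that property and maps class ids back to tag names. It embeds only a three-label excerpt of the config for brevity, and the `predicted_ids` list is a hypothetical stand-in for a model's argmax output.

```python
import json

# Excerpt of the config.json above (three of the 21 labels, for brevity).
config_text = """
{
  "id2label": {"0": "O", "1": "B-DAT", "11": "I-DAT"},
  "label2id": {"B-DAT": 1, "I-DAT": 11, "O": 0}
}
"""

config = json.loads(config_text)

# JSON object keys are always strings, so convert them back to integers.
id2label = {int(i): label for i, label in config["id2label"].items()}
label2id = config["label2id"]

# The two mappings should be exact inverses of each other.
assert len(id2label) == len(label2id)
assert all(label2id[label] == i for i, label in id2label.items())

# Map predicted class ids to tag names (hypothetical argmax output).
predicted_ids = [0, 1, 11, 0]
tags = [id2label[i] for i in predicted_ids]
print(tags)
```

Running the same check against the full file would catch a label that appears in one mapping but not the other.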

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:851d3497622637fc228341fb644311988fee7c48011174e86599ecda829862ae
+size 300818949
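Reviewer note: the three added lines are a Git LFS pointer, not the weights themselves. The pointer records the spec version, a SHA-256 object id, and the size in bytes of the real file. A minimal sketch of reading those fields follows; the helper name `parse_lfs_pointer` is hypothetical and the parsing is deliberately simplistic.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is a key, a single space, then the value.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the pytorch_model.bin diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:851d3497622637fc228341fb644311988fee7c48011174e86599ecda829862ae
size 300818949
"""

info = parse_lfs_pointer(pointer)
print(f"{info['hash_algo']} object, about {info['size'] / 1e6:.0f} MB")
```

After downloading the real file, its SHA-256 digest and byte size can be compared against these values to confirm the download is intact.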

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+    "unk_token": "[UNK]",
+    "sep_token": "[SEP]",
+    "pad_token": "[PAD]",
+    "cls_token": "[CLS]",
+    "mask_token": "[MASK]"
+}

tf_model.h5 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3f5ef9e2e137ae4d41c50ef03eeb33b02c9274653e22d251386897f7e7604052
+size 300903008

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,13 @@

+{
+    "do_lower_case": false,
+    "unk_token": "[UNK]",
+    "sep_token": "[SEP]",
+    "pad_token": "[PAD]",
+    "cls_token": "[CLS]",
+    "mask_token": "[MASK]",
+    "tokenize_chinese_chars": true,
+    "strip_accents": false,
+    "model_max_length": 512,
+    "special_tokens_map_file": null,
+    "name_or_path": "HooshvareLab/distilbert-fa-zwnj-base"
+}

vocab.txt ADDED

The diff for this file is too large to render. See raw diff