Initial model

Files changed (9)

README.md +87 -0
config.json +63 -0
eval_results.json +12 -0
pytorch_model.bin +3 -0
special_tokens_map.json +1 -0
tf_model.h5 +3 -0
tokenizer.json +0 -0
tokenizer_config.json +1 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,87 @@

+---
+language: is
+license: apache-2.0
+widget:
+ - text: "Kristin manneskja getur ekki lagt frásagnir af Jesú Kristi á hilluna vegna þess að hún sé búin að lesa þær ."
+ - text: "Til hvers að kjósa flokk , sem þykist vera Jafnaðarmannaflokkur rétt fyrir kosningar , þegar að það er hægt að kjósa sannnan jafnaðarmannaflokk , sjálfan Jafnaðarmannaflokk Íslands - Samfylkinguna ."
+ - text: "Það sannaðist svo eftirminnilega á plötunni Það þarf fólk eins og þig sem kom út fyrir þremur árum , en á henni hann Fálka úr Keflavík og Gáluna , son sinn , til að útsetja lög hans og spila inn ."
+ - text: "Lögin hafa áður komið út sem aukalög á smáskífum af Hail to the Thief , en á disknum er líka myndband og fleira efni fyrir tölvur ."
+ - text: "Britney gerði honum viðvart og hann ók henni á UCLA-sjúkrahúsið í Santa Monica en það er í nágrenni hljóðversins ."
+---
+# IcelandicNER DistilBERT
+This model was fine-tuned on the MIM-GOLD-NER dataset for the Icelandic language.
+The [MIM-GOLD-NER](http://hdl.handle.net/20.500.12537/42) corpus was developed at [Reykjavik University](https://en.ru.is/) in 2018–2020 and covers eight entity types:
+- Date
+- Location
+- Miscellaneous
+- Money
+- Organization
+- Percent
+- Person
+- Time
+## Dataset Information
+|       |   Records |   B-Date |   B-Location |   B-Miscellaneous |   B-Money |   B-Organization |   B-Percent |   B-Person |   B-Time |   I-Date |   I-Location |   I-Miscellaneous |   I-Money |   I-Organization |   I-Percent |   I-Person |   I-Time |
+|:------|----------:|---------:|-------------:|------------------:|----------:|-----------------:|------------:|-----------:|---------:|---------:|-------------:|------------------:|----------:|-----------------:|------------:|-----------:|---------:|
+| Train |     39988 |     3409 |         5980 |              4351 |       729 |             5754 |         502 |      11719 |      868 |     2112 |          516 |              3036 |       770 |             2382 |          50 |       5478 |      790 |
+| Valid |      7063 |      570 |         1034 |               787 |       100 |             1078 |         103 |       2106 |      147 |      409 |           76 |               560 |       104 |              458 |           7 |        998 |      136 |
+| Test  |      8299 |      779 |         1319 |               935 |       153 |             1315 |         108 |       2247 |      172 |      483 |          104 |               660 |       167 |              617 |          10 |       1089 |      158 |
+## Evaluation
+The following table summarizes the precision, recall, and F1 score obtained by the model, overall and per class. The support values match the B- tag counts of the Test split above.
+|     entity    | precision |  recall  | f1-score | support |
+|:-------------:|:---------:|:--------:|:--------:|:-------:|
+|      Date     |  0.969309 | 0.973042 | 0.971172 |  779.0  |
+|    Location   |  0.941221 | 0.946929 | 0.944067 |  1319.0 |
+| Miscellaneous |  0.848283 | 0.819251 | 0.833515 |  935.0  |
+|     Money     |  0.928571 | 0.934641 | 0.931596 |  153.0  |
+|  Organization |  0.874147 | 0.876806 | 0.875475 |  1315.0 |
+|    Percent    |  1.000000 | 1.000000 | 1.000000 |  108.0  |
+|     Person    |  0.956674 | 0.972853 | 0.964695 |  2247.0 |
+|      Time     |  0.965318 | 0.970930 | 0.968116 |  172.0  |
+|   micro avg   |  0.926110 | 0.929141 | 0.927623 |  7028.0 |
+|   macro avg   |  0.935441 | 0.936807 | 0.936079 |  7028.0 |
+|  weighted avg |  0.925578 | 0.929141 | 0.927301 |  7028.0 |
+## How To Use
+You can use this model with the Transformers pipeline for NER.
+### Installing requirements
+```bash
+pip install transformers
+```
+### How to predict using the pipeline
+```python
+from transformers import AutoTokenizer
+from transformers import AutoModelForTokenClassification  # for PyTorch
+from transformers import TFAutoModelForTokenClassification  # for TensorFlow
+from transformers import pipeline
+model_name_or_path = "m3hrdadfi/icelandic-ner-distilbert"
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+model = AutoModelForTokenClassification.from_pretrained(model_name_or_path)  # PyTorch
+# model = TFAutoModelForTokenClassification.from_pretrained(model_name_or_path)  # TensorFlow
+# Build a token-classification (NER) pipeline from the loaded model and tokenizer
+nlp = pipeline("ner", model=model, tokenizer=tokenizer)
+example = "Kristin manneskja getur ekki lagt frásagnir af Jesú Kristi á hilluna vegna þess að hún sé búin að lesa þær ."
+ner_results = nlp(example)
+print(ner_results)
+```
+## Questions?
+Post a GitHub issue on the [IcelandicNER Issues](https://github.com/m3hrdadfi/icelandic-ner/issues) page.
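
Note on the README usage example: with the call shown, the `ner` pipeline returns one prediction per token, each carrying a `B-` or `I-` prefixed label. The sketch below shows one way such tags could be merged into entity spans. It is a minimal illustration in plain Python, not the author's post-processing; the function name `group_bio_tags` and the example tags are hypothetical (hand-written, not actual model output). Recent Transformers releases also expose an `aggregation_strategy` argument on the pipeline that serves a similar purpose.

```python
from typing import List, Tuple


def group_bio_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge per-token BIO tags into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    for token, tag in zip(tokens, tags):
        # "B-Person" -> ("B", "Person"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")

        # An I- tag of the same type continues the open span.
        if prefix == "I" and label == current_type:
            current_tokens.append(token)
            continue

        # Anything else closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

        if prefix in ("B", "I"):
            # A stray I- tag without a matching B- is treated as a new span.
            current_type, current_tokens = label, [token]
        else:
            current_type, current_tokens = None, []

    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical tags written by hand for illustration only.
tokens = ["Britney", "gerði", "honum", "viðvart", "og", "hann", "ók",
          "henni", "á", "UCLA-sjúkrahúsið", "í", "Santa", "Monica"]
tags = ["B-Person", "O", "O", "O", "O", "O", "O",
        "O", "O", "B-Organization", "O", "B-Location", "I-Location"]

spans = group_bio_tags(tokens, tags)
print(spans)
```

The function works on whole words; the real pipeline output is per subword token, so word pieces would need to be rejoined before or during grouping.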

config.json ADDED

	@@ -0,0 +1,63 @@

+{
+  "_name_or_path": "distilbert-base-multilingual-cased",
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertForTokenClassification"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "finetuning_task": "ner",
+  "hidden_dim": 3072,
+  "id2label": {
+    "0": "O",
+    "1": "B-Date",
+    "2": "B-Location",
+    "3": "B-Miscellaneous",
+    "4": "B-Money",
+    "5": "B-Organization",
+    "6": "B-Percent",
+    "7": "B-Person",
+    "8": "B-Time",
+    "9": "I-Date",
+    "10": "I-Location",
+    "11": "I-Miscellaneous",
+    "12": "I-Money",
+    "13": "I-Organization",
+    "14": "I-Percent",
+    "15": "I-Person",
+    "16": "I-Time"
+  },
+  "initializer_range": 0.02,
+  "label2id": {
+    "B-Date": 1,
+    "B-Location": 2,
+    "B-Miscellaneous": 3,
+    "B-Money": 4,
+    "B-Organization": 5,
+    "B-Percent": 6,
+    "B-Person": 7,
+    "B-Time": 8,
+    "I-Date": 9,
+    "I-Location": 10,
+    "I-Miscellaneous": 11,
+    "I-Money": 12,
+    "I-Organization": 13,
+    "I-Percent": 14,
+    "I-Person": 15,
+    "I-Time": 16,
+    "O": 0
+  },
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "output_past": true,
+  "pad_token_id": 0,
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "transformers_version": "4.7.0.dev0",
+  "vocab_size": 119547
+}
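
Note on the label maps above: `id2label` and `label2id` follow a regular layout, with `O` at index 0, the eight `B-` tags at 1 to 8, and the eight `I-` tags at 9 to 16, each group in alphabetical order. The sketch below rebuilds both maps from the entity list and checks that they are inverses. It is an illustration only; the variable names are assumptions, and the JSON file stores the `id2label` keys as strings where this sketch uses integers.

```python
# Entity types as listed in the README, in alphabetical order.
ENTITY_TYPES = [
    "Date", "Location", "Miscellaneous", "Money",
    "Organization", "Percent", "Person", "Time",
]

# "O" first, then all B- tags, then all I- tags, matching config.json.
labels = ["O"] + [f"B-{e}" for e in ENTITY_TYPES] + [f"I-{e}" for e in ENTITY_TYPES]

id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

# The two maps should be exact inverses of each other.
round_trip_ok = all(label2id[id2label[i]] == i for i in id2label)
print(len(labels), round_trip_ok)
```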

eval_results.json ADDED

	@@ -0,0 +1,12 @@

+{
+    "epoch": 5.0,
+    "eval_accuracy": 0.9920402695963004,
+    "eval_f1": 0.8937531570971544,
+    "eval_loss": 0.04653976112604141,
+    "eval_precision": 0.8916512682681002,
+    "eval_recall": 0.8958649789029536,
+    "eval_runtime": 26.3373,
+    "eval_samples": 7063,
+    "eval_samples_per_second": 268.175,
+    "eval_steps_per_second": 16.782
+}
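
Note on the metrics above: the reported `eval_f1` appears to be the harmonic mean of `eval_precision` and `eval_recall`, and `eval_samples_per_second` appears to be `eval_samples` divided by `eval_runtime`. The short check below recomputes both from the values in the file. It is a sanity-check sketch, not the evaluation script that produced these numbers.

```python
# Values copied from eval_results.json.
eval_precision = 0.8916512682681002
eval_recall = 0.8958649789029536
eval_f1 = 0.8937531570971544
eval_runtime = 26.3373
eval_samples = 7063
eval_samples_per_second = 268.175

# F1 as the harmonic mean of precision and recall.
f1 = 2 * eval_precision * eval_recall / (eval_precision + eval_recall)

# Throughput as samples divided by wall-clock runtime in seconds.
throughput = eval_samples / eval_runtime

print(f"recomputed F1: {f1:.6f}")
print(f"recomputed samples/s: {throughput:.3f}")
```

These validation scores (7063 samples) are lower than the test-set averages in the README table, which are computed on a different split.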

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:420497d1b21b5eee71dca7853173c0671c2d21cefc2ed683019e6c6a9926448c
+size 539031045

special_tokens_map.json ADDED

	@@ -0,0 +1 @@


+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

tf_model.h5 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4e1e7f8e58759a90db57f579eb6d760d2b750876187cdcf47261e08db31a37d4
+size 539114912

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1 @@


+{"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "distilbert-base-multilingual-cased"}

vocab.txt ADDED

The diff for this file is too large to render.