initial commit

Files changed (7)

README.md +48 -0
config.json +35 -0
pytorch_model.bin +3 -0
sentencepiece.bpe.model +3 -0
special_tokens_map.json +1 -0
tf_model.h5 +3 -0
tokenizer_config.json +1 -0

README.md ADDED

	@@ -0,0 +1,48 @@

+---
+language:
+- da
+tags:
+- ned
+- xlm-roberta
+- pytorch
+- transformers
+license: cc-by-sa-4.0
+datasets:
+- DaNED
+- DaWikiNED
+metrics:
+- f1
+---
+# XLM-Roberta fine-tuned for Named Entity Disambiguation
+Given a sentence and a knowledge graph context, the model detects whether a specific entity (represented by the knowledge graph context) is mentioned in the sentence (binary classification).
+The base language model is [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
+Here is how to use the model:
+```python
+from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification
+model = XLMRobertaForSequenceClassification.from_pretrained("DaNLP/da-xlmr-ned")
+tokenizer = XLMRobertaTokenizer.from_pretrained("DaNLP/da-xlmr-ned")
+```
+The tokenizer takes two strings as input: the sentence and the knowledge graph (KG) context.
+Here is an example:
+```python
+sentence = "Karen Blixen vendte tilbage til Danmark, hvor hun boede resten af sit liv på Rungstedlund, som hun arvede efter sin mor i 1939"
+kg_context = "udmærkelser modtaget Kritikerprisen udmærkelser modtaget Tagea Brandts Rejselegat udmærkelser modtaget Ingenio et arti udmærkelser modtaget Holbergmedaljen udmærkelser modtaget De Gyldne Laurbær mor Ingeborg Dinesen ægtefælle Bror von Blixen-Finecke køn kvinde Commons-kategori Karen Blixen LCAuth no95003722 VIAF 90663542 VIAF 121643918 GND-identifikator 118637878 ISNI 0000 0001 2096 6265 ISNI 0000 0003 6863 4408 ISNI 0000 0001 1891 0457 fødested Rungstedlund fødested Rungsted dødssted Rungstedlund dødssted København statsborgerskab Danmark NDL-nummer 00433530 dødsdato +1962-09-07T00:00:00Z dødsdato +1962-01-01T00:00:00Z fødselsdato +1885-04-17T00:00:00Z fødselsdato +1885-01-01T00:00:00Z AUT NKC jn20000600905 AUT NKC jo2015880827 AUT NKC xx0196181 emnets hovedkategori Kategori:Karen Blixen tilfælde af menneske billede Karen Blixen cropped from larger original.jpg IMDb-identifikationsnummer nm0227598 Freebase-ID /m/04ymd8w BNF 118857710 beskæftigelse skribent beskæftigelse selvbiograf beskæftigelse novelleforfatter ..."
+```
+A KG context, for a specific entity, can be generated from its Wikidata page.
+In the previous example, the KG context is a string representation of the Wikidata page of [Karen Blixen (QID=Q182804)](https://www.wikidata.org/wiki/Q182804).
+See the [DaNLP documentation](https://danlp-alexandra.readthedocs.io/en/latest/docs/tasks/ned.html#xlmr) for more details about how to generate a KG context.
+## Training Data
+The model has been trained on the [DaNED](https://danlp-alexandra.readthedocs.io/en/latest/docs/datasets.html#daned) and [DaWikiNED](https://danlp-alexandra.readthedocs.io/en/latest/docs/datasets.html#dawikined) datasets.
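
The README loads the model and tokenizer and builds the two input strings, but stops before the forward pass. A minimal sketch of that step follows, assuming the standard `transformers` sentence-pair call. The function name `is_entity_mentioned`, the choice to truncate to 512 tokens (taken from `model_max_length` in `tokenizer_config.json`), and the argmax decision rule are assumptions of this sketch, not something this commit specifies.

```python
def is_entity_mentioned(sentence, kg_context, model_name="DaNLP/da-xlmr-ned"):
    """Return the predicted label for one (sentence, KG context) pair.

    Hypothetical helper: the imports are local so that defining the function
    does not require the libraries or trigger a model download.
    """
    import torch
    from transformers import (
        XLMRobertaForSequenceClassification,
        XLMRobertaTokenizer,
    )

    tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
    model = XLMRobertaForSequenceClassification.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    # Encode the two strings as a sentence pair. KG contexts can be long,
    # so the pair is truncated to the tokenizer's 512-token limit (assumption).
    inputs = tokenizer(
        sentence,
        kg_context,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, 2)

    predicted_id = int(logits.argmax(dim=-1).item())
    # id2label comes from config.json: 0 -> "not mentioned", 1 -> "mentioned"
    return model.config.id2label[predicted_id]
```

Calling `is_entity_mentioned(sentence, kg_context)` with the two strings from the README example downloads the weights on first use and returns one of the two labels defined in `config.json`.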

config.json ADDED

	@@ -0,0 +1,35 @@

+{
+  "_name_or_path": ".",
+  "architectures": [
+    "XLMRobertaForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "eos_token_id": 2,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "not mentioned",
+    "1": "mentioned"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "mentioned": 1,
+    "not mentioned": 0
+  },
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "xlm-roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "output_past": true,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.9.2",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 250002
+}
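
The `id2label` entries in this config define how the classifier's two output logits are read. As an illustration only, the sketch below turns a pair of logits into a label and a probability with a softmax. The logit values are hypothetical placeholders rather than real model output, and the helper names are not part of the repository.

```python
import math

# Label mapping copied from the "id2label" field of config.json.
ID2LABEL = {0: "not mentioned", 1: "mentioned"}


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits for one (sentence, KG context) pair; not real model output.
example_logits = [-1.2, 2.3]
label, prob = decode_prediction(example_logits)
print(label, round(prob, 3))
```

Taking the argmax is one possible decision rule; an application that needs fewer false positives could instead require the "mentioned" probability to exceed a chosen threshold.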

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:216f83dd3982d7a4674125737cca5999ccd95c61043fd144ac08572f4163c05a
+size 1112269212
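
The three added lines are a Git LFS pointer, not the weights themselves: the repository stores only the spec version, a SHA-256 object ID and the byte size, while the binary lives in LFS storage. A small sketch of reading such a pointer is shown below; `parse_lfs_pointer` is a hypothetical helper written for this note, not part of Git LFS or of this repository.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from the pytorch_model.bin diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:216f83dd3982d7a4674125737cca5999ccd95c61043fd144ac08572f4163c05a
size 1112269212
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], f"{info['size_bytes'] / 1e9:.2f} GB")
```

The same layout applies to the `sentencepiece.bpe.model` and `tf_model.h5` pointers in this commit.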

sentencepiece.bpe.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+size 5069051

special_tokens_map.json ADDED

	@@ -0,0 +1 @@


+{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}

tf_model.h5 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b0c5eb6ed9b89260cb34d1f39ca31447ac0a975c874c565f36b75bbdf0cb29ca
+size 1112454872

tokenizer_config.json ADDED

	@@ -0,0 +1 @@


+{"bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "do_lower_case": false, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "xlm-roberta-base"}