sarnoult committed
Commit 80064d5
Parent(s): 4e9c04e

initial commit
README.md ADDED
@@ -0,0 +1,66 @@
---
language: nl
license: apache-2.0
tags:
- dighum
inference: false
---

# Early-modern Dutch NER (General Letters)

## Description
This is a fine-tuned NER model for early-modern Dutch United East India Company (VOC) letters, based on XLM-R_base [(Conneau et al., 2020)](https://aclanthology.org/2020.acl-main.747/). The model identifies *locations*, *persons*, *organisations* and *ships*, as well as derived forms of locations and religions.

## Intended uses and limitations

This model was fine-tuned (trained, validated and tested) on a single source of data, the General Letters (Generale Missiven). These letters cover a wide variety of Dutch, as they span most of the 17th and 18th centuries, and they were extended with editorial notes between 1960 and 2017. However, as the model was fine-tuned on this data only, it may perform less well on other texts from the same period.
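
For orientation, here is a minimal inference sketch using the Hugging Face `transformers` token-classification pipeline. The Hub identifier `CLTL/gm-ner-xlmrbase` and the example sentence are assumptions for illustration; adjust them to the actual repository name.

```python
# Minimal inference sketch (the Hub identifier is assumed; adjust if needed).
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="CLTL/gm-ner-xlmrbase",   # assumption: published Hub name of this model
    aggregation_strategy="simple",  # merge B-/I- word pieces into entity spans
)

# Invented early-modern-style Dutch sentence, for illustration only.
text = "Het schip Batavia is uit Amsterdam naar Bantam vertrokken."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```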

## Training data and tagset

The model was fine-tuned on the General Letters [GM-NER](https://github.com/cltl/voc-missives/tree/master/data/ner/datasplit_all_standard) dataset, with the following tagset:

| tag | description | notes |
| --- | ----------- | ----- |
| LOC | locations | |
| LOCderiv | derived forms of locations | by derivation, e.g. *Bandanezen*, or composition, e.g. *Javakoffie* |
| ORG | organisations | includes forms derived by composition, e.g. *Compagnieszaken* |
| PER | persons | |
| RELderiv | forms related to religion | merges religion names (*Christendom*), derived forms (*christenen*) and composed forms (*Christen-orangkay*) |
| SHP | ships | |

The base text for this dataset is OCR text that has been partially corrected. The text is clean overall, but errors remain.
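
These entity types are combined with the usual BIO scheme at the token level (as the label set in the model configuration further below also shows). A constructed example, not taken from the dataset:

```python
# Constructed BIO-tagged token sequence using this tagset.
# Illustration only; not an actual sentence from the General Letters.
example = [
    ("Het", "O"),
    ("schip", "O"),
    ("Sloterdijk", "B-SHP"),
    ("zeilde", "O"),
    ("van", "O"),
    ("Batavia", "B-LOC"),
    ("naar", "O"),
    ("Kaap", "B-LOC"),
    ("de", "I-LOC"),
    ("Goede", "I-LOC"),
    ("Hoop", "I-LOC"),
]
```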

## Training procedure
The model was fine-tuned from [xlm-roberta-base](https://huggingface.co/xlm-roberta-base), using [this script](https://github.com/huggingface/transformers/blob/master/examples/legacy/token-classification/run_ner.py).

Non-default training parameters are (see the sketch below):
* training batch size: 16
* max sequence length: 256
* number of epochs: 4, loading the best checkpoint (by loss) at the end, with checkpoints every 200 steps
* seed: 1
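
As a rough sketch, these settings correspond to a `TrainingArguments` configuration along the following lines; the output directory name is an assumption, and the maximum sequence length is passed to the data/tokenization arguments of the script rather than to `TrainingArguments`.

```python
# Sketch of the non-default hyperparameters as transformers.TrainingArguments.
# The actual run used the legacy run_ner.py script with equivalent flags.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gm-ner-xlmrbase",    # assumed output directory name
    per_device_train_batch_size=16,  # training batch size: 16
    num_train_epochs=4,              # number of epochs: 4
    evaluation_strategy="steps",
    eval_steps=200,
    save_steps=200,                  # checkpoints every 200 steps
    load_best_model_at_end=True,     # keep the best checkpoint at the end
    metric_for_best_model="loss",    # "best" is defined by validation loss
    greater_is_better=False,
    seed=1,
)
```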

## Evaluation
### Metric
* entity-level F1
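
Entity-level F1 counts a prediction as correct only when both the entity span and its type match the gold annotation. A minimal sketch of how it can be computed, for instance with the `seqeval` package (the label sequences below are invented):

```python
# Entity-level (span-level) F1 sketch using seqeval (pip install seqeval).
# Gold and predicted label sequences are invented for illustration.
from seqeval.metrics import f1_score

gold = [["B-LOC", "I-LOC", "O", "B-SHP", "O", "B-PER"]]
pred = [["B-LOC", "I-LOC", "O", "B-SHP", "O", "O"]]
print(f1_score(gold, pred))  # 0.8 (precision 2/2, recall 2/3)
```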

### Results

| tag | F1 |
| --- | --- |
| overall | 92.7 |
| LOC | 95.8 |
| LOCderiv | 92.7 |
| ORG | 92.5 |
| PER | 86.2 |
| RELderiv | 90.7 |
| SHP | 81.6 |

## Authors and references
### Authors
Sophie Arnoult, Lodewijk Petram and Piek Vossen

### Reference
This model was fine-tuned as part of experiments for a paper accepted at [LaTeCH-CLfL 2021](https://sighum.wordpress.com/events/latech-clfl-2021/accepted-papers/): *Batavia asked for advice. Pretrained language models for Named Entity Recognition in historical texts.*
config.json ADDED
@@ -0,0 +1,57 @@
{
  "_name_or_path": "/archive/transformers-hub/gm-ner-xlmrbase",
  "architectures": [
    "XLMRobertaForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "finetuning_task": "ner",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "B-LOC",
    "1": "B-LOCderiv",
    "2": "B-ORG",
    "3": "B-PER",
    "4": "B-RELderiv",
    "5": "B-SHP",
    "6": "I-LOC",
    "7": "I-LOCderiv",
    "8": "I-ORG",
    "9": "I-PER",
    "10": "I-SHP",
    "11": "O"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-LOC": 0,
    "B-LOCderiv": 1,
    "B-ORG": 2,
    "B-PER": 3,
    "B-RELderiv": 4,
    "B-SHP": 5,
    "I-LOC": 6,
    "I-LOCderiv": 7,
    "I-ORG": 8,
    "I-PER": 9,
    "I-SHP": 10,
    "O": 11
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.9.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
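
As the configuration above shows, predictions are made over a BIO label set; the `id2label` map converts the classifier's output indices back into these tags. A minimal manual-decoding sketch (the Hub identifier and input sentence are again assumptions for illustration):

```python
# Manual decoding sketch: map argmax class indices to BIO tags via id2label.
# Assumes the model is available under an identifier such as
# "CLTL/gm-ner-xlmrbase"; adjust to the actual repository name.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "CLTL/gm-ner-xlmrbase"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

inputs = tokenizer("Het schip Batavia vertrok uit Amsterdam.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    # Note: tokens are SentencePiece subwords, including <s> and </s>.
    print(token, model.config.id2label[pred.item()])
```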
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4d524916636607f6f8c7a37b0ff10890fb43c68509fdb84ab3c1f8821d305c13
size 1109938871
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8c17b309ac994b42d28cdca95c2bed88a8a7625faeebe35af7d3a81653a4087e
size 1110136976
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "xlm-roberta-base", "tokenizer_class": "XLMRobertaTokenizer"}