KennethEnevoldsen committed on
Commit
b2c4ef2
1 Parent(s): ad1f1f2

Update spaCy pipeline

README.md CHANGED
@@ -1 +1,193 @@
1
- This is currently a placeholder model.
1
+ ---
2
+ tags:
3
+ - spacy
4
+ - token-classification
5
+ language:
6
+ - da
7
+ license: Apache-2.0-License
8
+ model-index:
9
+ - name: da_dacy_small_trf
10
+ results:
11
+ - tasks:
12
+ name: NER
13
+ type: token-classification
14
+ metrics:
15
+ - name: Precision
16
+ type: precision
17
+ value: 0.81724846
18
+ - name: Recall
19
+ type: recall
20
+ value: 0.8291666667
21
+ - name: F Score
22
+ type: f_score
23
+ value: 0.8231644261
24
+ - tasks:
25
+ name: SENTER
26
+ type: token-classification
27
+ metrics:
28
+ - name: Precision
29
+ type: precision
30
+ value: 0.8603839442
31
+ - name: Recall
32
+ type: recall
33
+ value: 0.8741134752
34
+ - name: F Score
35
+ type: f_score
36
+ value: 0.8671943712
37
+ - tasks:
38
+ name: UNLABELED_DEPENDENCIES
39
+ type: token-classification
40
+ metrics:
41
+ - name: Accuracy
42
+ type: accuracy
43
+ value: 0.8492442546
44
+ - tasks:
45
+ name: LABELED_DEPENDENCIES
46
+ type: token-classification
47
+ metrics:
48
+ - name: Accuracy
49
+ type: accuracy
50
+ value: 0.8176199573
51
+ ---
52
+
53
+ <a href="https://github.com/centre-for-humanities-computing/Dacy"><img src="https://centre-for-humanities-computing.github.io/DaCy/_static/icon.png" width="175" height="175" align="right" /></a>
54
+
55
+ # DaCy small transformer
56
+
57
+ DaCy is a Danish language processing framework with state-of-the-art pipelines as well as functionality for analysing Danish pipelines.
58
+ DaCy's largest pipeline has achieved state-of-the-art performance on named entity recognition, part-of-speech tagging and dependency
59
+ parsing for Danish on the DaNE dataset. Check out the [DaCy repository](https://github.com/centre-for-humanities-computing/DaCy) for material on how to use DaCy and reproduce the results.
60
+ DaCy also contains guides on usage of the package as well as behavioural tests for biases and robustness of Danish NLP pipelines.
61
+
62
+
63
+ | Feature | Description |
64
+ | --- | --- |
65
+ | **Name** | `da_dacy_small_trf` |
66
+ | **Version** | `0.1.0` |
67
+ | **spaCy** | `>=3.1.1,<3.2.0` |
68
+ | **Default Pipeline** | `transformer`, `morphologizer`, `parser`, `attribute_ruler`, `lemmatizer`, `ner` |
69
+ | **Components** | `transformer`, `morphologizer`, `parser`, `attribute_ruler`, `lemmatizer`, `ner` |
70
+ | **Vectors** | 0 keys, 0 unique vectors (0 dimensions) |
71
+ | **Sources** | [UD Danish DDT v2.5](https://github.com/UniversalDependencies/UD_Danish-DDT) (Johannsen, Anders; Martínez Alonso, Héctor; Plank, Barbara)<br />[DaNE](https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane) (Rasmus Hvingelby, Amalie B. Pauli, Maria Barrett, Christina Rosted, Lasse M. Lidegaard, Anders Søgaard)<br />[Maltehb/-l-ctra-danish-electra-small-cased](https://huggingface.co/Maltehb/-l-ctra-danish-electra-small-cased) (Malte Højmark-Bertelsen) |
72
+ | **License** | `Apache-2.0 License` |
73
+ | **Author** | [Centre for Humanities Computing Aarhus](https://chcaa.io/#/) |
74
+
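The compatibility range in the **spaCy** row can be checked before the pipeline is loaded. The following is a minimal sketch, not part of DaCy: the helper names `in_supported_range` and `load_pipeline` are hypothetical, and it assumes the pipeline package has already been installed (for example from the wheel in this repository) so that `spacy.load` can find it by name.

```python
import re


def parse_version(version):
    """Return the leading major.minor.patch of a version string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def in_supported_range(version, lower="3.1.1", upper="3.2.0"):
    """Check lower <= version < upper, mirroring the '>=3.1.1,<3.2.0' constraint."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


def load_pipeline(name="da_dacy_small_trf"):
    """Load the installed pipeline, refusing spaCy versions outside the stated range."""
    import spacy  # imported lazily so the helpers above work without spaCy installed

    if not in_supported_range(spacy.__version__):
        raise RuntimeError(
            f"spaCy {spacy.__version__} is outside the range >=3.1.1,<3.2.0"
        )
    return spacy.load(name)
```

Calling `load_pipeline()` returns a regular spaCy `Language` object, so `nlp = load_pipeline()` followed by `doc = nlp("...")` gives access to the components listed in the table above.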
75
+ ### Label Scheme
76
+
77
+ <details>
78
+
79
+ <summary>View label scheme (192 labels for 3 components)</summary>
80
+
81
+ | Component | Labels |
82
+ | --- | --- |
83
+ | **`morphologizer`** | `AdpType=Prep\|POS=ADP`, `Definite=Ind\|Gender=Com\|Number=Sing\|POS=NOUN`, `Mood=Ind\|POS=AUX\|Tense=Pres\|VerbForm=Fin\|Voice=Act`, `POS=PROPN`, `Definite=Ind\|Number=Sing\|POS=VERB\|Tense=Past\|VerbForm=Part`, `Definite=Def\|Gender=Neut\|Number=Sing\|POS=NOUN`, `POS=SCONJ`, `Definite=Def\|Gender=Com\|Number=Sing\|POS=NOUN`, `Mood=Ind\|POS=VERB\|Tense=Pres\|VerbForm=Fin\|Voice=Act`, `POS=ADV`, `Number=Plur\|POS=DET\|PronType=Dem`, `Degree=Pos\|Number=Plur\|POS=ADJ`, `Definite=Ind\|Gender=Com\|Number=Plur\|POS=NOUN`, `POS=PUNCT`, `POS=CCONJ`, `Definite=Ind\|Degree=Cmp\|Number=Sing\|POS=ADJ`, `Degree=Cmp\|POS=ADJ`, `POS=PRON\|PartType=Inf`, `Gender=Com\|Number=Sing\|POS=DET\|PronType=Ind`, `Definite=Ind\|Degree=Pos\|Number=Sing\|POS=ADJ`, `Case=Acc\|Gender=Neut\|Number=Sing\|POS=PRON\|Person=3\|PronType=Prs`, `Definite=Ind\|Gender=Neut\|Number=Plur\|POS=NOUN`, `Definite=Def\|Degree=Pos\|Number=Sing\|POS=ADJ`, `Gender=Neut\|Number=Sing\|POS=DET\|PronType=Dem`, `Degree=Pos\|POS=ADV`, `Definite=Def\|Number=Sing\|POS=VERB\|Tense=Past\|VerbForm=Part`, `Definite=Ind\|Gender=Neut\|Number=Sing\|POS=NOUN`, `POS=PRON\|PronType=Dem`, `NumType=Card\|POS=NUM`, `Definite=Ind\|Degree=Pos\|Gender=Neut\|Number=Sing\|POS=ADJ`, `Case=Acc\|Gender=Com\|Number=Sing\|POS=PRON\|Person=3\|PronType=Prs`, `Degree=Pos\|Gender=Com\|Number=Sing\|POS=ADJ`, `Case=Nom\|Gender=Com\|Number=Sing\|POS=PRON\|Person=3\|PronType=Prs`, `NumType=Ord\|POS=ADJ`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Mood=Ind\|POS=AUX\|Tense=Past\|VerbForm=Fin\|Voice=Act`, `POS=VERB\|VerbForm=Inf\|Voice=Act`, `Mood=Ind\|POS=VERB\|Tense=Past\|VerbForm=Fin\|Voice=Act`, `POS=NOUN`, `Mood=Ind\|POS=VERB\|Tense=Pres\|VerbForm=Fin\|Voice=Pass`, `POS=ADP\|PartType=Inf`, `Degree=Pos\|POS=ADJ`, `Definite=Def\|Gender=Com\|Number=Plur\|POS=NOUN`, `Number[psor]=Sing\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs`, 
`Case=Gen\|Definite=Def\|Gender=Com\|Number=Sing\|POS=NOUN`, `POS=AUX\|VerbForm=Inf\|Voice=Act`, `Definite=Ind\|Degree=Pos\|Gender=Com\|Number=Sing\|POS=ADJ`, `Gender=Com\|Number=Sing\|POS=DET\|PronType=Dem`, `Number=Plur\|POS=DET\|PronType=Ind`, `Gender=Com\|Number=Sing\|POS=PRON\|PronType=Ind`, `Case=Acc\|POS=PRON\|Person=3\|PronType=Prs\|Reflex=Yes`, `POS=PART\|PartType=Inf`, `Gender=Neut\|Number=Sing\|POS=DET\|PronType=Ind`, `Case=Acc\|Number=Plur\|POS=PRON\|Person=3\|PronType=Prs`, `Case=Gen\|Definite=Def\|Gender=Neut\|Number=Sing\|POS=NOUN`, `Case=Nom\|Number=Plur\|POS=PRON\|Person=3\|PronType=Prs`, `Case=Nom\|Gender=Com\|Number=Sing\|POS=PRON\|Person=1\|PronType=Prs`, `Case=Nom\|Gender=Com\|POS=PRON\|PronType=Ind`, `Gender=Neut\|Number=Sing\|POS=PRON\|PronType=Ind`, `Mood=Imp\|POS=VERB`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs`, `Definite=Ind\|Number=Sing\|POS=AUX\|Tense=Past\|VerbForm=Part`, `POS=X`, `Case=Nom\|Gender=Com\|Number=Plur\|POS=PRON\|Person=1\|PronType=Prs`, `Case=Gen\|Definite=Def\|Gender=Com\|Number=Plur\|POS=NOUN`, `POS=VERB\|Tense=Pres\|VerbForm=Part`, `Number=Plur\|POS=PRON\|PronType=Int,Rel`, `POS=VERB\|VerbForm=Inf\|Voice=Pass`, `Case=Gen\|Definite=Ind\|Gender=Com\|Number=Sing\|POS=NOUN`, `Degree=Cmp\|POS=ADV`, `POS=ADV\|PartType=Inf`, `Degree=Sup\|POS=ADV`, `Number=Plur\|POS=PRON\|PronType=Dem`, `Number=Plur\|POS=PRON\|PronType=Ind`, `Definite=Def\|Gender=Neut\|Number=Plur\|POS=NOUN`, `Case=Acc\|Gender=Com\|Number=Sing\|POS=PRON\|Person=1\|PronType=Prs`, `Case=Gen\|POS=PROPN`, `POS=ADP`, `Degree=Cmp\|Number=Plur\|POS=ADJ`, `Definite=Def\|Degree=Sup\|POS=ADJ`, `Gender=Neut\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs`, `Degree=Pos\|Number=Sing\|POS=ADJ`, `Number=Plur\|Number[psor]=Sing\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Gender=Com\|Number=Sing\|Number[psor]=Plur\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs\|Style=Form`, 
`Number=Plur\|POS=PRON\|PronType=Rcp`, `Case=Gen\|Degree=Cmp\|POS=ADJ`, `Case=Gen\|Definite=Def\|Gender=Neut\|Number=Plur\|POS=NOUN`, `Number[psor]=Plur\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs`, `POS=INTJ`, `Number=Plur\|Number[psor]=Sing\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs`, `Degree=Pos\|Gender=Neut\|Number=Sing\|POS=ADJ`, `Gender=Neut\|Number=Sing\|Number[psor]=Plur\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs\|Style=Form`, `Case=Acc\|Gender=Com\|Number=Sing\|POS=PRON\|Person=2\|PronType=Prs`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=2\|Poss=Yes\|PronType=Prs`, `Case=Gen\|Definite=Ind\|Gender=Neut\|Number=Plur\|POS=NOUN`, `Number=Sing\|POS=PRON\|PronType=Int,Rel`, `Number=Plur\|Number[psor]=Plur\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs\|Style=Form`, `Gender=Neut\|Number=Sing\|POS=PRON\|PronType=Int,Rel`, `Definite=Def\|Degree=Sup\|Number=Plur\|POS=ADJ`, `Case=Nom\|Gender=Com\|Number=Sing\|POS=PRON\|Person=2\|PronType=Prs`, `Gender=Neut\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Definite=Ind\|Number=Sing\|POS=NOUN`, `Number=Plur\|POS=VERB\|Tense=Past\|VerbForm=Part`, `Number=Plur\|Number[psor]=Sing\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `POS=SYM`, `Case=Nom\|Gender=Com\|POS=PRON\|Person=2\|Polite=Form\|PronType=Prs`, `Degree=Sup\|POS=ADJ`, `Number=Plur\|POS=DET\|PronType=Ind\|Style=Arch`, `Case=Gen\|Gender=Com\|Number=Sing\|POS=DET\|PronType=Dem`, `Foreign=Yes\|POS=X`, `POS=DET\|Person=2\|Polite=Form\|Poss=Yes\|PronType=Prs`, `Gender=Neut\|Number=Sing\|POS=PRON\|PronType=Dem`, `Case=Acc\|Gender=Com\|Number=Plur\|POS=PRON\|Person=1\|PronType=Prs`, `Case=Gen\|Definite=Ind\|Gender=Neut\|Number=Sing\|POS=NOUN`, `Case=Gen\|POS=PRON\|PronType=Int,Rel`, `Gender=Com\|Number=Sing\|POS=PRON\|PronType=Dem`, `Abbr=Yes\|POS=X`, `Case=Gen\|Definite=Ind\|Gender=Com\|Number=Plur\|POS=NOUN`, `Definite=Def\|Degree=Abs\|POS=ADJ`, `Definite=Ind\|Degree=Sup\|Number=Sing\|POS=ADJ`, 
`Definite=Ind\|POS=NOUN`, `Gender=Com\|Number=Plur\|POS=NOUN`, `Number[psor]=Plur\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs`, `Gender=Com\|POS=PRON\|PronType=Int,Rel`, `Case=Nom\|Gender=Com\|Number=Plur\|POS=PRON\|Person=2\|PronType=Prs`, `Degree=Abs\|POS=ADV`, `POS=VERB\|VerbForm=Ger`, `POS=VERB\|Tense=Past\|VerbForm=Part`, `Definite=Def\|Degree=Sup\|Number=Sing\|POS=ADJ`, `Number=Plur\|Number[psor]=Plur\|POS=PRON\|Person=1\|Poss=Yes\|PronType=Prs\|Style=Form`, `Case=Gen\|Definite=Def\|Degree=Pos\|Number=Sing\|POS=ADJ`, `Case=Gen\|Degree=Pos\|Number=Plur\|POS=ADJ`, `Case=Acc\|Gender=Com\|POS=PRON\|Person=2\|Polite=Form\|PronType=Prs`, `Gender=Com\|Number=Sing\|POS=PRON\|PronType=Int,Rel`, `POS=VERB\|Tense=Pres`, `Case=Gen\|Number=Plur\|POS=DET\|PronType=Ind`, `Number[psor]=Plur\|POS=DET\|Person=2\|Poss=Yes\|PronType=Prs`, `POS=PRON\|Person=2\|Polite=Form\|Poss=Yes\|PronType=Prs`, `Gender=Neut\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=2\|Poss=Yes\|PronType=Prs`, `POS=AUX\|Tense=Pres\|VerbForm=Part`, `Mood=Ind\|POS=VERB\|Tense=Past\|VerbForm=Fin\|Voice=Pass`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Degree=Sup\|Number=Plur\|POS=ADJ`, `Case=Acc\|Gender=Com\|Number=Plur\|POS=PRON\|Person=2\|PronType=Prs`, `Gender=Neut\|Number=Sing\|Number[psor]=Sing\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Definite=Ind\|Number=Plur\|POS=NOUN`, `Case=Gen\|Number=Plur\|POS=VERB\|Tense=Past\|VerbForm=Part`, `Mood=Imp\|POS=AUX`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=PRON\|Person=1\|Poss=Yes\|PronType=Prs`, `Number[psor]=Sing\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs`, `Definite=Def\|Gender=Com\|Number=Sing\|POS=VERB\|Tense=Past\|VerbForm=Part`, `Number=Plur\|Number[psor]=Sing\|POS=DET\|Person=2\|Poss=Yes\|PronType=Prs`, `Case=Gen\|Gender=Com\|Number=Sing\|POS=DET\|PronType=Ind`, `Case=Gen\|POS=NOUN`, `Number[psor]=Plur\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs`, `POS=DET\|PronType=Dem`, 
`Definite=Def\|Number=Plur\|POS=NOUN` |
84
+ | **`parser`** | `ROOT`, `acl:relcl`, `advcl`, `advmod`, `amod`, `appos`, `aux`, `case`, `cc`, `ccomp`, `compound:prt`, `conj`, `cop`, `dep`, `det`, `expl`, `fixed`, `flat`, `iobj`, `list`, `mark`, `nmod`, `nmod:poss`, `nsubj`, `nummod`, `obj`, `obl`, `obl:loc`, `obl:tmod`, `punct`, `xcomp` |
85
+ | **`ner`** | `LOC`, `MISC`, `ORG`, `PER` |
86
+
87
+ </details>
88
+
89
+ ### Accuracy
90
+
91
+ | Type | Score |
92
+ | --- | --- |
93
+ | `POS_ACC` | 95.83 |
94
+ | `MORPH_ACC` | 95.70 |
95
+ | `DEP_UAS` | 84.92 |
96
+ | `DEP_LAS` | 81.76 |
97
+ | `SENTS_P` | 86.04 |
98
+ | `SENTS_R` | 87.41 |
99
+ | `SENTS_F` | 86.72 |
100
+ | `LEMMA_ACC` | 84.91 |
101
+ | `ENTS_F` | 82.32 |
102
+ | `ENTS_P` | 81.72 |
103
+ | `ENTS_R` | 82.92 |
104
+ | `TRANSFORMER_LOSS` | 41746686.63 |
105
+ | `MORPHOLOGIZER_LOSS` | 3458966.49 |
106
+ | `PARSER_LOSS` | 15104898.38 |
107
+ | `NER_LOSS` | 546098.45 |
108
+
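The F-scores in the table follow from the reported precision and recall as their harmonic mean. As a quick sanity check, the sketch below recomputes `ENTS_F` and `SENTS_F` from the unrounded values given in the model-index metadata; the function name `f_score` is only an illustrative choice.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Unrounded precision and recall as reported for NER and sentence segmentation.
ents_f = f_score(0.81724846, 0.8291666667)
sents_f = f_score(0.8603839442, 0.8741134752)

print(f"ENTS_F  = {ents_f * 100:.2f}")
print(f"SENTS_F = {sents_f * 100:.2f}")
```

Both recomputed values agree with the reported F-scores to within rounding.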
109
+
110
+ ## Bias and Robustness
111
+
112
+ In addition to spaCy's validation on the DaNE test set, DaCy applies a series of augmentations to the DaNE test set to measure how well the models handle such perturbations.
113
+ These can be seen as behavioural probes akin to the NLP checklist.
114
+
115
+ ### Deterministic Augmentations
116
+ Deterministic augmentations are augmentations which always yield the same result.
117
+
118
+ | Augmentation | Part-of-speech tagging (Accuracy) | Morphological tagging (Accuracy) | Dependency Parsing (UAS) | Dependency Parsing (LAS) | Sentence segmentation (F1) | Lemmatization (Accuracy) | Named entity recognition (F1) |
119
+ | --- | --- | --- | --- | --- | --- | --- | --- |
120
+ | No augmentation | 0.98 | 0.974 | 0.868 | 0.836 | 0.936 | 0.844 | 0.765 |
121
+ | Æøå Augmentation | 0.955 | 0.948 | 0.823 | 0.783 | 0.922 | 0.754 | 0.718 |
122
+ | Lowercase | 0.974 | 0.97 | 0.862 | 0.828 | 0.905 | 0.848 | 0.681 |
123
+ | No Spacing | 0.229 | 0.229 | 0.004 | 0.003 | 0.824 | 0.225 | 0.048 |
124
+ | Abbreviated first names | 0.979 | 0.973 | 0.864 | 0.832 | 0.94 | 0.845 | 0.699 |
125
+ | Input size augmentation 5 sentences | 0.956 | 0.956 | 0.851 | 0.818 | 0.883 | 0.844 | 0.743 |
126
+ | Input size augmentation 10 sentences | 0.959 | 0.958 | 0.853 | 0.821 | 0.897 | 0.844 | 0.755 |
127
+
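To make the deterministic augmentations concrete, here is a minimal string-level sketch of four of them. It is an illustration under stated assumptions, not DaCy's implementation: the function names are hypothetical, the uppercase mappings (`Æ` to `Ae` and so on) are assumed, and real augmenters would also have to keep the gold annotations aligned with the changed text.

```python
# Assumed mapping; the lowercase pairs come from the augmenter description,
# the capitalised variants are an assumption made for this sketch.
AEOEAA = {"æ": "ae", "ø": "oe", "å": "aa", "Æ": "Ae", "Ø": "Oe", "Å": "Aa"}


def replace_aeoeaa(text):
    """Replace æ, ø and å with the spelling variations ae, oe and aa."""
    return "".join(AEOEAA.get(ch, ch) for ch in text)


def lowercase(text):
    """Lowercase all text."""
    return text.lower()


def remove_spacing(text):
    """Remove all space characters from the text."""
    return text.replace(" ", "")


def abbreviate_first_name(full_name):
    """Turn 'Kenneth Enevoldsen' into 'K. Enevoldsen'; leave single names untouched."""
    first, sep, rest = full_name.partition(" ")
    if not sep or not first:
        return full_name
    return f"{first[0]}. {rest}"
```

Because each function is a pure mapping from input text to output text, applying it twice to the same input gives the same result, which is what makes a single evaluation run sufficient for these rows.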
128
+
129
+
130
+ ### Stochastic Augmentations
131
+ Stochastic augmentations are augmentations which are repeated multiple times to estimate the effect of the augmentation.
132
+
133
+ | Augmentation | Part-of-speech tagging (Accuracy) | Morphological tagging (Accuracy) | Dependency Parsing (UAS) | Dependency Parsing (LAS) | Sentence segmentation (F1) | Lemmatization (Accuracy) | Named entity recognition (F1) |
134
+ | --- | --- | --- | --- | --- | --- | --- | --- |
135
+ | Keystroke errors 2% | 0.931 (0.003) | 0.929 (0.003) | 0.797 (0.003) | 0.753 (0.003) | 0.884 (0.003) | 0.772 (0.003) | 0.657 (0.003) |
136
+ | Keystroke errors 5% | 0.859 (0.003) | 0.863 (0.003) | 0.699 (0.003) | 0.641 (0.003) | 0.824 (0.003) | 0.681 (0.003) | 0.53 (0.003) |
137
+ | Keystroke errors 15% | 0.633 (0.006) | 0.662 (0.006) | 0.439 (0.006) | 0.358 (0.006) | 0.688 (0.006) | 0.459 (0.006) | 0.293 (0.006) |
138
+ | Danish names | 0.979 (0.0) | 0.974 (0.0) | 0.867 (0.0) | 0.835 (0.0) | 0.943 (0.0) | 0.847 (0.0) | 0.748 (0.0) |
139
+ | Muslim names | 0.979 (0.0) | 0.974 (0.0) | 0.865 (0.0) | 0.833 (0.0) | 0.94 (0.0) | 0.847 (0.0) | 0.732 (0.0) |
140
+ | Female names | 0.979 (0.0) | 0.974 (0.0) | 0.867 (0.0) | 0.835 (0.0) | 0.946 (0.0) | 0.847 (0.0) | 0.754 (0.0) |
141
+ | Male names | 0.979 (0.0) | 0.974 (0.0) | 0.867 (0.0) | 0.835 (0.0) | 0.943 (0.0) | 0.847 (0.0) | 0.748 (0.0) |
142
+ | Spacing Augmentation 5% | 0.941 (0.002) | 0.936 (0.002) | 0.755 (0.002) | 0.725 (0.002) | 0.907 (0.002) | 0.811 (0.002) | 0.699 (0.002) |
143
+
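The "mean (standard deviation)" cells in the stochastic table come from repeating a random augmentation and aggregating the scores. The sketch below shows the general idea with a toy setup; everything in it is an assumption for illustration: the tiny `NEIGHBOURS` map stands in for a full Danish QWERTY layout, character agreement stands in for a real task metric, and the sample standard deviation is used because the source does not say which estimator was applied.

```python
import random
import statistics

# Hypothetical, heavily truncated neighbour map; a real one covers the whole keyboard.
NEIGHBOURS = {"a": "qsz", "e": "wrd", "n": "bm", "r": "et", "s": "ad", "t": "ry"}


def keystroke_errors(text, rate, rng):
    """Replace each mapped character with a neighbouring key with probability `rate`."""
    out = []
    for ch in text:
        options = NEIGHBOURS.get(ch.lower())
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)


def char_agreement(original, augmented):
    """Toy score: fraction of positions where the two strings agree."""
    return sum(a == b for a, b in zip(original, augmented)) / len(original)


def repeated_estimate(text, rate, repeats=20, seed=0):
    """Run the augmentation `repeats` times; return the mean and sample std of the score."""
    rng = random.Random(seed)
    scores = [
        char_agreement(text, keystroke_errors(text, rate, rng))
        for _ in range(repeats)
    ]
    return statistics.mean(scores), statistics.stdev(scores)


mean, std = repeated_estimate("danske sætninger er testet her", rate=0.15)
print(f"{mean:.3f} ({std:.3f})")
```

Seeding the generator makes the estimate reproducible, while the standard deviation indicates how much a single run could differ from the reported mean.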
144
+ <details>
145
+
146
+ <summary> Description of Augmenters </summary>
147
+
148
+
149
+
150
+ **No augmentation:**
151
+ Applies no augmentation to the DaNE test set.
152
+
153
+ **Æøå Augmentation:**
154
+ This augmentation replaces æ, ø, and å with their spelling variations ae, oe and aa respectively.
155
+
156
+ **Lowercase:**
157
+ This augmentation lowercases all text.
158
+
159
+ **No Spacing:**
160
+ This augmentation removes all spacing from the text.
161
+
162
+ **Abbreviated first names:**
163
+ This augmentation abbreviates the first names of entities. For instance, 'Kenneth Enevoldsen' would become 'K. Enevoldsen'.
164
+
165
+ **Keystroke errors 2%:**
166
+ This augmentation simulates keystroke errors by replacing 2% of keys with a neighbouring key on a Danish QWERTY keyboard. As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is reported with its standard deviation in parentheses.
167
+
168
+ **Keystroke errors 5%:**
169
+ This augmentation simulates keystroke errors by replacing 5% of keys with a neighbouring key on a Danish QWERTY keyboard. As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is reported with its standard deviation in parentheses.
170
+
171
+ **Keystroke errors 15%:**
172
+ This augmentation simulates keystroke errors by replacing 15% of keys with a neighbouring key on a Danish QWERTY keyboard. As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is reported with its standard deviation in parentheses.
173
+
174
+ **Danish names:**
175
+ This augmentation replaces all names with Danish names derived from Danmarks Statistik (2021). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is reported with its standard deviation in parentheses.
176
+
177
+ **Muslim names:**
178
+ This augmentation replaces all names with Muslim names derived from Meldgaard (2005). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is reported with its standard deviation in parentheses.
179
+
180
+ **Female names:**
181
+ This augmentation replaces all names with Danish female names derived from Danmarks Statistik (2021). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is reported with its standard deviation in parentheses.
182
+
183
+ **Male names:**
184
+ This augmentation replaces all names with Danish male names derived from Danmarks Statistik (2021). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is reported with its standard deviation in parentheses.
185
+
186
+ **Spacing Augmentation 5%:**
187
+ This augmentation replaces all names with Danish male names derived from Danmarks Statistik (2021). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is reported with its standard deviation in parentheses.
188
+ </details>
189
+ <br />
190
+
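One way to read the two robustness tables is to compare each augmented score with the unaugmented baseline. The sketch below does this for the named entity recognition F1 column, using values copied from the tables above; the helper name `drops` is hypothetical, and the differences are plain subtractions with no significance testing implied.

```python
# NER F1 values copied from the deterministic and stochastic tables (means only).
NER_F1 = {
    "No augmentation": 0.765,
    "Æøå Augmentation": 0.718,
    "Lowercase": 0.681,
    "No Spacing": 0.048,
    "Abbreviated first names": 0.699,
    "Danish names": 0.748,
    "Muslim names": 0.732,
    "Female names": 0.754,
    "Male names": 0.748,
}


def drops(scores, baseline="No augmentation"):
    """Absolute drop in score relative to the baseline row, rounded to 3 decimals."""
    base = scores[baseline]
    return {
        name: round(base - score, 3)
        for name, score in scores.items()
        if name != baseline
    }


# Print the augmentations from largest to smallest drop in NER F1.
for name, drop in sorted(drops(NER_F1).items(), key=lambda item: -item[1]):
    print(f"{name:<26} -{drop:.3f}")
```

Sorting by drop puts the most damaging augmentation first, which makes it easier to see at a glance where the pipeline is least robust.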
191
+
192
+ ### Hardware
193
+ This model was trained on a Quadro RTX 8000 GPU.
config.cfg CHANGED
@@ -104,7 +104,6 @@ stride = 96
104
 
105
  [components.transformer.model.tokenizer_config]
106
  use_fast = true
107
- strip_accents = false
108
 
109
  [corpora]
110
 
@@ -136,7 +135,7 @@ dropout = 0.1
136
  accumulate_gradient = 3
137
  patience = 5000
138
  max_epochs = 0
139
- max_steps = 1
140
  eval_frequency = 1000
141
  frozen_components = []
142
  before_to_disk = null
 
104
 
105
  [components.transformer.model.tokenizer_config]
106
  use_fast = true
 
107
 
108
  [corpora]
109
 
 
135
  accumulate_gradient = 3
136
  patience = 5000
137
  max_epochs = 0
138
+ max_steps = 40000
139
  eval_frequency = 1000
140
  frozen_components = []
141
  before_to_disk = null
da_dacy_small_trf-any-py3-none-any.whl CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:58343f2fc5b2d62c3863e843dcf9987afabad285ae19256b7dbe8acb7dd6df2d
3
- size 57359279
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9a76f9af63a196fccfc13b6dab46ef46ac1ba1202c15ad38b7189b07ee6e62be
3
+ size 57514565
meta.json CHANGED
@@ -2,13 +2,13 @@
2
  "lang":"da",
3
  "name":"dacy_small_trf",
4
  "version":"0.1.0",
5
- "description":"DaCy is a Danish language processing framework with state-of-the-art pipelines as well as functionality for analysing Danish pipelines. DaCy's largest pipeline has achieved State-of-the-Art performance on Named entity recognition, part-of-speech tagging and dependency parsing for Danish on the DaNE dataset. Check out the [DaCy repository](https://github.com/centre-for-humanities-computing/DaCy) for material on how to use DaCy and reproduce the results. This repository also contains guides on usage of the package as well as behavioural test for biases and robustness of Danish NLP pipelines.",
6
- "author":"Kenneth Enevoldsen",
7
- "email":"kenneth.enevoldsen@cas.au.dk",
8
- "url":"https://centre-for-humanities-computing.github.io/DaCy/",
9
  "license":"Apache-2.0 License",
10
- "spacy_version":">=3.1.0,<3.2.0",
11
- "spacy_git_version":"530b5d72f",
12
  "vectors":{
13
  "width":0,
14
  "vectors":0,
@@ -243,248 +243,251 @@
243
  "disabled":[
244
 
245
  ],
 
 
 
246
  "performance":{
247
- "pos_acc":0.1150828248,
248
- "morph_acc":0.1095611741,
249
  "morph_per_feat":{
250
  "Mood":{
251
- "p":0.0,
252
- "r":0.0,
253
- "f":0.0
254
  },
255
  "Tense":{
256
- "p":0.0,
257
- "r":0.0,
258
- "f":0.0
259
  },
260
  "VerbForm":{
261
- "p":0.0,
262
- "r":0.0,
263
- "f":0.0
264
  },
265
  "Voice":{
266
- "p":0.0,
267
- "r":0.0,
268
- "f":0.0
269
  },
270
  "Definite":{
271
- "p":0.0,
272
- "r":0.0,
273
- "f":0.0
274
  },
275
  "Gender":{
276
- "p":0.0,
277
- "r":0.0,
278
- "f":0.0
279
  },
280
  "Number":{
281
- "p":0.0,
282
- "r":0.0,
283
- "f":0.0
284
  },
285
  "AdpType":{
286
- "p":0.1556564822,
287
- "r":1.0,
288
- "f":0.2693819221
289
  },
290
  "PartType":{
291
- "p":0.0,
292
- "r":0.0,
293
- "f":0.0
294
  },
295
  "Case":{
296
- "p":0.0,
297
- "r":0.0,
298
- "f":0.0
299
  },
300
  "Person":{
301
- "p":0.0,
302
- "r":0.0,
303
- "f":0.0
304
  },
305
  "PronType":{
306
- "p":0.0,
307
- "r":0.0,
308
- "f":0.0
309
  },
310
  "NumType":{
311
- "p":0.0,
312
- "r":0.0,
313
- "f":0.0
314
  },
315
  "Degree":{
316
- "p":0.0,
317
- "r":0.0,
318
- "f":0.0
319
  },
320
  "Reflex":{
321
- "p":0.0,
322
- "r":0.0,
323
- "f":0.0
324
  },
325
  "Number[psor]":{
326
- "p":0.0,
327
- "r":0.0,
328
- "f":0.0
329
  },
330
  "Poss":{
331
- "p":0.0,
332
- "r":0.0,
333
- "f":0.0
334
  },
335
  "Foreign":{
336
- "p":0.0,
337
- "r":0.0,
338
- "f":0.0
339
  },
340
  "Abbr":{
341
- "p":0.0,
342
- "r":0.0,
343
- "f":0.0
344
  },
345
  "Style":{
346
- "p":0.0,
347
- "r":0.0,
348
- "f":0.0
349
  },
350
  "Polite":{
351
- "p":0.0,
352
- "r":0.0,
353
- "f":0.0
354
  }
355
  },
356
- "dep_uas":0.1536466438,
357
- "dep_las":0.0261424348,
358
  "dep_las_per_type":{
359
  "advmod":{
360
- "p":0.0,
361
- "r":0.0,
362
- "f":0.0
363
  },
364
  "root":{
365
- "p":0.0578034682,
366
- "r":0.2659574468,
367
- "f":0.0949667616
368
  },
369
  "nsubj":{
370
- "p":0.0226640159,
371
- "r":0.0601265823,
372
- "f":0.032919434
373
  },
374
  "case":{
375
- "p":0.0623268698,
376
- "r":0.0444664032,
377
- "f":0.0519031142
378
  },
379
  "obl":{
380
- "p":0.0,
381
- "r":0.0,
382
- "f":0.0
383
  },
384
  "cc":{
385
- "p":0.0,
386
- "r":0.0,
387
- "f":0.0
388
  },
389
  "conj":{
390
- "p":0.0,
391
- "r":0.0,
392
- "f":0.0
393
  },
394
  "obj":{
395
- "p":0.0,
396
- "r":0.0,
397
- "f":0.0
398
  },
399
  "aux":{
400
- "p":0.0,
401
- "r":0.0,
402
- "f":0.0
403
  },
404
  "acl:relcl":{
405
- "p":0.0,
406
- "r":0.0,
407
- "f":0.0
408
  },
409
  "obl:loc":{
410
- "p":0.0,
411
- "r":0.0,
412
- "f":0.0
413
  },
414
  "det":{
415
- "p":0.0,
416
- "r":0.0,
417
- "f":0.0
418
  },
419
  "amod":{
420
- "p":0.0,
421
- "r":0.0,
422
- "f":0.0
423
  },
424
  "nmod:poss":{
425
- "p":0.0,
426
- "r":0.0,
427
- "f":0.0
428
  },
429
  "ccomp":{
430
- "p":0.0,
431
- "r":0.0,
432
- "f":0.0
433
  },
434
  "nummod":{
435
- "p":0.0,
436
- "r":0.0,
437
- "f":0.0
438
  },
439
  "flat":{
440
- "p":0.0,
441
- "r":0.0,
442
- "f":0.0
443
  },
444
  "compound:prt":{
445
- "p":0.0,
446
- "r":0.0,
447
- "f":0.0
448
  },
449
  "advcl":{
450
- "p":0.0,
451
- "r":0.0,
452
- "f":0.0
453
  },
454
  "mark":{
455
- "p":0.0,
456
- "r":0.0,
457
- "f":0.0
458
  },
459
  "cop":{
460
- "p":0.0,
461
- "r":0.0,
462
- "f":0.0
463
  },
464
  "dep":{
465
- "p":0.0,
466
- "r":0.0,
467
- "f":0.0
468
  },
469
  "nmod":{
470
- "p":0.0,
471
- "r":0.0,
472
- "f":0.0
473
  },
474
  "iobj":{
475
- "p":0.0,
476
- "r":0.0,
477
- "f":0.0
478
- },
479
- "list":{
480
- "p":0.0,
481
- "r":0.0,
482
- "f":0.0
483
  },
484
  "xcomp":{
485
- "p":0.0,
486
- "r":0.0,
487
- "f":0.0
 
 
 
 
 
488
  },
489
  "vocative":{
490
  "p":0.0,
@@ -492,24 +495,24 @@
492
  "f":0.0
493
  },
494
  "fixed":{
495
- "p":0.0,
496
- "r":0.0,
497
- "f":0.0
498
- },
499
- "appos":{
500
- "p":0.0,
501
- "r":0.0,
502
- "f":0.0
503
  },
504
  "expl":{
505
- "p":0.0,
506
- "r":0.0,
507
- "f":0.0
 
 
 
 
 
508
  },
509
  "obl:tmod":{
510
- "p":0.0,
511
- "r":0.0,
512
- "f":0.0
513
  },
514
  "discourse":{
515
  "p":0.0,
@@ -517,39 +520,62 @@
517
  "f":0.0
518
  }
519
  },
520
- "sents_p":0.0007698229,
521
- "sents_r":0.0035460993,
522
- "sents_f":0.0012650221,
523
  "lemma_acc":0.8491041162,
524
- "ents_f":0.0076157001,
525
- "ents_p":0.0040957782,
526
- "ents_r":0.0541666667,
527
  "ents_per_type":{
528
- "ORG":{
529
- "p":0.0040957782,
530
- "r":0.2888888889,
531
- "f":0.0080770426
532
- },
533
  "PER":{
534
- "p":0.0,
535
- "r":0.0,
536
- "f":0.0
 
 
 
 
 
537
  },
538
  "MISC":{
539
- "p":0.0,
540
- "r":0.0,
541
- "f":0.0
542
  },
543
  "LOC":{
544
- "p":0.0,
545
- "r":0.0,
546
- "f":0.0
547
  }
548
  },
549
- "transformer_loss":0.0,
550
- "morphologizer_loss":274.2421875,
551
- "parser_loss":424.0338255167,
552
- "ner_loss":262.4495080709
553
  },
554
- "notes":"This is a test"
555
  }
 
2
  "lang":"da",
3
  "name":"dacy_small_trf",
4
  "version":"0.1.0",
5
+ "description":"\n<a href=\"https://github.com/centre-for-humanities-computing/Dacy\"><img src=\"https://centre-for-humanities-computing.github.io/DaCy/_static/icon.png\" width=\"175\" height=\"175\" align=\"right\" /></a>\n\n# DaCy small transformer\n\nDaCy is a Danish language processing framework with state-of-the-art pipelines as well as functionality for analysing Danish pipelines.\nDaCy's largest pipeline has achieved State-of-the-Art performance on Named entity recognition, part-of-speech tagging and dependency \nparsing for Danish on the DaNE dataset. Check out the [DaCy repository](https://github.com/centre-for-humanities-computing/DaCy) for material on how to use DaCy and reproduce the results. \nDaCy also contains guides on usage of the package as well as behavioural test for biases and robustness of Danish NLP pipelines.\n ",
6
+ "author":"Centre for Humanities Computing Aarhus",
7
+ "email":"Kenneth.enevoldsen@cas.au.dk",
8
+ "url":"https://chcaa.io/#/",
9
  "license":"Apache-2.0 License",
10
+ "spacy_version":">=3.1.1,<3.2.0",
11
+ "spacy_git_version":"ffaead8fe",
12
  "vectors":{
13
  "width":0,
14
  "vectors":0,
 
243
  "disabled":[
244
 
245
  ],
246
+ "_sourced_vectors_hashes":{
247
+
248
+ },
249
  "performance":{
250
+ "pos_acc":0.9583030655,
251
+ "morph_acc":0.9570439246,
252
  "morph_per_feat":{
253
  "Mood":{
254
+ "p":0.9950690335,
255
+ "r":0.9618684461,
256
+ "f":0.9781871062
257
  },
258
  "Tense":{
259
+ "p":0.9859922179,
260
+ "r":0.9540662651,
261
+ "f":0.9697665519
262
  },
263
  "VerbForm":{
264
+ "p":0.9823343849,
265
+ "r":0.952876377,
266
+ "f":0.9673811743
267
  },
268
  "Voice":{
269
+ "p":0.9938414165,
270
+ "r":0.9648729447,
271
+ "f":0.9791429655
272
  },
273
  "Definite":{
274
+ "p":0.9872480461,
275
+ "r":0.9482418017,
276
+ "f":0.9673518742
277
  },
278
  "Gender":{
279
+ "p":0.9793956044,
280
+ "r":0.9478231971,
281
+ "f":0.9633507853
282
  },
283
  "Number":{
284
+ "p":0.985179197,
285
+ "r":0.9535732916,
286
+ "f":0.9691186216
287
  },
288
  "AdpType":{
289
+ "p":1.0,
290
+ "r":0.9752431477,
291
+ "f":0.9874664279
292
  },
293
  "PartType":{
294
+ "p":1.0,
295
+ "r":0.9675324675,
296
+ "f":0.9834983498
297
  },
298
  "Case":{
299
+ "p":0.9934640523,
300
+ "r":0.9605055292,
301
+ "f":0.9767068273
302
  },
303
  "Person":{
304
+ "p":0.9908925319,
305
+ "r":0.9662522202,
306
+ "f":0.9784172662
307
  },
308
  "PronType":{
309
+ "p":0.9941077441,
310
+ "r":0.9712171053,
311
+ "f":0.9825291181
312
  },
313
  "NumType":{
314
+ "p":0.9791666667,
315
+ "r":0.9337748344,
316
+ "f":0.9559322034
317
  },
318
  "Degree":{
319
+ "p":0.9726708075,
320
+ "r":0.943373494,
321
+ "f":0.9577981651
322
  },
323
  "Reflex":{
324
+ "p":1.0,
325
+ "r":1.0,
326
+ "f":1.0
327
  },
328
  "Number[psor]":{
329
+ "p":1.0,
330
+ "r":0.988372093,
331
+ "f":0.9941520468
332
  },
333
  "Poss":{
334
+ "p":1.0,
335
+ "r":0.9772727273,
336
+ "f":0.9885057471
337
  },
338
  "Foreign":{
339
+ "p":0.8888888889,
340
+ "r":0.8,
341
+ "f":0.8421052632
342
  },
343
  "Abbr":{
344
+ "p":1.0,
345
+ "r":0.4,
346
+ "f":0.5714285714
347
  },
348
  "Style":{
349
+ "p":1.0,
350
+ "r":1.0,
351
+ "f":1.0
352
  },
353
  "Polite":{
354
+ "p":0.3333333333,
355
+ "r":0.25,
356
+ "f":0.2857142857
357
  }
358
  },
359
+ "dep_uas":0.8492442546,
360
+ "dep_las":0.8176199573,
361
  "dep_las_per_type":{
362
  "advmod":{
363
+ "p":0.7724637681,
364
+ "r":0.7528248588,
365
+ "f":0.7625178827
366
  },
367
  "root":{
368
+ "p":0.8561403509,
369
+ "r":0.865248227,
370
+ "f":0.860670194
371
  },
372
  "nsubj":{
373
+ "p":0.8939393939,
374
+ "r":0.8713080169,
375
+ "f":0.8824786325
376
  },
377
  "case":{
378
+ "p":0.9141414141,
379
+ "r":0.8942687747,
380
+ "f":0.9040959041
381
  },
382
  "obl":{
383
+ "p":0.7286585366,
384
+ "r":0.7433903577,
385
+ "f":0.7359507313
386
  },
387
  "cc":{
388
+ "p":0.8486646884,
389
+ "r":0.8313953488,
390
+ "f":0.8399412628
391
  },
392
  "conj":{
393
+ "p":0.671957672,
394
+ "r":0.6773333333,
395
+ "f":0.6746347942
396
  },
397
  "obj":{
398
+ "p":0.8560747664,
399
+ "r":0.8893203883,
400
+ "f":0.8723809524
401
  },
402
  "aux":{
403
+ "p":0.8885542169,
404
+ "r":0.860058309,
405
+ "f":0.8740740741
406
  },
407
  "acl:relcl":{
408
+ "p":0.6936416185,
409
+ "r":0.6486486486,
410
+ "f":0.6703910615
411
  },
412
  "obl:loc":{
413
+ "p":0.7222222222,
414
+ "r":0.7428571429,
415
+ "f":0.7323943662
416
  },
417
  "det":{
418
+ "p":0.9346733668,
419
+ "r":0.9192751236,
420
+ "f":0.926910299
421
  },
422
  "amod":{
423
+ "p":0.8549488055,
424
+ "r":0.8549488055,
425
+ "f":0.8549488055
426
  },
427
  "nmod:poss":{
428
+ "p":0.75,
429
+ "r":0.7128712871,
430
+ "f":0.730964467
431
  },
432
  "ccomp":{
433
+ "p":0.6885245902,
434
+ "r":0.6774193548,
435
+ "f":0.6829268293
436
  },
437
  "nummod":{
438
+ "p":0.8181818182,
439
+ "r":0.825,
440
+ "f":0.8215767635
441
  },
442
  "flat":{
443
+ "p":0.8636363636,
444
+ "r":0.880794702,
445
+ "f":0.8721311475
446
  },
447
  "compound:prt":{
448
+ "p":0.6551724138,
449
+ "r":0.4634146341,
450
+ "f":0.5428571429
451
  },
452
  "advcl":{
453
+ "p":0.6967213115,
454
+ "r":0.7327586207,
455
+ "f":0.7142857143
456
  },
457
  "mark":{
458
+ "p":0.9018789144,
459
+ "r":0.887063655,
460
+ "f":0.8944099379
461
  },
462
  "cop":{
463
+ "p":0.8514285714,
464
+ "r":0.8514285714,
465
+ "f":0.8514285714
466
  },
467
  "dep":{
468
+ "p":0.1960784314,
469
+ "r":0.3773584906,
470
+ "f":0.2580645161
471
  },
472
  "nmod":{
473
+ "p":0.7197452229,
474
+ "r":0.662109375,
475
+ "f":0.6897253306
476
  },
477
  "iobj":{
478
+ "p":0.7333333333,
479
+ "r":0.5,
480
+ "f":0.5945945946
 
 
 
 
 
481
  },
482
  "xcomp":{
483
+ "p":0.6315789474,
484
+ "r":0.406779661,
485
+ "f":0.4948453608
486
+ },
487
+ "list":{
488
+ "p":0.3636363636,
489
+ "r":0.2222222222,
490
+ "f":0.275862069
491
  },
492
  "vocative":{
493
  "p":0.0,
 
495
  "f":0.0
496
  },
497
  "fixed":{
498
+ "p":0.8947368421,
499
+ "r":0.8095238095,
500
+ "f":0.85
 
 
 
 
 
501
  },
502
  "expl":{
503
+ "p":0.9090909091,
504
+ "r":0.8823529412,
505
+ "f":0.8955223881
506
+ },
507
+ "appos":{
508
+ "p":0.6097560976,
509
+ "r":0.7575757576,
510
+ "f":0.6756756757
511
  },
512
  "obl:tmod":{
513
+ "p":0.8,
514
+ "r":0.2222222222,
515
+ "f":0.347826087
516
  },
517
  "discourse":{
518
  "p":0.0,
 
520
  "f":0.0
521
  }
522
  },
523
+ "sents_p":0.8603839442,
524
+ "sents_r":0.8741134752,
525
+ "sents_f":0.8671943712,
526
  "lemma_acc":0.8491041162,
527
+ "ents_f":0.8231644261,
528
+ "ents_p":0.81724846,
529
+ "ents_r":0.8291666667,
530
  "ents_per_type":{
531
  "PER":{
532
+ "p":0.9290322581,
533
+ "r":0.8674698795,
534
+ "f":0.8971962617
535
+ },
536
+ "ORG":{
537
+ "p":0.7619047619,
538
+ "r":0.7111111111,
539
+ "f":0.7356321839
540
  },
541
  "MISC":{
542
+ "p":0.6739130435,
543
+ "r":0.8230088496,
544
+ "f":0.7410358566
545
  },
546
  "LOC":{
547
+ "p":0.8818181818,
548
+ "r":0.8738738739,
549
+ "f":0.8778280543
550
  }
551
  },
552
+ "transformer_loss":417466.8663170633,
553
+ "morphologizer_loss":34589.6649030063,
554
+ "parser_loss":151048.9837691551,
555
+ "ner_loss":5460.9844742843
556
  },
557
+ "sources":[
558
+ {
559
+ "name":"UD Danish DDT v2.5",
560
+ "url":"https://github.com/UniversalDependencies/UD_Danish-DDT",
561
+ "license":"CC BY-SA 4.0",
562
+ "author":"Johannsen, Anders; Mart\u00ednez Alonso, H\u00e9ctor; Plank, Barbara"
563
+ },
564
+ {
565
+ "name":"DaNE",
566
+ "url":"https://github.com/alexandrainst/danlp/blob/master/docs/datasets.md#danish-dependency-treebank-dane",
567
+ "license":"CC BY-SA 4.0",
568
+ "author":"Rasmus Hvingelby, Amalie B. Pauli, Maria Barrett, Christina Rosted, Lasse M. Lidegaard, Anders S\u00f8gaard"
569
+ },
570
+ {
571
+ "name":"Maltehb/-l-ctra-danish-electra-small-cased",
572
+ "author":"Malte H\u00f8jmark-Bertelsen",
573
+ "url":"https://huggingface.co/Maltehb/-l-ctra-danish-electra-small-cased",
574
+ "license":"CC BY 4.0"
575
+ }
576
+ ],
577
+ "requirements":[
578
+ "spacy-transformers>=1.0.3,<1.1.0"
579
+ ],
580
+ "notes":"\n## Bias and Robustness\n\nBesides the validation done by spaCy on the DaNE test set, DaCy also provides a series of augmentations to the DaNE test set to see how well the models deal with these types of augmentations.\nThese can be seen as behavioural probes akin to the NLP checklist.\n\n### Deterministic Augmentations\nDeterministic augmentations are augmentations which always yield the same result.\n\n| Augmentation | Part-of-speech tagging (Accuracy) | Morphological tagging (Accuracy) | Dependency Parsing (UAS) | Dependency Parsing (LAS) | Sentence segmentation (F1) | Lemmatization (Accuracy) | Named entity recognition (F1) |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| No augmentation | 0.98 | 0.974 | 0.868 | 0.836 | 0.936 | 0.844 | 0.765 |\n| \u00c6\u00f8\u00e5 Augmentation | 0.955 | 0.948 | 0.823 | 0.783 | 0.922 | 0.754 | 0.718 |\n| Lowercase | 0.974 | 0.97 | 0.862 | 0.828 | 0.905 | 0.848 | 0.681 |\n| No Spacing | 0.229 | 0.229 | 0.004 | 0.003 | 0.824 | 0.225 | 0.048 |\n| Abbreviated first names | 0.979 | 0.973 | 0.864 | 0.832 | 0.94 | 0.845 | 0.699 |\n| Input size augmentation 5 sentences | 0.956 | 0.956 | 0.851 | 0.818 | 0.883 | 0.844 | 0.743 |\n| Input size augmentation 10 sentences | 0.959 | 0.958 | 0.853 | 0.821 | 0.897 | 0.844 | 0.755 |\n\n\n\n### Stochastic Augmentations\nStochastic augmentations are augmentations which are repeated multiple times to estimate the effect of the augmentation.\n\n| Augmentation | Part-of-speech tagging (Accuracy) | Morphological tagging (Accuracy) | Dependency Parsing (UAS) | Dependency Parsing (LAS) | Sentence segmentation (F1) | Lemmatization (Accuracy) | Named entity recognition (F1) |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| Keystroke errors 2% | 0.931 (0.003) | 0.929 (0.003) | 0.797 (0.003) | 0.753 (0.003) | 0.884 (0.003) | 0.772 (0.003) | 0.657 (0.003) |\n| Keystroke errors 5% | 0.859 (0.003) | 0.863 (0.003) | 0.699 (0.003) | 0.641 (0.003) | 0.824 (0.003) | 0.681 (0.003) | 0.53 (0.003) |\n| Keystroke errors 15% | 0.633 (0.006) | 0.662 (0.006) | 0.439 (0.006) | 0.358 (0.006) | 0.688 (0.006) | 0.459 (0.006) | 0.293 (0.006) |\n| Danish names | 0.979 (0.0) | 0.974 (0.0) | 0.867 (0.0) | 0.835 (0.0) | 0.943 (0.0) | 0.847 (0.0) | 0.748 (0.0) |\n| Muslim names | 0.979 (0.0) | 0.974 (0.0) | 0.865 (0.0) | 0.833 (0.0) | 0.94 (0.0) | 0.847 (0.0) | 0.732 (0.0) |\n| Female names | 0.979 (0.0) | 0.974 (0.0) | 0.867 (0.0) | 0.835 (0.0) | 0.946 (0.0) | 0.847 (0.0) | 0.754 (0.0) |\n| Male names | 0.979 (0.0) | 0.974 (0.0) | 0.867 (0.0) | 0.835 (0.0) | 0.943 (0.0) | 0.847 (0.0) | 0.748 (0.0) |\n| Spacing Augmentation 5% | 0.941 (0.002) | 0.936 (0.002) | 0.755 (0.002) | 0.725 (0.002) | 0.907 (0.002) | 0.811 (0.002) | 0.699 (0.002) |\n\n<details>\n\n<summary> Description of Augmenters </summary>\n\n \n\n**No augmentation:**\nApplies no augmentation to the DaNE test set.\n\n**\u00c6\u00f8\u00e5 Augmentation:**\nThis augmentation replaces \u00e6, \u00f8, and \u00e5 with their spelling variations ae, oe, and aa, respectively.\n\n**Lowercase:**\nThis augmentation lowercases all text.\n\n**No Spacing:**\nThis augmentation removes all spacing from the text.\n\n**Abbreviated first names:**\nThis augmentation abbreviates the first names of entities. For instance, 'Kenneth Enevoldsen' would turn into 'K. Enevoldsen'.\n\n**Keystroke errors 2%:**\nThis augmentation simulates keystroke errors by replacing 2% of keys with a neighbouring key on a Danish QWERTY keyboard. As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.\n\n**Keystroke errors 5%:**\nThis augmentation simulates keystroke errors by replacing 5% of keys with a neighbouring key on a Danish QWERTY keyboard. As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.\n\n**Keystroke errors 15%:**\nThis augmentation simulates keystroke errors by replacing 15% of keys with a neighbouring key on a Danish QWERTY keyboard. As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.\n\n**Danish names:**\nThis augmentation replaces all names with Danish names derived from Danmarks Statistik (2021). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.\n\n**Muslim names:**\nThis augmentation replaces all names with Muslim names derived from Meldgaard (2005). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.\n\n**Female names:**\nThis augmentation replaces all names with Danish female names derived from Danmarks Statistik (2021). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.\n\n**Male names:**\nThis augmentation replaces all names with Danish male names derived from Danmarks Statistik (2021). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.\n\n**Spacing Augmentation 5%:**\nThis augmentation replaces all names with Danish male names derived from Danmarks Statistik (2021). As this augmentation is stochastic, it is repeated 20 times to obtain a consistent estimate, and the mean is provided with its standard deviation in parentheses.\n </details> \n <br /> \n\n\n### Hardware\nThis was run and trained on a Quadro RTX 8000 GPU."
581
  }
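
The `notes` field in the meta.json above describes several deterministic augmentations of the DaNE test set in prose only. As a rough illustration of what three of them do to a string, here is a minimal sketch. The function names, the handling of uppercase letters, and the whitespace definition are assumptions made for this example and are not taken from DaCy's own implementation.

```python
# Illustrative sketch only: hypothetical helpers, not DaCy's API.

# Assumed mapping; the notes only name the lowercase variants ae, oe, and aa.
AEOEAA = {
    "æ": "ae", "ø": "oe", "å": "aa",
    "Æ": "Ae", "Ø": "Oe", "Å": "Aa",
}


def aeoeaa_augment(text: str) -> str:
    """Replace æ, ø, and å with the spelling variations ae, oe, and aa."""
    return "".join(AEOEAA.get(ch, ch) for ch in text)


def lowercase_augment(text: str) -> str:
    """Lowercase all text."""
    return text.lower()


def no_spacing_augment(text: str) -> str:
    """Remove all whitespace (assumed meaning of 'all spacing')."""
    return "".join(text.split())


if __name__ == "__main__":
    sample = "Århus har rød grød med fløde"
    print(aeoeaa_augment(sample))
    print(lowercase_augment(sample))
    print(no_spacing_augment(sample))
```

Because these transformations are deterministic, a single evaluation run per augmentation is enough, which matches the tables in the notes reporting one score per cell without a standard deviation.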
morphologizer/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e6593f44c84807d9093ae279607d28e4a3830cac3bd957ffa700d9f1992be852
3
  size 161992
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:601cec06d7bb6f1e2025cf6878f5c8fb02d89b5fc71ba82c80e718a28c63f87f
3
  size 161992
ner/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:cce84a8f6f8737880302491bc844be40901a9e33a7a8091647dd35b087c72ce3
3
  size 94890
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6c7bd95a31a59f7cb632de4a99c12643602828d312d04a7ba233f3bdb7f15778
3
  size 94890
parser/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d3df01ad2eae13f30f68b5ebe46564c0b167ce1026da37052967675a3b7f8438
3
  size 325085
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:db9711e97c156d5c9892a65b87d6a185289f74b92dcec527cf6906dfb6e821a6
3
  size 325085
transformer/model/pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:0b7a4a6fc863a3b3bd76b8b040ae7925f5a20d8d6987140b037ca9791b06ac0a
3
  size 54773654
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d65643fe23c672180685635b539688406638af1f7e515cb89505ea7626127400
3
  size 54773654
transformer/model/tokenizer_config.json CHANGED
@@ -1 +1 @@
1
- {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": false, "special_tokens_map_file": null, "full_tokenizer_file": null, "model_max_length": 128, "name_or_path": "Maltehb/-l-ctra-danish-electra-small-cased", "do_basic_tokenize": true, "never_split": null}
 
1
+ {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "full_tokenizer_file": null, "model_max_length": 128, "name_or_path": "Maltehb/-l-ctra-danish-electra-small-cased", "do_basic_tokenize": true, "never_split": null}
vocab/strings.json CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:86381420bcac876c95ffecbd0b41da7e614440eef239354586defa5b5a5e9735
3
- size 457618
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5b50a86603f748496e4fd87a8aaa203a32bf82d4b3768bf54187ff40de3ca6f9
3
+ size 460120