Commit 4f09c24 by danielschnell · 1 Parent(s): 11e15eb

Initial commit with adapted deliverables from Clarin: http://hdl.handle.net/20.500.12537/301

General:

- use submodules for electra model and Tokenizer package
- create missing directories
- add .python-version for 3.8 (same as for biaffine_parser)
- add .gitignore
- model file best.th in git-lfs

README.md:

- removed the Icelandic text because of redundancies



Signed-off-by: Daniel Schnell <dschnell@grammatek.com>

.gitignore ADDED
@@ -0,0 +1 @@
+ venv*/
.gitmodules ADDED
@@ -0,0 +1,3 @@
+ [submodule "transformer_models/electra-base-igc-is"]
+ path = transformer_models/electra-base-igc-is
+ url = https://huggingface.co/Icelandic-lt/electra-base-igc-is
.python-version ADDED
@@ -0,0 +1 @@
+ 3.8
README.md CHANGED
@@ -1,3 +1,69 @@
- ---
- license: apache-2.0
- ---
+ ## A Universal Dependencies parser built on top of a Transformer language model
+
+ Score on pre-tokenized test data:
+
+ ```
+ Metric     | Precision |    Recall |  F1 Score | AligndAcc
+ -----------+-----------+-----------+-----------+-----------
+ Tokens     |     99.70 |     99.77 |     99.73 |
+ Sentences  |    100.00 |    100.00 |    100.00 |
+ Words      |     99.62 |     99.61 |     99.61 |
+ UPOS       |     96.99 |     96.97 |     96.98 |     97.36
+ XPOS       |     93.65 |     93.64 |     93.65 |     94.01
+ UFeats     |     91.31 |     91.29 |     91.30 |     91.65
+ AllTags    |     86.86 |     86.85 |     86.86 |     87.19
+ Lemmas     |     95.83 |     95.81 |     95.82 |     96.19
+ UAS        |     89.01 |     89.00 |     89.00 |     89.35
+ LAS        |     85.72 |     85.70 |     85.71 |     86.04
+ CLAS       |     81.39 |     80.91 |     81.15 |     81.34
+ MLAS       |     69.21 |     68.81 |     69.01 |     69.17
+ BLEX       |     77.44 |     76.99 |     77.22 |     77.40
+ ```
+
+
+ Score on untokenized test data:
+
+ ```
+ Metric     | Precision |    Recall |  F1 Score | AligndAcc
+ -----------+-----------+-----------+-----------+-----------
+ Tokens     |     99.50 |     99.66 |     99.58 |
+ Sentences  |    100.00 |    100.00 |    100.00 |
+ Words      |     99.42 |     99.50 |     99.46 |
+ UPOS       |     96.80 |     96.88 |     96.84 |     97.37
+ XPOS       |     93.48 |     93.56 |     93.52 |     94.03
+ UFeats     |     91.13 |     91.20 |     91.16 |     91.66
+ AllTags    |     86.71 |     86.78 |     86.75 |     87.22
+ Lemmas     |     95.66 |     95.74 |     95.70 |     96.22
+ UAS        |     88.76 |     88.83 |     88.80 |     89.28
+ LAS        |     85.49 |     85.55 |     85.52 |     85.99
+ CLAS       |     81.19 |     80.73 |     80.96 |     81.31
+ MLAS       |     69.06 |     68.67 |     68.87 |     69.16
+ BLEX       |     77.28 |     76.84 |     77.06 |     77.39
+ ```
+
+ To use the model, you need to set up COMBO, which makes it possible to use word embeddings from a pre-trained transformer model ([electra-base-igc-is](https://huggingface.co/Icelandic-lt/electra-base-igc-is)).
+
+ ```bash
+ git submodule update --init --recursive
+ pip install -U pip setuptools wheel
+ pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.5
+ ```
+
+ * For Python 3.9, you might need to install Cython:
+
+ ```bash
+ pip install -U pip cython
+ ```
+
+ * You can then run the model as shown in `parse_file.py`.
+
+ For more instructions, see the COMBO repository: https://gitlab.clarin-pl.eu/syntactic-tools/combo
+
+ The Tokenizer directory is a clone of [Miðeind's tokenizer](https://github.com/icelandic-lt/Tokenizer).
+
+ The directory `transformer_models/` contains the pretrained model [electra-base-igc-is](https://huggingface.co/Icelandic-lt/electra-base-igc-is), trained by Jón Friðrik Daðason,
+ which supplies the parser with contextual embeddings and attention.
+
+
+ ## License
+ https://opensource.org/licenses/Apache-2.0
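
The F1 column in the README tables is the harmonic mean of the precision and recall columns. The following is a minimal sketch, assuming that standard definition; the function name `f1_score` is illustrative and not part of the repository. Recomputing from the rounded table values may differ from the reported figure in the last digit, since an evaluation script would normally work from raw counts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded LAS and CLAS rows from the pre-tokenized table in README.md.
rows = {"LAS": (85.72, 85.70), "CLAS": (81.39, 80.91)}
for metric, (p, r) in rows.items():
    print(f"{metric}: F1 = {f1_score(p, r):.2f}")
```

For these two rows the recomputed values agree with the F1 column of the table (85.71 and 81.15).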
Tokenizer ADDED
@@ -0,0 +1 @@
+ Subproject commit 5ae4551ad3a3a99ad657bd0528dd4147f4f5f95f
combo-is-combined-v211/best.th ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:98106212011d3b33fffa708d8497e0266bf233244b5209a6b2027640aeb729e9
+ size 488929748
combo-is-combined-v211/config.json ADDED
@@ -0,0 +1,355 @@
+ {
+     "dataset_reader": {
+         "type": "conllu",
+         "features": [
+             "token",
+             "char"
+         ],
+         "lemma_indexers": {
+             "char": {
+                 "type": "characters_const_padding",
+                 "character_tokenizer": {
+                     "end_tokens": [
+                         "__END__"
+                     ],
+                     "start_tokens": [
+                         "__START__"
+                     ]
+                 },
+                 "min_padding_length": 32,
+                 "namespace": "lemma_characters"
+             }
+         },
+         "targets": [
+             "deprel",
+             "feats",
+             "head",
+             "lemma",
+             "upostag",
+             "xpostag"
+         ],
+         "token_indexers": {
+             "char": {
+                 "type": "characters_const_padding",
+                 "character_tokenizer": {
+                     "end_tokens": [
+                         "__END__"
+                     ],
+                     "start_tokens": [
+                         "__START__"
+                     ]
+                 },
+                 "min_padding_length": 32
+             },
+             "feats": {
+                 "type": "feats_indexer"
+             },
+             "lemma": {
+                 "type": "characters_const_padding",
+                 "character_tokenizer": {
+                     "end_tokens": [
+                         "__END__"
+                     ],
+                     "start_tokens": [
+                         "__START__"
+                     ]
+                 },
+                 "min_padding_length": 32
+             },
+             "token": {
+                 "type": "pretrained_transformer_mismatched_fixed",
+                 "model_name": "transformer_models/electra-base-igc-is/",
+                 "tokenizer_kwargs": {}
+             },
+             "upostag": {
+                 "type": "single_id",
+                 "feature_name": "pos_",
+                 "namespace": "upostag"
+             },
+             "xpostag": {
+                 "type": "single_id",
+                 "feature_name": "tag_",
+                 "namespace": "xpostag"
+             }
+         },
+         "use_sem": false
+     },
+     "model": {
+         "type": "semantic_multitask",
+         "dependency_relation": {
+             "type": "combo_dependency_parsing_from_vocab",
+             "dependency_projection_layer": {
+                 "activation": "tanh",
+                 "dropout_rate": 0.25,
+                 "in_features": 1024,
+                 "out_features": 128
+             },
+             "head_predictor": {
+                 "cycle_loss_n": 0,
+                 "dependency_projection_layer": {
+                     "activation": "tanh",
+                     "in_features": 1024,
+                     "out_features": 512
+                 },
+                 "head_projection_layer": {
+                     "activation": "tanh",
+                     "in_features": 1024,
+                     "out_features": 512
+                 }
+             },
+             "head_projection_layer": {
+                 "activation": "tanh",
+                 "dropout_rate": 0.25,
+                 "in_features": 1024,
+                 "out_features": 128
+             },
+             "vocab_namespace": "deprel_labels"
+         },
+         "lemmatizer": {
+             "type": "combo_lemma_predictor_from_vocab",
+             "activations": [
+                 "relu",
+                 "relu",
+                 "relu",
+                 "linear"
+             ],
+             "char_vocab_namespace": "token_characters",
+             "dilation": [
+                 1,
+                 2,
+                 4,
+                 1
+             ],
+             "embedding_dim": 256,
+             "filters": [
+                 256,
+                 256,
+                 256
+             ],
+             "input_projection_layer": {
+                 "activation": "tanh",
+                 "dropout_rate": 0.25,
+                 "in_features": 1024,
+                 "out_features": 32
+             },
+             "kernel_size": [
+                 3,
+                 3,
+                 3,
+                 1
+             ],
+             "lemma_vocab_namespace": "lemma_characters",
+             "padding": [
+                 1,
+                 2,
+                 4,
+                 0
+             ],
+             "stride": [
+                 1,
+                 1,
+                 1,
+                 1
+             ]
+         },
+         "loss_weights": {
+             "deprel": 0.8,
+             "feats": 0.2,
+             "head": 0.2,
+             "lemma": 0.05,
+             "semrel": 0.05,
+             "upostag": 0.05,
+             "xpostag": 0.05
+         },
+         "morphological_feat": {
+             "type": "combo_morpho_from_vocab",
+             "activations": [
+                 "tanh",
+                 "linear"
+             ],
+             "dropout": [
+                 0.25,
+                 0
+             ],
+             "hidden_dims": [
+                 128
+             ],
+             "input_dim": 1024,
+             "num_layers": 2,
+             "vocab_namespace": "feats_labels"
+         },
+         "regularizer": {
+             "regexes": [
+                 [
+                     ".*conv1d.*",
+                     {
+                         "alpha": 1e-06,
+                         "type": "l2"
+                     }
+                 ],
+                 [
+                     ".*forward.*",
+                     {
+                         "alpha": 1e-06,
+                         "type": "l2"
+                     }
+                 ],
+                 [
+                     ".*backward.*",
+                     {
+                         "alpha": 1e-06,
+                         "type": "l2"
+                     }
+                 ],
+                 [
+                     ".*char_embed.*",
+                     {
+                         "alpha": 1e-05,
+                         "type": "l2"
+                     }
+                 ]
+             ]
+         },
+         "seq_encoder": {
+             "type": "combo_encoder",
+             "layer_dropout_probability": 0.33,
+             "stacked_bilstm": {
+                 "hidden_size": 512,
+                 "input_size": 164,
+                 "layer_dropout_probability": 0.33,
+                 "num_layers": 2,
+                 "recurrent_dropout_probability": 0.33
+             }
+         },
+         "text_field_embedder": {
+             "type": "basic",
+             "token_embedders": {
+                 "char": {
+                     "type": "char_embeddings_from_config",
+                     "dilated_cnn_encoder": {
+                         "activations": [
+                             "relu",
+                             "relu",
+                             "linear"
+                         ],
+                         "dilation": [
+                             1,
+                             2,
+                             4
+                         ],
+                         "filters": [
+                             512,
+                             256,
+                             64
+                         ],
+                         "input_dim": 64,
+                         "kernel_size": [
+                             3,
+                             3,
+                             3
+                         ],
+                         "padding": [
+                             1,
+                             2,
+                             4
+                         ],
+                         "stride": [
+                             1,
+                             1,
+                             1
+                         ]
+                     },
+                     "embedding_dim": 64
+                 },
+                 "token": {
+                     "type": "transformers_word_embeddings",
+                     "model_name": "transformer_models/electra-base-igc-is/",
+                     "projection_dim": 100
+                 }
+             }
+         },
+         "upos_tagger": {
+             "activations": [
+                 "tanh",
+                 "linear"
+             ],
+             "dropout": [
+                 0.25,
+                 0
+             ],
+             "hidden_dims": [
+                 64
+             ],
+             "input_dim": 1024,
+             "num_layers": 2,
+             "vocab_namespace": "upostag_labels"
+         },
+         "xpos_tagger": {
+             "activations": [
+                 "tanh",
+                 "linear"
+             ],
+             "dropout": [
+                 0.25,
+                 0
+             ],
+             "hidden_dims": [
+                 128
+             ],
+             "input_dim": 1024,
+             "num_layers": 2,
+             "vocab_namespace": "xpostag_labels"
+         }
+     },
+     "train_data_path": "is_combined-ud-train.conllu",
+     "validation_data_path": "is_combined-ud-dev.conllu",
+     "trainer": {
+         "type": "gradient_descent_validate_n",
+         "checkpointer": {
+             "type": "finishing_only_checkpointer"
+         },
+         "cuda_device": 0,
+         "epoch_callbacks": [
+             {
+                 "type": "transfer_patience"
+             }
+         ],
+         "grad_clipping": 5,
+         "learning_rate_scheduler": {
+             "type": "combo_scheduler"
+         },
+         "num_epochs": 400,
+         "optimizer": {
+             "type": "adam",
+             "betas": [
+                 0.9,
+                 0.9
+             ],
+             "lr": 0.002
+         },
+         "patience": 1,
+         "validation_metric": "+EM"
+     },
+     "vocabulary": {
+         "type": "from_instances_extended",
+         "non_padded_namespaces": [
+             "head_labels"
+         ],
+         "only_include_pretrained_words": true,
+         "oov_token": "_",
+         "padding_token": "__PAD__"
+     },
+     "type": "default",
+     "data_loader": {
+         "batch_sampler": {
+             "type": "token_count",
+             "word_batch_size": 2500
+         }
+     },
+     "datasets_for_vocab_creation": [
+         "train"
+     ],
+     "numpy_seed": 8787,
+     "pytorch_seed": 8787,
+     "random_seed": 8787
+ }
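
The layer sizes in `config.json` appear to fit together as follows: the character CNN ends with 64 filters and the transformer embedding is projected to 100 dimensions, which together match the BiLSTM `input_size` of 164, and a bidirectional LSTM with `hidden_size` 512 would produce the 1024 features that the prediction heads take as `input_dim` / `in_features`. The sketch below checks that arithmetic on values copied from the config. It is an interpretation of the numbers, assuming per-token concatenation and a bidirectional encoder, and is not code from COMBO; `conv1d_out_len` is a hypothetical helper implementing the standard 1-D convolution length formula.

```python
# Values copied from combo-is-combined-v211/config.json.
char_cnn = {"filters": [512, 256, 64], "kernel_size": [3, 3, 3],
            "dilation": [1, 2, 4], "padding": [1, 2, 4], "stride": [1, 1, 1]}
token_projection_dim = 100
bilstm = {"input_size": 164, "hidden_size": 512}
head_input_dim = 1024


def conv1d_out_len(length, kernel, dilation, padding, stride):
    """Standard output-length formula for a 1-D convolution."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


# Assumption: char and token embeddings are concatenated per token.
embedding_dim = char_cnn["filters"][-1] + token_projection_dim
# Assumption: the stacked BiLSTM concatenates both directions.
encoder_dim = 2 * bilstm["hidden_size"]

# With kernel size 3 and padding equal to dilation, each layer keeps the length.
length = 32  # min_padding_length from the character indexers
for k, d, p, s in zip(char_cnn["kernel_size"], char_cnn["dilation"],
                      char_cnn["padding"], char_cnn["stride"]):
    length = conv1d_out_len(length, k, d, p, s)

print(embedding_dim == bilstm["input_size"])
print(encoder_dim == head_input_dim)
print(length)
```

Under these assumptions the dimensions are consistent, and the dilated character CNN preserves the padded character sequence length of 32.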
combo-is-combined-v211/metrics.json ADDED
@@ -0,0 +1,81 @@
+ {
+     "best_epoch": 210,
+     "peak_worker_0_memory_MB": 8525.6796875,
+     "peak_gpu_0_memory_MB": 2498.61328125,
+     "training_duration": "7:24:49.411776",
+     "training_start_epoch": 0,
+     "training_epochs": 210,
+     "epoch": 210,
+     "training_UPOS_ACC": 0.9787108946284663,
+     "training_XPOS_ACC": 0.9391129495876493,
+     "training_SEMREL_ACC": 0.0,
+     "training_LEMMA_ACC": 0.9803898255479717,
+     "training_FEATS_ACC": 0.919626898433172,
+     "training_EM": 0.8210317492485046,
+     "training_UAS": 0.94157843024532,
+     "training_LAS": 0.9163996926968465,
+     "training_UEM": 0.5165940054495913,
+     "training_LEM": 0.3965122615803815,
+     "training_EUAS": 0.0,
+     "training_ELAS": 0.0,
+     "training_EUEM": 0.0,
+     "training_ELEM": 0.0,
+     "training_partial_loss/upostag_loss": 0.1905619502067566,
+     "training_partial_loss/xpostag_loss": 0.6266801357269287,
+     "training_partial_loss/feats_loss": 1.8628418445587158,
+     "training_partial_loss/lemma_loss": 0.06094415858387947,
+     "training_partial_loss/head_loss": 0.5034704804420471,
+     "training_partial_loss/deprel_loss": 0.3214435279369354,
+     "training_partial_loss/cycle_loss": 0.0,
+     "training_loss": 0.973405952516355,
+     "training_reg_loss": 0.2107603544565408,
+     "training_worker_0_memory_MB": 8525.6796875,
+     "training_gpu_0_memory_MB": 2498.61328125,
+     "training_patience": 1,
+     "validation_UPOS_ACC": 0.9768967445412763,
+     "validation_XPOS_ACC": 0.9429237756229287,
+     "validation_SEMREL_ACC": 0.0,
+     "validation_LEMMA_ACC": 0.9705140269765286,
+     "validation_FEATS_ACC": 0.9284603739617855,
+     "validation_EM": 0.7663625478744507,
+     "validation_UAS": 0.8889570803158627,
+     "validation_LAS": 0.8563888547931754,
+     "validation_UEM": 0.3770460601446517,
+     "validation_LEM": 0.2700799390940236,
+     "validation_EUAS": 0.0,
+     "validation_ELAS": 0.0,
+     "validation_EUEM": 0.0,
+     "validation_ELEM": 0.0,
+     "validation_partial_loss/upostag_loss": 0.3415434658527374,
+     "validation_partial_loss/xpostag_loss": 0.820046067237854,
+     "validation_partial_loss/feats_loss": 1.8135327100753784,
+     "validation_partial_loss/lemma_loss": 0.12137448042631149,
+     "validation_partial_loss/head_loss": 1.9168964624404907,
+     "validation_partial_loss/deprel_loss": 25.25812530517578,
+     "validation_partial_loss/cycle_loss": 0.0,
+     "validation_loss": 24.833287269904694,
+     "validation_reg_loss": 0.0,
+     "best_validation_UPOS_ACC": 0.9768967445412763,
+     "best_validation_XPOS_ACC": 0.9429237756229287,
+     "best_validation_SEMREL_ACC": 0.0,
+     "best_validation_LEMMA_ACC": 0.9705140269765286,
+     "best_validation_FEATS_ACC": 0.9284603739617855,
+     "best_validation_EM": 0.7663625478744507,
+     "best_validation_UAS": 0.8889570803158627,
+     "best_validation_LAS": 0.8563888547931754,
+     "best_validation_UEM": 0.3770460601446517,
+     "best_validation_LEM": 0.2700799390940236,
+     "best_validation_EUAS": 0.0,
+     "best_validation_ELAS": 0.0,
+     "best_validation_EUEM": 0.0,
+     "best_validation_ELEM": 0.0,
+     "best_validation_partial_loss/upostag_loss": 0.3415434658527374,
+     "best_validation_partial_loss/xpostag_loss": 0.820046067237854,
+     "best_validation_partial_loss/feats_loss": 1.8135327100753784,
+     "best_validation_partial_loss/lemma_loss": 0.12137448042631149,
+     "best_validation_partial_loss/head_loss": 1.9168964624404907,
+     "best_validation_partial_loss/deprel_loss": 25.25812530517578,
+     "best_validation_partial_loss/cycle_loss": 0.0,
+     "best_validation_loss": 24.833287269904694,
+     "best_validation_reg_loss": 0.0
+ }
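
To compare training and validation scores, the relevant keys can be read straight from `metrics.json`. The following is a minimal sketch using only the standard library. The four values are copied from the file above and embedded inline so the snippet runs without the repository; the helper name `generalization_gap` is illustrative.

```python
import json

# Excerpt of combo-is-combined-v211/metrics.json (values copied verbatim).
METRICS_JSON = """
{
  "training_UAS": 0.94157843024532,
  "training_LAS": 0.9163996926968465,
  "validation_UAS": 0.8889570803158627,
  "validation_LAS": 0.8563888547931754
}
"""


def generalization_gap(metrics: dict, name: str) -> float:
    """Training score minus validation score, in percentage points."""
    return 100 * (metrics[f"training_{name}"] - metrics[f"validation_{name}"])


metrics = json.loads(METRICS_JSON)
for name in ("UAS", "LAS"):
    print(f"{name}: {generalization_gap(metrics, name):.2f} points")
```

With the real file, `json.load(open("combo-is-combined-v211/metrics.json"))` could replace the inline excerpt.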
combo-is-combined-v211/vocabulary/.lock ADDED
File without changes
combo-is-combined-v211/vocabulary/deprel_labels.txt ADDED
@@ -0,0 +1,37 @@
+ _
+ punct
+ obl
+ nsubj
+ case
+ advmod
+ cc
+ conj
+ mark
+ root
+ obj
+ amod
+ nmod:poss
+ cop
+ ccomp
+ aux
+ det
+ advcl
+ acl
+ acl:relcl
+ compound:prt
+ appos
+ xcomp
+ iobj
+ dep
+ flat:name
+ nummod
+ parataxis
+ vocative
+ nmod
+ expl
+ flat:foreign
+ discourse
+ fixed
+ csubj
+ compound
+ dislocated
combo-is-combined-v211/vocabulary/feats_labels.txt ADDED
@@ -0,0 +1,50 @@
+ _
+ Number=Sing
+ Definite=Ind
+ Gender=Masc
+ Case=Nom
+ Voice=Act
+ VerbForm=Fin
+ Number=Plur
+ Mood=Ind
+ Gender=Neut
+ Person=3
+ PronType=Prs
+ Case=Acc
+ Case=Dat
+ Gender=Fem
+ Tense=Past
+ Tense=Pres
+ Case=Gen
+ Definite=Def
+ Degree=Pos
+ Person=1
+ VerbForm=Inf
+ PronType=Dem
+ PronType=Ind
+ Person=2
+ Mood=Sub
+ Voice=Mid
+ VerbForm=Part
+ VerbForm=Sup
+ Degree=Cmp
+ NumType=Card
+ Degree=Sup
+ Foreign=Yes
+ PronType=Int
+ Mood=Imp
+ NumType=Ord
+ NumType=Frac
+ Number=None
+ Definite=None
+ Gender=None
+ Case=None
+ Voice=None
+ VerbForm=None
+ Mood=None
+ Person=None
+ PronType=None
+ Tense=None
+ Degree=None
+ NumType=None
+ Foreign=None
combo-is-combined-v211/vocabulary/lemma_characters.txt ADDED
@@ -0,0 +1,94 @@
+ _
+ __START__
+ __END__
+ a
+ r
+ n
+ e
+ i
+ g
+ s
+ l
+ u
+ ð
+ t
+ k
+ h
+ m
+ v
+ f
+ o
+ þ
+ á
+ j
+ d
+ í
+ .
+ ,
+ ú
+ ó
+ b
+ y
+ p
+ é
+ æ
+ ö
+ "
+ ý
+ :
+ ?
+ ;
+ x
+ c
+ 1
+ 0
+ 2
+ !
+ z
+ 6
+ 5
+ -
+ 3
+ 4
+ 9
+ 7
+ 8
+ w
+
+
+ q
+ S
+
+ R
+ $
+ Þ
+ M
+ /
+ G
+ L
+ B
+ Ó
+ H
+ å
+ A
+ E
+ N
+ ü
+ ä
+ U
+ P
+ K
+ |
+ Å
+ I
+ Ð
+ V
+ F
+ O
+ T
+ Ú
+ Ö
+ Á
+ Í
+ <
+ >
combo-is-combined-v211/vocabulary/non_padded_namespaces.txt ADDED
@@ -0,0 +1 @@
+ head_labels
combo-is-combined-v211/vocabulary/token_characters.txt ADDED
@@ -0,0 +1,108 @@
+ _
+ __START__
+ __END__
+ a
+ r
+ n
+ i
+ e
+ s
+ u
+ t
+ ð
+ g
+ l
+ m
+ k
+ o
+ f
+ v
+ h
+ þ
+ á
+ d
+ í
+ .
+ ,
+ ó
+ j
+ b
+ y
+ ö
+ p
+ æ
+ ú
+ é
+ Þ
+ "
+ H
+ E
+ S
+ ý
+ G
+ :
+ V
+ A
+ K
+ J
+ M
+ O
+ B
+ F
+ N
+ ?
+ ;
+ D
+ x
+ L
+ R
+ Á
+ É
+ T
+ P
+ I
+ c
+ 1
+ Ó
+ Í
+ 0
+ 2
+ U
+ z
+ !
+ 6
+ -
+ Ö
+ 5
+ 3
+ 4
+ 9
+ 8
+ 7
+ C
+ Æ
+ Ú
+ X
+ Y
+ W
+
+
+ /
+ <
+ >
+ Ð
+ q
+ w
+ Ý
+
+ Z
+ å
+ Q
+ ü
+ ä
+ |
+ +
+ Å
+ $
+ &
+ '
combo-is-combined-v211/vocabulary/upostag_labels.txt ADDED
@@ -0,0 +1,17 @@
+ _
+ NOUN
+ VERB
+ PRON
+ PUNCT
+ ADP
+ ADV
+ CCONJ
+ AUX
+ DET
+ PROPN
+ SCONJ
+ ADJ
+ PART
+ NUM
+ X
+ INTJ
combo-is-combined-v211/vocabulary/xpostag_labels.txt ADDED
@@ -0,0 +1,280 @@
+ _
+ P
+ .
+ CONJ
+ ADV
+ PRO-N
+ VBDI
+ ,
+ N-A
+ C
+ N-D
+ N-N
+ PRO-D
+ VBPI
+ VB
+ NPR-N
+ PRO-A
+ ADJ-N
+ BEDI
+ N-G
+ NS-A
+ RP
+ TO
+ "
+ NS-D
+ PRO-G
+ BEPI
+ NS-N
+ NEG
+ VBN
+ ADVR
+ VAN
+ Q-N
+ NPR-D
+ ADJ-A
+ NPR-G
+ Q-A
+ D-N
+ D-A
+ MDPI
+ ADJ-D
+ NS-G
+ NPR-A
+ D-D
+ Q-D
+ HVDI
+ HVPI
+ VBPS
+ MDDS
+ VBDS
+ FW
+ WADV
+ NUM-A
+ ALSO
+ ADJ-G
+ BE
+ BEDS
+ MDDI
+ NUM-N
+ WPRO-N
+ VBI
+ ADVS
+ RDDI
+ ES
+ D-G
+ Q-G
+ BEPS
+ HV
+ ADJS-N
+ HVDS
+ BEN
+ ONE-N
+ INTJ
+ ADJR-N
+ VAG
+ ONE-A
+ RDPI
+ OTHER-A
+ DO
+ DODI
+ MDPS
+ NPRS-D
+ OTHER-N
+ MD
+ NPRS-N
+ OTHER-D
+ ONE-D
+ FP
+ WPRO-A
+ HVPS
+ NUM-D
+ QR-N
+ WQ
+ ADJS-A
+ RDN
+ QR-A
+ RD
+ DOPI
+ NUM-G
+ RPX
+ VAN-A
+ ADJR-A
+ ADJS-D
+ QS-N
+ SUCH-A
+ OTHER-G
+ NPRS-G
+ VAN-D
+ DON
+ ADJR-D
+ NPRS-A
+ DAN
+ RDPS
+ RDDS
+ WPRO-D
+ OTHERS-N
+ OTHERS-D
+ SUCH-N
+ WD-N
+ SUCH-D
+ QS-A
+ QS-D
+ HVN
+ QR-D
+ WPRO-G
+ OTHERS-A
+ WD-A
+ VAG-A
+ MDN
+ ADJR-G
+ VAG-D
+ FP-N
+ CONJ-1
+ OTHERS-G
+ WD-D
+ ONE-G
+ BEI
+ ADJS-G
+ Q
+ DOPS
+ HAN
+ NUM
+ MAN-N
+ SUCH-G
+ N
+ FP-A
+ WQ-N
+ DOI
+ QR-G
+ WPRO
+ CONJ-2
+ QS
+ VAN-G
+ HVI
+ VAG-G
+ FP-D
+ ADJ
+ QS-G
+ QR
+ DODS
+ X
+ SUCH
+ VBN-D
+ LS
+ CONJ-3
+ WQ-A
+ ADJR
+ CONJ-4
+ WD-G
+ VBN-A
+ VAN-N
+ BAN
+ VAG-N
+ DAN-A
+ NPR
+ VB-1
+ DAN-D
+ WQ-D
+ NS
+ DAG
+ WQ-G
+ :
+ HAG
+ OTHER
+ WPRO-1
+ MAN-D
+ as
+ ADJS
+ ;
+ VBN-G
+ WADJ-N
+ MAG
+ RX
+ PRO
+ RP-2
+ CONJ-5
+ MDI
+ LB
+ ONES-N
+ WADV-N
+ VB-3
+ MAN-A
+ RP-3
+ RDI
+ FP-G
+ WADV-D
+ NEG-1
+ FW-D
+ G
+ VB-2
+ TP
+ ONE
+ MANS-N
+ NP
+ VAN-A-4
+ CONJ-8
+ POR-D
+ CONJ-6
+ RPO-D
+ CONJ-9
+ VBDP
+ NUM-1
+ WPRO-N-1
+ DVBN
+ NPR-1
+ MS-N
+ POR-A
+ WRPO-N
+ QDJ-A
+ VAN-4
+ DV
+ BAG
+ Q-ADV
+ VPDI
+ ONES-G
+ NPR-V
+ ADV-N
+ PRO-N-YYY
+ REP
+ ADJP
+ FP-1
+ NEG-2
+ WADVP-10
+ FOREIGN
+ WADVP-5
+ RP-6
+ CONJ-7
+ WDD
+ IP-INF
+ WADV-A
+ ADJP-N
+ M-D
+ ADVS-N
+ ONES-A
+ FP-N-3
+ RAN
+ DW
+ ADV-1
+ VBPI$
+ ADVR-A
+ QADV
+ VDN
+ RD-3
+ ADVR-D
+ ES-A
+ PRON
+ OTHER-WPRO
+ MAG-A
+ WS-A
+ AN
+ RP-1
+ ES-1
+ PP
+ RAG
+ ALDJ-N
+ FP-2
+ N=G
+ WD
+ VB-4
+ HV-1
parse_file.py ADDED
@@ -0,0 +1,35 @@
+ from combo.predict import COMBO
+ from allennlp.data import tokenizers
+ from argparse import ArgumentParser
+
+ parser = ArgumentParser()
+ parser.add_argument('--parser')
+ parser.add_argument('--infile')
+ parser.add_argument('--pretokenized', action='store_true')
+ args = parser.parse_args()
+
+ # If your data is pre-tokenized, you can add the --pretokenized flag
+ # If you have a GPU available, you can add cuda_device=<your-device> to COMBO.from_pretrained
+ # The parser expects input in the same format as test_file.txt, i.e. one sentence per line
+
+ if args.pretokenized:
+     from Tokenizer.src.tokenizer import split_into_sentences
+     nlp = COMBO.from_pretrained('combo-is-combined-v211', tokenizer=tokenizers.SpacyTokenizer(split_on_spaces=True))
+ else:
+     nlp = COMBO.from_pretrained(args.parser)
+
+ def read_test_file(file):
+     with open(file, 'r', encoding='utf-8') as infile:
+         for line in infile:
+             if args.pretokenized:
+                 yield ' '.join(split_into_sentences(line))
+             else:
+                 yield line.rstrip()
+
+ test_file = read_test_file(args.infile)
+
+ for sent in test_file:
+     sentence = nlp(sent)
+     for index, token in enumerate(sentence.tokens, 1):
+         print(f'{token.id}\t{token.token}\t{token.lemma}\t{token.upostag}\t{token.xpostag}\t{token.feats}\t{token.head}\t{token.deprel}\t{token.deps}\t{token.misc}')
+     print()
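
`parse_file.py` prints one token per line as ten tab-separated fields in CoNLL-U column order, followed by an empty line after each sentence. The following is a minimal sketch of reading that output back into Python, assuming exactly that layout. The function name `read_parsed_output` and the sample rows are illustrative; the rows are hand-written and are not actual parser output.

```python
from typing import Dict, Iterable, Iterator, List

# Column order used by the print statement in parse_file.py.
FIELDS = ["id", "token", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "misc"]


def read_parsed_output(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one list of token dicts per sentence; blank lines end a sentence."""
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:  # last sentence without a trailing blank line
        yield sentence


# Hand-written sample rows (hypothetical annotations, for illustration only).
sample = [
    "1\tNjótið\tnjóta\tVERB\tVBI\t_\t0\troot\t_\t_\n",
    "2\tdagsins\tdagur\tNOUN\tN-G\t_\t1\tobj\t_\t_\n",
    "\n",
]
sentences = list(read_parsed_output(sample))
print(len(sentences), [tok["token"] for tok in sentences[0]])
```

In practice the `lines` argument could be an open file holding the redirected output of the script.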
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ allennlp==2.10.1
+ combo==0.1.3
+ diaparser==1.1.2
+ Tokenizer==3.4.2
test_file.txt ADDED
@@ -0,0 +1,3 @@
+ Komið þið sæl.
+ Þetta skjal er ætlað til að sýna hvernig þáttarinn virkar.
+ Njótið dagsins.
transformer_models/electra-base-igc-is ADDED
@@ -0,0 +1 @@
+ Subproject commit e2921de06b441e2a3066da485d6fa31cf5c816a8