oroszgy committed on
Commit
710b848
1 Parent(s): c7a7918

Update spacy pipeline to 3.5.1

README.md CHANGED
@@ -14,69 +14,69 @@ model-index:
14
  metrics:
15
  - name: NER Precision
16
  type: precision
17
- value: 0.8662957645
18
  - name: NER Recall
19
  type: recall
20
- value: 0.848628692
21
  - name: NER F Score
22
  type: f_score
23
- value: 0.8573712256
24
  - task:
25
  name: TAG
26
  type: token-classification
27
  metrics:
28
  - name: TAG (XPOS) Accuracy
29
  type: accuracy
30
- value: 0.9643028041
31
  - task:
32
  name: POS
33
  type: token-classification
34
  metrics:
35
  - name: POS (UPOS) Accuracy
36
  type: accuracy
37
- value: 0.9634414777
38
  - task:
39
  name: MORPH
40
  type: token-classification
41
  metrics:
42
  - name: Morph (UFeats) Accuracy
43
  type: accuracy
44
- value: 0.9310938846
45
  - task:
46
  name: LEMMA
47
  type: token-classification
48
  metrics:
49
  - name: Lemma Accuracy
50
  type: accuracy
51
- value: 0.9722514592
52
  - task:
53
  name: UNLABELED_DEPENDENCIES
54
  type: token-classification
55
  metrics:
56
  - name: Unlabeled Attachment Score (UAS)
57
  type: f_score
58
- value: 0.8222334626
59
  - task:
60
  name: LABELED_DEPENDENCIES
61
  type: token-classification
62
  metrics:
63
  - name: Labeled Attachment Score (LAS)
64
  type: f_score
65
- value: 0.75479121
66
  - task:
67
  name: SENTS
68
  type: token-classification
69
  metrics:
70
  - name: Sentences F-Score
71
  type: f_score
72
- value: 0.9753363229
73
  ---
74
  Core Hungarian model for HuSpaCy. Components: tok2vec, senter, tagger, morphologizer, lemmatizer, parser, ner
75
 
76
  | Feature | Description |
77
  | --- | --- |
78
  | **Name** | `hu_core_news_lg` |
79
- | **Version** | `3.5.0` |
80
  | **spaCy** | `>=3.5.0,<3.6.0` |
81
  | **Default Pipeline** | `tok2vec`, `senter`, `tagger`, `morphologizer`, `lookup_lemmatizer`, `lemmatizer`, `lemma_smoother`, `parser`, `ner` |
82
  | **Components** | `tok2vec`, `senter`, `tagger`, `morphologizer`, `lookup_lemmatizer`, `lemmatizer`, `lemma_smoother`, `parser`, `ner` |
@@ -108,18 +108,18 @@ Core Hungarian model for HuSpaCy. Components: tok2vec, senter, tagger, morpholog
108
  | `TOKEN_P` | 99.86 |
109
  | `TOKEN_R` | 99.93 |
110
  | `TOKEN_F` | 99.89 |
111
- | `SENTS_P` | 98.00 |
112
- | `SENTS_R` | 98.00 |
113
- | `SENTS_F` | 98.00 |
114
- | `TAG_ACC` | 96.76 |
115
- | `POS_ACC` | 96.62 |
116
- | `MORPH_ACC` | 93.54 |
117
- | `MORPH_MICRO_P` | 96.68 |
118
- | `MORPH_MICRO_R` | 96.24 |
119
- | `MORPH_MICRO_F` | 96.46 |
120
- | `LEMMA_ACC` | 97.33 |
121
- | `DEP_UAS` | 81.87 |
122
- | `DEP_LAS` | 74.99 |
123
- | `ENTS_P` | 86.26 |
124
- | `ENTS_R` | 85.76 |
125
- | `ENTS_F` | 86.01 |
14
  metrics:
15
  - name: NER Precision
16
  type: precision
17
+ value: 0.861328125
18
  - name: NER Recall
19
  type: recall
20
+ value: 0.8528481013
21
  - name: NER F Score
22
  type: f_score
23
+ value: 0.8570671378
24
  - task:
25
  name: TAG
26
  type: token-classification
27
  metrics:
28
  - name: TAG (XPOS) Accuracy
29
  type: accuracy
30
+ value: 0.9680845973
31
  - task:
32
  name: POS
33
  type: token-classification
34
  metrics:
35
  - name: POS (UPOS) Accuracy
36
  type: accuracy
37
+ value: 0.9686587875
38
  - task:
39
  name: MORPH
40
  type: token-classification
41
  metrics:
42
  - name: Morph (UFeats) Accuracy
43
  type: accuracy
44
+ value: 0.9363127422
45
  - task:
46
  name: LEMMA
47
  type: token-classification
48
  metrics:
49
  - name: Lemma Accuracy
50
  type: accuracy
51
+ value: 0.9747392594
52
  - task:
53
  name: UNLABELED_DEPENDENCIES
54
  type: token-classification
55
  metrics:
56
  - name: Unlabeled Attachment Score (UAS)
57
  type: f_score
58
+ value: 0.8158633861
59
  - task:
60
  name: LABELED_DEPENDENCIES
61
  type: token-classification
62
  metrics:
63
  - name: Labeled Attachment Score (LAS)
64
  type: f_score
65
+ value: 0.7489046175
66
  - task:
67
  name: SENTS
68
  type: token-classification
69
  metrics:
70
  - name: Sentences F-Score
71
  type: f_score
72
+ value: 0.983277592
73
  ---
74
  Core Hungarian model for HuSpaCy. Components: tok2vec, senter, tagger, morphologizer, lemmatizer, parser, ner
75
 
76
  | Feature | Description |
77
  | --- | --- |
78
  | **Name** | `hu_core_news_lg` |
79
+ | **Version** | `3.5.1` |
80
  | **spaCy** | `>=3.5.0,<3.6.0` |
81
  | **Default Pipeline** | `tok2vec`, `senter`, `tagger`, `morphologizer`, `lookup_lemmatizer`, `lemmatizer`, `lemma_smoother`, `parser`, `ner` |
82
  | **Components** | `tok2vec`, `senter`, `tagger`, `morphologizer`, `lookup_lemmatizer`, `lemmatizer`, `lemma_smoother`, `parser`, `ner` |
108
  | `TOKEN_P` | 99.86 |
109
  | `TOKEN_R` | 99.93 |
110
  | `TOKEN_F` | 99.89 |
111
+ | `SENTS_P` | 98.44 |
112
+ | `SENTS_R` | 98.22 |
113
+ | `SENTS_F` | 98.33 |
114
+ | `TAG_ACC` | 96.81 |
115
+ | `POS_ACC` | 96.87 |
116
+ | `MORPH_ACC` | 93.63 |
117
+ | `MORPH_MICRO_P` | 96.93 |
118
+ | `MORPH_MICRO_R` | 96.36 |
119
+ | `MORPH_MICRO_F` | 96.65 |
120
+ | `LEMMA_ACC` | 97.47 |
121
+ | `DEP_UAS` | 81.59 |
122
+ | `DEP_LAS` | 74.89 |
123
+ | `ENTS_P` | 86.13 |
124
+ | `ENTS_R` | 85.28 |
125
+ | `ENTS_F` | 85.71 |
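
The tables above list the packaged pipeline components and their accuracy on the test set. As a quick orientation, a minimal usage sketch (assuming the wheel shipped in this repository, `hu_core_news_lg-any-py3-none-any.whl`, has been installed):

```python
# Minimal usage sketch; assumes the packaged wheel has been installed, e.g.
#   pip install hu_core_news_lg-any-py3-none-any.whl
import spacy

nlp = spacy.load("hu_core_news_lg")
doc = nlp("Az alma nem esik messze a fájától.")
for token in doc:
    # lemma, coarse POS and dependency label come from the components listed above
    print(token.text, token.lemma_, token.pos_, token.dep_)
for ent in doc.ents:
    print(ent.text, ent.label_)
```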
config.cfg CHANGED
@@ -1,8 +1,8 @@
1
  [paths]
2
- parser_model = "models/hu_core_news_lg-parser-3.5.0/model-best"
3
- ner_model = "models/hu_core_news_lg-ner-3.5.0/model-best"
4
- lemmatizer_lookups = "models/hu_core_news_lg-lookup-lemmatizer-3.5.0"
5
- tagger_model = "models/hu_core_news_lg-tagger-3.5.0/model-best"
6
  train = null
7
  dev = null
8
  vectors = null
1
  [paths]
2
+ parser_model = "models/hu_core_news_lg-parser-3.5.1/model-best"
3
+ ner_model = "models/hu_core_news_lg-ner-3.5.1/model-best"
4
+ lemmatizer_lookups = "models/hu_core_news_lg-lookup-lemmatizer-3.5.1"
5
+ tagger_model = "models/hu_core_news_lg-tagger-3.5.1/model-best"
6
  train = null
7
  dev = null
8
  vectors = null
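
The `[paths]` block only pins the training-time locations of the sourced components; the packaged pipeline does not need these directories at runtime. A hedged sketch of overriding one of them when re-loading this config (the override value is illustrative):

```python
# Hedged sketch: loading config.cfg with a path override.
# spacy.util.load_config and the overrides mechanism are standard spaCy 3.x APIs;
# the concrete path below is just an example.
from spacy import util

config = util.load_config(
    "config.cfg",
    overrides={"paths.parser_model": "models/hu_core_news_lg-parser-3.5.1/model-best"},
)
print(config["paths"]["parser_model"])
```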
edit_tree_lemmatizer.py ADDED
@@ -0,0 +1,465 @@
1
+ from functools import lru_cache
2
+
3
+ from typing import cast, Any, Callable, Dict, Iterable, List, Optional
4
+ from typing import Sequence, Tuple, Union
5
+ from collections import Counter
6
+ from copy import deepcopy
7
+ from itertools import islice
8
+ import numpy as np
9
+
10
+ import srsly
11
+ from thinc.api import Config, Model, SequenceCategoricalCrossentropy, NumpyOps
12
+ from thinc.types import Floats2d, Ints2d
13
+
14
+ from spacy.pipeline._edit_tree_internals.edit_trees import EditTrees
15
+ from spacy.pipeline._edit_tree_internals.schemas import validate_edit_tree
16
+ from spacy.pipeline.lemmatizer import lemmatizer_score
17
+ from spacy.pipeline.trainable_pipe import TrainablePipe
18
+ from spacy.errors import Errors
19
+ from spacy.language import Language
20
+ from spacy.tokens import Doc, Token
21
+ from spacy.training import Example, validate_examples, validate_get_examples
22
+ from spacy.vocab import Vocab
23
+ from spacy import util
24
+
25
+
26
+ TOP_K_GUARDRAIL = 20
27
+
28
+
29
+ default_model_config = """
30
+ [model]
31
+ @architectures = "spacy.Tagger.v2"
32
+
33
+ [model.tok2vec]
34
+ @architectures = "spacy.HashEmbedCNN.v2"
35
+ pretrained_vectors = null
36
+ width = 96
37
+ depth = 4
38
+ embed_size = 2000
39
+ window_size = 1
40
+ maxout_pieces = 3
41
+ subword_features = true
42
+ """
43
+ DEFAULT_EDIT_TREE_LEMMATIZER_MODEL = Config().from_str(default_model_config)["model"]
44
+
45
+
46
+ @Language.factory(
47
+ "trainable_lemmatizer_v2",
48
+ assigns=["token.lemma"],
49
+ requires=[],
50
+ default_config={
51
+ "model": DEFAULT_EDIT_TREE_LEMMATIZER_MODEL,
52
+ "backoff": "orth",
53
+ "min_tree_freq": 3,
54
+ "overwrite": False,
55
+ "top_k": 1,
56
+ "overwrite_labels": True,
57
+ "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
58
+ },
59
+ default_score_weights={"lemma_acc": 1.0},
60
+ )
61
+ def make_edit_tree_lemmatizer(
62
+ nlp: Language,
63
+ name: str,
64
+ model: Model,
65
+ backoff: Optional[str],
66
+ min_tree_freq: int,
67
+ overwrite: bool,
68
+ top_k: int,
69
+ overwrite_labels: bool,
70
+ scorer: Optional[Callable],
71
+ ):
72
+ """Construct an EditTreeLemmatizer component."""
73
+ return EditTreeLemmatizer(
74
+ nlp.vocab,
75
+ model,
76
+ name,
77
+ backoff=backoff,
78
+ min_tree_freq=min_tree_freq,
79
+ overwrite=overwrite,
80
+ top_k=top_k,
81
+ overwrite_labels=overwrite_labels,
82
+ scorer=scorer,
83
+ )
84
+
85
+
86
+ # _f = open("lemmatizer.log", "w")
87
+ # def debug(*args):
88
+ # _f.write(" ".join(args) + "\n")
89
+ def debug(*args):
90
+ pass
91
+
92
+
93
+ class EditTreeLemmatizer(TrainablePipe):
94
+ """
95
+ Lemmatizer that lemmatizes each word using a predicted edit tree.
96
+ """
97
+
98
+ def __init__(
99
+ self,
100
+ vocab: Vocab,
101
+ model: Model,
102
+ name: str = "trainable_lemmatizer",
103
+ *,
104
+ backoff: Optional[str] = "orth",
105
+ min_tree_freq: int = 3,
106
+ overwrite: bool = False,
107
+ top_k: int = 1,
108
+ overwrite_labels,
109
+ scorer: Optional[Callable] = lemmatizer_score,
110
+ ):
111
+ """
112
+ Construct an edit tree lemmatizer.
113
+
114
+ backoff (Optional[str]): backoff to use when the predicted edit trees
115
+ are not applicable. Must be an attribute of Token or None (leave the
116
+ lemma unset).
117
+ min_tree_freq (int): prune trees that are applied less than this
118
+ frequency in the training data.
119
+ overwrite (bool): overwrite existing lemma annotations.
120
+ top_k (int): try to apply at most the k most probable edit trees.
121
+ """
122
+ self.vocab = vocab
123
+ self.model = model
124
+ self.name = name
125
+ self.backoff = backoff
126
+ self.min_tree_freq = min_tree_freq
127
+ self.overwrite = overwrite
128
+ self.top_k = top_k
129
+ self.overwrite_labels = overwrite_labels
130
+
131
+ self.trees = EditTrees(self.vocab.strings)
132
+ self.tree2label: Dict[int, int] = {}
133
+
134
+ self.cfg: Dict[str, Any] = {"labels": []}
135
+ self.scorer = scorer
136
+ self.numpy_ops = NumpyOps()
137
+
138
+ def get_loss(
139
+ self, examples: Iterable[Example], scores: List[Floats2d]
140
+ ) -> Tuple[float, List[Floats2d]]:
141
+ validate_examples(examples, "EditTreeLemmatizer.get_loss")
142
+ loss_func = SequenceCategoricalCrossentropy(normalize=False, missing_value=-1)
143
+
144
+ truths = []
145
+ for eg in examples:
146
+ eg_truths = []
147
+ for (predicted, gold_lemma, gold_pos, gold_sent_start) in zip(
148
+ eg.predicted,
149
+ eg.get_aligned("LEMMA", as_string=True),
150
+ eg.get_aligned("POS", as_string=True),
151
+ eg.get_aligned_sent_starts(),
152
+ ):
153
+ if gold_lemma is None:
154
+ label = -1
155
+ else:
156
+ form = self._get_true_cased_form(
157
+ predicted.text, gold_sent_start, gold_pos
158
+ )
159
+ tree_id = self.trees.add(form, gold_lemma)
160
+ # debug(f"@get_loss: {predicted}/{gold_pos}[{gold_sent_start}]->{form}|{gold_lemma}[{tree_id}]")
161
+ label = self.tree2label.get(tree_id, 0)
162
+ eg_truths.append(label)
163
+
164
+ truths.append(eg_truths)
165
+
166
+ d_scores, loss = loss_func(scores, truths)
167
+ if self.model.ops.xp.isnan(loss):
168
+ raise ValueError(Errors.E910.format(name=self.name))
169
+
170
+ return float(loss), d_scores
171
+
172
+ def predict(self, docs: Iterable[Doc]) -> List[Ints2d]:
173
+ if self.top_k == 1:
174
+ scores2guesses = self._scores2guesses_top_k_equals_1
175
+ elif self.top_k <= TOP_K_GUARDRAIL:
176
+ scores2guesses = self._scores2guesses_top_k_greater_1
177
+ else:
178
+ scores2guesses = self._scores2guesses_top_k_guardrail
179
+ # The behaviour of *_scores2guesses_top_k_greater_1()* is efficient for values
180
+ # of *top_k>1* that are likely to be useful when the edit tree lemmatizer is used
181
+ # for its principal purpose of lemmatizing tokens. However, the code could also
182
+ # be used for other purposes, and with very large values of *top_k* the method
183
+ # becomes inefficient. In such cases, *_scores2guesses_top_k_guardrail()* is used
184
+ # instead.
185
+ n_docs = len(list(docs))
186
+ if not any(len(doc) for doc in docs):
187
+ # Handle cases where there are no tokens in any docs.
188
+ n_labels = len(self.cfg["labels"])
189
+ guesses: List[Ints2d] = [self.model.ops.alloc2i(0, n_labels) for _ in docs]
190
+ assert len(guesses) == n_docs
191
+ return guesses
192
+ scores = self.model.predict(docs)
193
+ assert len(scores) == n_docs
194
+ guesses = scores2guesses(docs, scores)
195
+ assert len(guesses) == n_docs
196
+ return guesses
197
+
198
+ def _scores2guesses_top_k_equals_1(self, docs, scores):
199
+ guesses = []
200
+ for doc, doc_scores in zip(docs, scores):
201
+ doc_guesses = doc_scores.argmax(axis=1)
202
+ doc_guesses = self.numpy_ops.asarray(doc_guesses)
203
+
204
+ doc_compat_guesses = []
205
+ for i, token in enumerate(doc):
206
+ tree_id = self.cfg["labels"][doc_guesses[i]]
207
+ form: str = self._get_true_cased_form_of_token(token)
208
+ if self.trees.apply(tree_id, form) is not None:
209
+ doc_compat_guesses.append(tree_id)
210
+ else:
211
+ doc_compat_guesses.append(-1)
212
+ guesses.append(np.array(doc_compat_guesses))
213
+
214
+ return guesses
215
+
216
+ def _scores2guesses_top_k_greater_1(self, docs, scores):
217
+ guesses = []
218
+ top_k = min(self.top_k, len(self.labels))
219
+ for doc, doc_scores in zip(docs, scores):
220
+ doc_scores = self.numpy_ops.asarray(doc_scores)
221
+ doc_compat_guesses = []
222
+ for i, token in enumerate(doc):
223
+ for _ in range(top_k):
224
+ candidate = int(doc_scores[i].argmax())
225
+ candidate_tree_id = self.cfg["labels"][candidate]
226
+ form: str = self._get_true_cased_form_of_token(token)
227
+ if self.trees.apply(candidate_tree_id, form) is not None:
228
+ doc_compat_guesses.append(candidate_tree_id)
229
+ break
230
+ doc_scores[i, candidate] = np.finfo(np.float32).min
231
+ else:
232
+ doc_compat_guesses.append(-1)
233
+ guesses.append(np.array(doc_compat_guesses))
234
+
235
+ return guesses
236
+
237
+ def _scores2guesses_top_k_guardrail(self, docs, scores):
238
+ guesses = []
239
+ for doc, doc_scores in zip(docs, scores):
240
+ doc_guesses = np.argsort(doc_scores)[..., : -self.top_k - 1 : -1]
241
+ doc_guesses = self.numpy_ops.asarray(doc_guesses)
242
+
243
+ doc_compat_guesses = []
244
+ for token, candidates in zip(doc, doc_guesses):
245
+ tree_id = -1
246
+ for candidate in candidates:
247
+ candidate_tree_id = self.cfg["labels"][candidate]
248
+
249
+ form: str = self._get_true_cased_form_of_token(token)
250
+
251
+ if self.trees.apply(candidate_tree_id, form) is not None:
252
+ tree_id = candidate_tree_id
253
+ break
254
+ doc_compat_guesses.append(tree_id)
255
+
256
+ guesses.append(np.array(doc_compat_guesses))
257
+
258
+ return guesses
259
+
260
+ def set_annotations(self, docs: Iterable[Doc], batch_tree_ids):
261
+ for i, doc in enumerate(docs):
262
+ doc_tree_ids = batch_tree_ids[i]
263
+ if hasattr(doc_tree_ids, "get"):
264
+ doc_tree_ids = doc_tree_ids.get()
265
+ for j, tree_id in enumerate(doc_tree_ids):
266
+ if self.overwrite or doc[j].lemma == 0:
267
+ # If no applicable tree could be found during prediction,
268
+ # the special identifier -1 is used. Otherwise the tree
269
+ # is guaranteed to be applicable.
270
+ if tree_id == -1:
271
+ if self.backoff is not None:
272
+ doc[j].lemma = getattr(doc[j], self.backoff)
273
+ else:
274
+ form = self._get_true_cased_form_of_token(doc[j])
275
+ lemma = self.trees.apply(tree_id, form) or form
276
+ # debug(f"@set_annotations: {doc[j]}/{doc[j].pos_}[{doc[j].is_sent_start}]->{form}|{lemma}[{tree_id}]")
277
+ doc[j].lemma_ = lemma
278
+
279
+ @property
280
+ def labels(self) -> Tuple[int, ...]:
281
+ """Returns the labels currently added to the component."""
282
+ return tuple(self.cfg["labels"])
283
+
284
+ @property
285
+ def hide_labels(self) -> bool:
286
+ return True
287
+
288
+ @property
289
+ def label_data(self) -> Dict:
290
+ trees = []
291
+ for tree_id in range(len(self.trees)):
292
+ tree = self.trees[tree_id]
293
+ if "orig" in tree:
294
+ tree["orig"] = self.vocab.strings[tree["orig"]]
295
+ if "subst" in tree:
296
+ tree["subst"] = self.vocab.strings[tree["subst"]]
297
+ trees.append(tree)
298
+ return dict(trees=trees, labels=tuple(self.cfg["labels"]))
299
+
300
+ def initialize(
301
+ self,
302
+ get_examples: Callable[[], Iterable[Example]],
303
+ *,
304
+ nlp: Optional[Language] = None,
305
+ labels: Optional[Dict] = None,
306
+ ):
307
+ validate_get_examples(get_examples, "EditTreeLemmatizer.initialize")
308
+
309
+ if self.overwrite_labels:
310
+ if labels is None:
311
+ self._labels_from_data(get_examples)
312
+ else:
313
+ self._add_labels(labels)
314
+
315
+ # Sample for the model.
316
+ doc_sample = []
317
+ label_sample = []
318
+ for example in islice(get_examples(), 10):
319
+ doc_sample.append(example.x)
320
+ gold_labels: List[List[float]] = []
321
+ for token in example.reference:
322
+ if token.lemma == 0:
323
+ gold_label = None
324
+ else:
325
+ gold_label = self._pair2label(token.text, token.lemma_)
326
+
327
+ gold_labels.append(
328
+ [
329
+ 1.0 if label == gold_label else 0.0
330
+ for label in self.cfg["labels"]
331
+ ]
332
+ )
333
+
334
+ gold_labels = cast(Floats2d, gold_labels)
335
+ label_sample.append(self.model.ops.asarray(gold_labels, dtype="float32"))
336
+
337
+ self._require_labels()
338
+ assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
339
+ assert len(label_sample) > 0, Errors.E923.format(name=self.name)
340
+
341
+ self.model.initialize(X=doc_sample, Y=label_sample)
342
+
343
+ def from_bytes(self, bytes_data, *, exclude=tuple()):
344
+ deserializers = {
345
+ "cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
346
+ "model": lambda b: self.model.from_bytes(b),
347
+ "vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
348
+ "trees": lambda b: self.trees.from_bytes(b),
349
+ }
350
+
351
+ util.from_bytes(bytes_data, deserializers, exclude)
352
+
353
+ return self
354
+
355
+ def to_bytes(self, *, exclude=tuple()):
356
+ serializers = {
357
+ "cfg": lambda: srsly.json_dumps(self.cfg),
358
+ "model": lambda: self.model.to_bytes(),
359
+ "vocab": lambda: self.vocab.to_bytes(exclude=exclude),
360
+ "trees": lambda: self.trees.to_bytes(),
361
+ }
362
+
363
+ return util.to_bytes(serializers, exclude)
364
+
365
+ def to_disk(self, path, exclude=tuple()):
366
+ path = util.ensure_path(path)
367
+ serializers = {
368
+ "cfg": lambda p: srsly.write_json(p, self.cfg),
369
+ "model": lambda p: self.model.to_disk(p),
370
+ "vocab": lambda p: self.vocab.to_disk(p, exclude=exclude),
371
+ "trees": lambda p: self.trees.to_disk(p),
372
+ }
373
+ util.to_disk(path, serializers, exclude)
374
+
375
+ def from_disk(self, path, exclude=tuple()):
376
+ def load_model(p):
377
+ try:
378
+ with open(p, "rb") as mfile:
379
+ self.model.from_bytes(mfile.read())
380
+ except AttributeError:
381
+ raise ValueError(Errors.E149) from None
382
+
383
+ deserializers = {
384
+ "cfg": lambda p: self.cfg.update(srsly.read_json(p)),
385
+ "model": load_model,
386
+ "vocab": lambda p: self.vocab.from_disk(p, exclude=exclude),
387
+ "trees": lambda p: self.trees.from_disk(p),
388
+ }
389
+
390
+ util.from_disk(path, deserializers, exclude)
391
+ return self
392
+
393
+ def _add_labels(self, labels: Dict):
394
+ if "labels" not in labels:
395
+ raise ValueError(Errors.E857.format(name="labels"))
396
+ if "trees" not in labels:
397
+ raise ValueError(Errors.E857.format(name="trees"))
398
+
399
+ self.cfg["labels"] = list(labels["labels"])
400
+ trees = []
401
+ for tree in labels["trees"]:
402
+ errors = validate_edit_tree(tree)
403
+ if errors:
404
+ raise ValueError(Errors.E1026.format(errors="\n".join(errors)))
405
+
406
+ tree = dict(tree)
407
+ if "orig" in tree:
408
+ tree["orig"] = self.vocab.strings[tree["orig"]]
409
+ if "subst" in tree:
410
+ tree["subst"] = self.vocab.strings[tree["subst"]]
411
+
412
+ trees.append(tree)
413
+
414
+ self.trees.from_json(trees)
415
+
416
+ for label, tree in enumerate(self.labels):
417
+ self.tree2label[tree] = label
418
+
419
+ def _labels_from_data(self, get_examples: Callable[[], Iterable[Example]]):
420
+ # Count corpus tree frequencies in ad-hoc storage to avoid cluttering
421
+ # the final pipe/string store.
422
+ vocab = Vocab()
423
+ trees = EditTrees(vocab.strings)
424
+ tree_freqs: Counter = Counter()
425
+ repr_pairs: Dict = {}
426
+ for example in get_examples():
427
+ for token in example.reference:
428
+ if token.lemma != 0:
429
+ form = self._get_true_cased_form_of_token(token)
430
+ # debug("_labels_from_data", str(token) + "->" + form, token.lemma_)
431
+ tree_id = trees.add(form, token.lemma_)
432
+ tree_freqs[tree_id] += 1
433
+ repr_pairs[tree_id] = (form, token.lemma_)
434
+
435
+ # Construct trees that make the frequency cut-off using representative
436
+ # form - token pairs.
437
+ for tree_id, freq in tree_freqs.items():
438
+ if freq >= self.min_tree_freq:
439
+ form, lemma = repr_pairs[tree_id]
440
+ self._pair2label(form, lemma, add_label=True)
441
+
442
+ @lru_cache()
443
+ def _get_true_cased_form(self, token: str, is_sent_start: bool, pos: str) -> str:
444
+ if is_sent_start and pos != "PROPN":
445
+ return token.lower()
446
+ else:
447
+ return token
448
+
449
+ def _get_true_cased_form_of_token(self, token: Token) -> str:
450
+ return self._get_true_cased_form(token.text, token.is_sent_start, token.pos_)
451
+
452
+ def _pair2label(self, form, lemma, add_label=False):
453
+ """
454
+ Look up the edit tree identifier for a form/label pair. If the edit
455
+ tree is unknown and "add_label" is set, the edit tree will be added to
456
+ the labels.
457
+ """
458
+ tree_id = self.trees.add(form, lemma)
459
+ if tree_id not in self.tree2label:
460
+ if not add_label:
461
+ return None
462
+
463
+ self.tree2label[tree_id] = len(self.cfg["labels"])
464
+ self.cfg["labels"].append(tree_id)
465
+ return self.tree2label[tree_id]
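
This module registers a `trainable_lemmatizer_v2` factory: an edit-tree lemmatizer variant that true-cases sentence-initial, non-proper-noun tokens before predicting and applying an edit tree, and that can keep previously learned labels via `overwrite_labels`. A hedged sketch of adding it to a blank pipeline (training itself is omitted; the module only needs to be importable for the factory to register):

```python
# Hedged sketch: wiring the custom factory into a pipeline.
import spacy
import edit_tree_lemmatizer  # noqa: F401  - importing registers "trainable_lemmatizer_v2"

nlp = spacy.blank("hu")
nlp.add_pipe(
    "trainable_lemmatizer_v2",
    config={"backoff": "orth", "min_tree_freq": 3, "top_k": 1, "overwrite_labels": True},
)
# After nlp.initialize(...) and training, the component predicts one edit tree per
# token and applies it to the true-cased form; -1 (no applicable tree) falls back
# to the Token attribute named by `backoff`.
```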
hu_core_news_lg-any-py3-none-any.whl CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:68356766408dd914bc61b88be6ef02c4c237fb979b9e107835aa1928261d0bd6
3
- size 401362147
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:266ecaf2bc079609a5f8d8438e98d690b8a63e560f0a8d5af1bfb8ce24a9ff02
3
+ size 401249360
lemma_postprocessing.py ADDED
@@ -0,0 +1,113 @@
1
+ """
2
+ This module contains various rule-based components aiming to improve on baseline lemmatization tools.
3
+ """
4
+
5
+ import re
6
+ from typing import List, Callable
7
+
8
+ from spacy.lang.hu import Hungarian
9
+ from spacy.pipeline import Pipe
10
+ from spacy.tokens import Token
11
+ from spacy.tokens.doc import Doc
12
+
13
+
14
+ @Hungarian.component(
15
+ "lemma_case_smoother",
16
+ assigns=["token.lemma"],
17
+ requires=["token.lemma", "token.pos"],
18
+ )
19
+ def lemma_case_smoother(doc: Doc) -> Doc:
20
+ """Smooth lemma casing by POS.
21
+
22
+ DEPRECATED: This is not needed anymore, as the lemmatizer is now case-insensitive.
23
+
24
+ Args:
25
+ doc (Doc): Input document.
26
+
27
+ Returns:
28
+ Doc: Output document.
29
+ """
30
+ for token in doc:
31
+ if token.is_sent_start and token.tag_ != "PROPN":
32
+ token.lemma_ = token.lemma_.lower()
33
+
34
+ return doc
35
+
36
+
37
+ class LemmaSmoother(Pipe):
38
+ """Smooths lemma by fixing common errors of the edit-tree lemmatizer."""
39
+
40
+ _DATE_PATTERN = re.compile(r"(\d+)-j?[éá]?n?a?(t[őó]l)?")
41
+ _NUMBER_PATTERN = re.compile(r"(\d+([-,/_.:]?(._)?\d+)*%?)")
42
+
43
+ # noinspection PyUnusedLocal
44
+ @staticmethod
45
+ @Hungarian.factory("lemma_smoother", assigns=["token.lemma"], requires=["token.lemma", "token.pos"])
46
+ def create_lemma_smoother(nlp: Hungarian, name: str) -> "LemmaSmoother":
47
+ return LemmaSmoother()
48
+
49
+ def __call__(self, doc: Doc) -> Doc:
50
+ rules: List[Callable] = [
51
+ self._remove_exclamation_marks,
52
+ self._remove_question_marks,
53
+ self._remove_date_suffixes,
54
+ self._remove_suffix_after_numbers,
55
+ ]
56
+
57
+ for token in doc:
58
+ for rule in rules:
59
+ rule(token)
60
+
61
+ return doc
62
+
63
+ @classmethod
64
+ def _remove_exclamation_marks(cls, token: Token) -> None:
65
+ """Removes exclamation marks from the lemma.
66
+
67
+ Args:
68
+ token (Token): The original token.
69
+ """
70
+
71
+ if "!" != token.lemma_:
72
+ exclamation_mark_index = token.lemma_.find("!")
73
+ if exclamation_mark_index != -1:
74
+ token.lemma_ = token.lemma_[:exclamation_mark_index]
75
+
76
+ @classmethod
77
+ def _remove_question_marks(cls, token: Token) -> None:
78
+ """Removes question marks from the lemma.
79
+
80
+ Args:
81
+ token (Token): The original token.
82
+ """
83
+
84
+ if "?" != token.lemma_:
85
+ question_mark_index = token.lemma_.find("?")
86
+ if question_mark_index != -1:
87
+ token.lemma_ = token.lemma_[:question_mark_index]
88
+
89
+ @classmethod
90
+ def _remove_date_suffixes(cls, token: Token) -> None:
91
+ """Fixes the suffixes of dates.
92
+
93
+ Args:
94
+ token (Token): The original token.
95
+ """
96
+
97
+ if token.pos_ == "NOUN":
98
+ match = cls._DATE_PATTERN.match(token.lemma_)
99
+ if match is not None:
100
+ token.lemma_ = match.group(1) + "."
101
+
102
+ @classmethod
103
+ def _remove_suffix_after_numbers(cls, token: Token) -> None:
104
+ """Removes suffixes after numbers.
105
+
106
+ Args:
107
+ token (Token): The original token.
108
+ """
109
+
110
+ if token.pos_ == "NUM":
111
+ match = cls._NUMBER_PATTERN.match(token.text)
112
+ if match is not None:
113
+ token.lemma_ = match.group(0)
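
`LemmaSmoother` patches common edit-tree lemmatizer mistakes with four token-level rules (exclamation/question mark trimming, date suffixes, numeric suffixes). A hedged illustration of what the two regexes match, using made-up example strings:

```python
# Illustration only: the patterns are copied from LemmaSmoother above,
# the example strings are invented.
import re

_DATE_PATTERN = re.compile(r"(\d+)-j?[éá]?n?a?(t[őó]l)?")
_NUMBER_PATTERN = re.compile(r"(\d+([-,/_.:]?(._)?\d+)*%?)")

m = _DATE_PATTERN.match("23-án")        # date-like lemma of a NOUN token
print(m.group(1) + ".")                 # "23." - what _remove_date_suffixes assigns

m = _NUMBER_PATTERN.match("12:30-kor")  # NUM token text with a case suffix
print(m.group(0))                       # "12:30" - what _remove_suffix_after_numbers assigns
```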
lemmatizer/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:bc78b274c26afb6cdc046ef08e700eadd5ac67afccfb637f3c9fcdeda2d2f8d3
3
  size 61643136
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e0e69dbbfcec14a02ecead9533398c131c27b19821afddf82abfe54f014452c6
3
  size 61643136
lookup_lemmatizer.py ADDED
@@ -0,0 +1,132 @@
1
+ import re
2
+ from collections import defaultdict
3
+ from operator import itemgetter
4
+ from pathlib import Path
5
+ from re import Pattern
6
+ from typing import Optional, Callable, Iterable, Dict, Tuple
7
+
8
+ from spacy.lang.hu import Hungarian
9
+ from spacy.language import Language
10
+ from spacy.lookups import Lookups, Table
11
+ from spacy.pipeline import Pipe
12
+ from spacy.pipeline.lemmatizer import lemmatizer_score
13
+ from spacy.tokens import Token
14
+ from spacy.tokens.doc import Doc
15
+
16
+ # noinspection PyUnresolvedReferences
17
+ from spacy.training.example import Example
18
+ from spacy.util import ensure_path
19
+
20
+
21
+ class LookupLemmatizer(Pipe):
22
+ """
23
+ LookupLemmatizer learns `(token, pos, morph. feat) -> lemma` mappings during training and applies them at prediction
24
+ time.
25
+ """
26
+
27
+ _number_pattern: Pattern = re.compile(r"\d")
28
+
29
+ # noinspection PyUnusedLocal
30
+ @staticmethod
31
+ @Hungarian.factory(
32
+ "lookup_lemmatizer",
33
+ assigns=["token.lemma"],
34
+ requires=["token.pos"],
35
+ default_config={"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"}, "source": ""},
36
+ )
37
+ def create(nlp: Language, name: str, scorer: Optional[Callable], source: str) -> "LookupLemmatizer":
38
+ return LookupLemmatizer(None, source, scorer)
39
+
40
+ def train(self, sentences: Iterable[Iterable[Tuple[str, str, str, str]]], min_occurrences: int = 1) -> None:
41
+ """
42
+
43
+ Args:
44
+ sentences (Iterable[Iterable[Tuple[str, str, str, str]]]): Sentences to learn the mappings from
45
+ min_occurrences (int): mappings occurring fewer times than this threshold are not learned
46
+
47
+ """
48
+
49
+ # Lookup table which maps (upos, form) to (lemma -> frequency),
50
+ # e.g. `{ ("NOUN", "alma"): { "alma" : 99, "alom": 1} }`
51
+ lemma_lookup_table: Dict[Tuple[str, str], Dict[str, int]] = defaultdict(lambda: defaultdict(int))
52
+
53
+ for sentence in sentences:
54
+ for token, pos, feats, lemma in sentence:
55
+ token = self.__mask_numbers(token)
56
+ lemma = self.__mask_numbers(lemma)
57
+ feats_str = ("|" + feats) if feats else ""
58
+ key = (token, pos + feats_str)
59
+ lemma_lookup_table[key][lemma] += 1
60
+ lemma_lookup_table = dict(lemma_lookup_table)
61
+
62
+ self._lookups = Lookups()
63
+ table = Table(name="lemma_lookups")
64
+
65
+ lemma_freq: Dict[str, int]
66
+ for (form, pos), lemma_freq in dict(lemma_lookup_table).items():
67
+ most_freq_lemma, freq = sorted(lemma_freq.items(), key=itemgetter(1), reverse=True)[0]
68
+ if freq >= min_occurrences:
69
+ if form not in table:
70
+ # lemma by pos
71
+ table[form]: Dict[str, str] = dict()
72
+ table[form][pos] = most_freq_lemma
73
+
74
+ self._lookups.set_table(name=f"lemma_lookups", table=table)
75
+
76
+ def __init__(
77
+ self,
78
+ lookups: Optional[Lookups] = None,
79
+ source: Optional[str] = None,
80
+ scorer: Optional[Callable] = lemmatizer_score,
81
+ ):
82
+ self._lookups: Optional[Lookups] = lookups
83
+ self.scorer = scorer
84
+ self.source = source
85
+
86
+ def __call__(self, doc: Doc) -> Doc:
87
+ assert self._lookups is not None, "Lookup table should be initialized first"
88
+
89
+ token: Token
90
+ for token in doc:
91
+ lemma_lookup_table = self._lookups.get_table(f"lemma_lookups")
92
+ masked_token = self.__mask_numbers(token.text)
93
+
94
+ if masked_token in lemma_lookup_table:
95
+ lemma_by_pos: Dict[str, str] = lemma_lookup_table[masked_token]
96
+ feats_str = ("|" + str(token.morph)) if str(token.morph) else ""
97
+ key = token.pos_ + feats_str
98
+ if key in lemma_by_pos:
99
+ if masked_token != token.text:
100
+ # If the token contains numbers, we need to replace the numbers in the lemma as well
101
+ token.lemma_ = self.__replace_numbers(lemma_by_pos[key], token.text)
102
+ pass
103
+ else:
104
+ token.lemma_ = lemma_by_pos[key]
105
+ return doc
106
+
107
+ # noinspection PyUnusedLocal
108
+ def to_disk(self, path, exclude=tuple()):
109
+ assert self._lookups is not None, "Lookup table should be initialized first"
110
+
111
+ path: Path = ensure_path(path)
112
+ path.mkdir(exist_ok=True)
113
+ self._lookups.to_disk(path)
114
+
115
+ # noinspection PyUnusedLocal
116
+ def from_disk(self, path, exclude=tuple()) -> "LookupLemmatizer":
117
+ path: Path = ensure_path(path)
118
+ lookups = Lookups()
119
+ self._lookups = lookups.from_disk(path=path)
120
+ return self
121
+
122
+ def initialize(self, get_examples: Callable[[], Iterable[Example]], *, nlp: Language = None) -> None:
123
+ lookups = Lookups()
124
+ self._lookups = lookups.from_disk(path=self.source)
125
+
126
+ @classmethod
127
+ def __mask_numbers(cls, token: str) -> str:
128
+ return cls._number_pattern.sub("0", token)
129
+
130
+ @classmethod
131
+ def __replace_numbers(cls, lemma: str, token: str) -> str:
132
+ return cls._number_pattern.sub(lambda match: token[match.start()], lemma)
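
`LookupLemmatizer` is a purely frequency-based component: `train` collects `(number-masked form, POS|feats) -> most frequent lemma` mappings, `to_disk`/`initialize` persist and reload them, and `__call__` overwrites a token's lemma only on exact table hits. A hedged sketch of the offline training step (the sentence tuples are toy data):

```python
# Hedged sketch: building and saving the lookup table outside a spaCy pipeline.
# The tuples are (token, upos, feats, lemma); real training would stream a corpus.
from lookup_lemmatizer import LookupLemmatizer

sentences = [
    [
        ("Az", "DET", "Definite=Def|PronType=Art", "az"),
        ("almát", "NOUN", "Case=Acc|Number=Sing", "alma"),
        ("eszi", "VERB", "Definite=Def|Mood=Ind|Number=Sing|Person=3", "eszik"),
    ]
]

lemmatizer = LookupLemmatizer()
lemmatizer.train(sentences, min_occurrences=1)
# config.cfg points paths.lemmatizer_lookups at a directory produced like this
lemmatizer.to_disk("hu_core_news_lg-lookup-lemmatizer-3.5.1")
```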
meta.json CHANGED
@@ -1,7 +1,7 @@
1
  {
2
  "lang":"hu",
3
  "name":"core_news_lg",
4
- "version":"3.5.0",
5
  "description":"Core Hungarian model for HuSpaCy. Components: tok2vec, senter, tagger, morphologizer, lemmatizer, parser, ner",
6
  "author":"SzegedAI, MILAB",
7
  "email":"gyorgy@orosz.link",
@@ -1273,85 +1273,85 @@
1273
  "token_p":0.998565417,
1274
  "token_r":0.9993300153,
1275
  "token_f":0.9989475698,
1276
- "sents_p":0.9819413093,
1277
- "sents_r":0.9688195991,
1278
- "sents_f":0.9753363229,
1279
- "tag_acc":0.9643028041,
1280
- "pos_acc":0.9634414777,
1281
- "morph_acc":0.9310938846,
1282
- "morph_micro_p":0.9679604064,
1283
- "morph_micro_r":0.9581435324,
1284
- "morph_micro_f":0.9630269523,
1285
  "morph_per_feat":{
1286
  "Definite":{
1287
- "p":0.9642201835,
1288
- "r":0.9808679421,
1289
- "f":0.9724728198
1290
  },
1291
  "PronType":{
1292
- "p":0.971869829,
1293
- "r":0.972406181,
1294
- "f":0.972137931
1295
  },
1296
  "Case":{
1297
- "p":0.9724220624,
1298
- "r":0.9614700652,
1299
- "f":0.9669150522
1300
  },
1301
  "Degree":{
1302
- "p":0.9230072464,
1303
- "r":0.8477537438,
1304
- "f":0.8837814397
1305
  },
1306
  "Number":{
1307
- "p":0.9844515802,
1308
- "r":0.9762024468,
1309
- "f":0.98030966
1310
  },
1311
  "Mood":{
1312
- "p":0.9429198683,
1313
- "r":0.9523281596,
1314
- "f":0.9476006619
1315
  },
1316
  "Person":{
1317
- "p":0.9542429285,
1318
- "r":0.9432565789,
1319
- "f":0.9487179487
1320
  },
1321
  "Tense":{
1322
- "p":0.9660087719,
1323
- "r":0.973480663,
1324
- "f":0.9697303247
1325
  },
1326
  "VerbForm":{
1327
- "p":0.9516393443,
1328
- "r":0.9310344828,
1329
- "f":0.9412241589
1330
  },
1331
  "Voice":{
1332
- "p":0.9615773509,
1333
- "r":0.972392638,
1334
- "f":0.9669547534
1335
  },
1336
  "Number[psor]":{
1337
- "p":0.9737609329,
1338
- "r":0.9515669516,
1339
- "f":0.9625360231
1340
  },
1341
  "Person[psor]":{
1342
- "p":0.9752186589,
1343
- "r":0.9543509272,
1344
- "f":0.9646719539
1345
  },
1346
  "NumType":{
1347
- "p":0.9423558897,
1348
- "r":0.9170731707,
1349
- "f":0.9295426452
1350
  },
1351
  "Poss":{
1352
- "p":0.75,
1353
  "r":1.0,
1354
- "f":0.8571428571
1355
  },
1356
  "Reflex":{
1357
  "p":1.0,
@@ -1359,9 +1359,9 @@
1359
  "f":0.9333333333
1360
  },
1361
  "Aspect":{
1362
- "p":0.0,
1363
- "r":0.0,
1364
- "f":0.0
1365
  },
1366
  "Number[psed]":{
1367
  "p":0.0,
@@ -1369,114 +1369,114 @@
1369
  "f":0.0
1370
  }
1371
  },
1372
- "lemma_acc":0.9722514592,
1373
- "dep_uas":0.8222334626,
1374
- "dep_las":0.75479121,
1375
  "dep_las_per_type":{
1376
  "det":{
1377
- "p":0.8744149766,
1378
- "r":0.8925159236,
1379
- "f":0.8833727344
1380
  },
1381
  "amod:att":{
1382
- "p":0.8396150762,
1383
  "r":0.8560915781,
1384
- "f":0.8477732794
1385
  },
1386
  "nsubj":{
1387
- "p":0.7182890855,
1388
- "r":0.7609375,
1389
- "f":0.7389984825
1390
  },
1391
  "advmod:mode":{
1392
- "p":0.6243523316,
1393
- "r":0.5906862745,
1394
- "f":0.6070528967
1395
  },
1396
  "nmod:att":{
1397
- "p":0.7721943049,
1398
- "r":0.7813559322,
1399
- "f":0.7767481045
1400
  },
1401
  "obl":{
1402
- "p":0.8051575931,
1403
- "r":0.7587758776,
1404
- "f":0.781278962
1405
  },
1406
  "obj":{
1407
- "p":0.8633257403,
1408
- "r":0.8516853933,
1409
- "f":0.8574660633
1410
  },
1411
  "root":{
1412
- "p":0.7968397291,
1413
- "r":0.7861915367,
1414
- "f":0.7914798206
1415
  },
1416
  "cc":{
1417
- "p":0.7083333333,
1418
- "r":0.68,
1419
- "f":0.693877551
1420
  },
1421
  "conj":{
1422
- "p":0.5010799136,
1423
- "r":0.4833333333,
1424
- "f":0.4920466596
1425
  },
1426
  "advmod":{
1427
- "p":0.7884615385,
1428
- "r":0.8631578947,
1429
- "f":0.824120603
1430
  },
1431
  "flat:name":{
1432
- "p":0.850678733,
1433
- "r":0.8785046729,
1434
- "f":0.8643678161
1435
  },
1436
  "appos":{
1437
- "p":0.3428571429,
1438
  "r":0.3829787234,
1439
- "f":0.3618090452
1440
  },
1441
  "advcl":{
1442
- "p":0.2909090909,
1443
- "r":0.3265306122,
1444
- "f":0.3076923077
1445
  },
1446
  "advmod:tlocy":{
1447
- "p":0.7136929461,
1448
- "r":0.747826087,
1449
- "f":0.7303609342
1450
  },
1451
  "ccomp:obj":{
1452
- "p":0.34375,
1453
- "r":0.3333333333,
1454
- "f":0.3384615385
1455
  },
1456
  "mark":{
1457
- "p":0.8481012658,
1458
- "r":0.8481012658,
1459
- "f":0.8481012658
1460
  },
1461
  "compound:preverb":{
1462
- "p":0.8859649123,
1463
  "r":0.9266055046,
1464
- "f":0.9058295964
1465
  },
1466
  "advmod:locy":{
1467
- "p":0.8333333333,
1468
- "r":0.46875,
1469
- "f":0.6
1470
  },
1471
  "cop":{
1472
- "p":0.7567567568,
1473
- "r":0.6829268293,
1474
- "f":0.7179487179
1475
  },
1476
  "nmod:obl":{
1477
- "p":0.1739130435,
1478
- "r":0.1,
1479
- "f":0.126984127
1480
  },
1481
  "advmod:to":{
1482
  "p":0.0,
@@ -1484,69 +1484,69 @@
1484
  "f":0.0
1485
  },
1486
  "obj:lvc":{
1487
- "p":0.5,
1488
- "r":0.0833333333,
1489
- "f":0.1428571429
1490
  },
1491
  "ccomp:obl":{
1492
- "p":0.6086956522,
1493
- "r":0.4375,
1494
- "f":0.5090909091
1495
  },
1496
  "iobj":{
1497
- "p":0.2941176471,
1498
- "r":0.3333333333,
1499
- "f":0.3125
1500
- },
1501
- "case":{
1502
- "p":0.942408377,
1503
- "r":0.9183673469,
1504
- "f":0.9302325581
1505
  },
1506
  "csubj":{
1507
- "p":0.6666666667,
1508
- "r":0.3783783784,
1509
- "f":0.4827586207
1510
  },
1511
  "parataxis":{
1512
- "p":0.0454545455,
1513
- "r":0.0136986301,
1514
- "f":0.0210526316
1515
  },
1516
  "xcomp":{
1517
- "p":0.9,
1518
- "r":0.8513513514,
1519
- "f":0.875
1520
  },
1521
  "nummod":{
1522
- "p":0.5943396226,
1523
- "r":0.6774193548,
1524
- "f":0.6331658291
1525
- },
1526
- "acl":{
1527
- "p":0.4057971014,
1528
- "r":0.3888888889,
1529
- "f":0.3971631206
1530
  },
1531
  "dep":{
1532
  "p":0.0,
1533
  "r":0.0,
1534
  "f":0.0
1535
  },
1536
  "advmod:tto":{
1537
- "p":0.4545454545,
1538
- "r":0.5,
1539
- "f":0.4761904762
1540
  },
1541
  "nmod":{
1542
- "p":0.6,
1543
- "r":0.2727272727,
1544
- "f":0.375
1545
  },
1546
  "aux":{
1547
- "p":0.8571428571,
1548
- "r":0.5,
1549
- "f":0.6315789474
1550
  },
1551
  "advmod:tfrom":{
1552
  "p":0.0,
@@ -1559,9 +1559,9 @@
1559
  "f":0.0
1560
  },
1561
  "compound":{
1562
- "p":0.95,
1563
- "r":0.95,
1564
- "f":0.95
1565
  },
1566
  "obl:lvc":{
1567
  "p":0.0,
@@ -1579,9 +1579,9 @@
1579
  "f":0.0
1580
  },
1581
  "list":{
1582
- "p":0.2222222222,
1583
- "r":0.3333333333,
1584
- "f":0.2666666667
1585
  },
1586
  "ccomp":{
1587
  "p":0.0,
@@ -1599,32 +1599,32 @@
1599
  "f":0.0
1600
  }
1601
  },
1602
- "ents_p":0.8662957645,
1603
- "ents_r":0.848628692,
1604
- "ents_f":0.8573712256,
1605
  "ents_per_type":{
1606
  "ORG":{
1607
- "p":0.8850889193,
1608
- "r":0.8998609179,
1609
- "f":0.8924137931
1610
  },
1611
  "PER":{
1612
- "p":0.8915009042,
1613
- "r":0.8835125448,
1614
- "f":0.8874887489
1615
  },
1616
  "LOC":{
1617
- "p":0.9098922625,
1618
- "r":0.8064236111,
1619
- "f":0.8550391164
1620
  },
1621
  "MISC":{
1622
- "p":0.6838340486,
1623
- "r":0.6780141844,
1624
- "f":0.6809116809
1625
  }
1626
  },
1627
- "speed":757.2485282534
1628
  },
1629
  "sources":[
1630
  {
1
  {
2
  "lang":"hu",
3
  "name":"core_news_lg",
4
+ "version":"3.5.1",
5
  "description":"Core Hungarian model for HuSpaCy. Components: tok2vec, senter, tagger, morphologizer, lemmatizer, parser, ner",
6
  "author":"SzegedAI, MILAB",
7
  "email":"gyorgy@orosz.link",
1273
  "token_p":0.998565417,
1274
  "token_r":0.9993300153,
1275
  "token_f":0.9989475698,
1276
+ "sents_p":0.984375,
1277
+ "sents_r":0.9821826281,
1278
+ "sents_f":0.983277592,
1279
+ "tag_acc":0.9680845973,
1280
+ "pos_acc":0.9686587875,
1281
+ "morph_acc":0.9363127422,
1282
+ "morph_micro_p":0.9693092418,
1283
+ "morph_micro_r":0.9636441771,
1284
+ "morph_micro_f":0.9664684079,
1285
  "morph_per_feat":{
1286
  "Definite":{
1287
+ "p":0.9693877551,
1288
+ "r":0.9752683154,
1289
+ "f":0.972319144
1290
  },
1291
  "PronType":{
1292
+ "p":0.9778516058,
1293
+ "r":0.9746136865,
1294
+ "f":0.9762299613
1295
  },
1296
  "Case":{
1297
+ "p":0.9743895176,
1298
+ "r":0.9697688204,
1299
+ "f":0.972073678
1300
  },
1301
  "Degree":{
1302
+ "p":0.914507772,
1303
+ "r":0.881031614,
1304
+ "f":0.8974576271
1305
  },
1306
  "Number":{
1307
+ "p":0.9877475663,
1308
+ "r":0.986257751,
1309
+ "f":0.9870020964
1310
  },
1311
  "Mood":{
1312
+ "p":0.9290393013,
1313
+ "r":0.94345898,
1314
+ "f":0.9361936194
1315
  },
1316
  "Person":{
1317
+ "p":0.9529220779,
1318
+ "r":0.9654605263,
1319
+ "f":0.9591503268
1320
  },
1321
  "Tense":{
1322
+ "p":0.9628820961,
1323
+ "r":0.9745856354,
1324
+ "f":0.9686985173
1325
  },
1326
  "VerbForm":{
1327
+ "p":0.9615713066,
1328
+ "r":0.9029671211,
1329
+ "f":0.9313482217
1330
  },
1331
  "Voice":{
1332
+ "p":0.9576612903,
1333
+ "r":0.9713701431,
1334
+ "f":0.9644670051
1335
  },
1336
  "Number[psor]":{
1337
+ "p":0.9852724595,
1338
+ "r":0.952991453,
1339
+ "f":0.9688631427
1340
  },
1341
  "Person[psor]":{
1342
+ "p":0.9867452135,
1343
+ "r":0.9557774608,
1344
+ "f":0.9710144928
1345
  },
1346
  "NumType":{
1347
+ "p":0.9097387173,
1348
+ "r":0.9341463415,
1349
+ "f":0.9217809868
1350
  },
1351
  "Poss":{
1352
+ "p":0.6,
1353
  "r":1.0,
1354
+ "f":0.75
1355
  },
1356
  "Reflex":{
1357
  "p":1.0,
1359
  "f":0.9333333333
1360
  },
1361
  "Aspect":{
1362
+ "p":1.0,
1363
+ "r":0.25,
1364
+ "f":0.4
1365
  },
1366
  "Number[psed]":{
1367
  "p":0.0,
1369
  "f":0.0
1370
  }
1371
  },
1372
+ "lemma_acc":0.9747392594,
1373
+ "dep_uas":0.8158633861,
1374
+ "dep_las":0.7489046175,
1375
  "dep_las_per_type":{
1376
  "det":{
1377
+ "p":0.8498452012,
1378
+ "r":0.8742038217,
1379
+ "f":0.8618524333
1380
  },
1381
  "amod:att":{
1382
+ "p":0.8512195122,
1383
  "r":0.8560915781,
1384
+ "f":0.8536485936
1385
  },
1386
  "nsubj":{
1387
+ "p":0.7018813314,
1388
+ "r":0.7578125,
1389
+ "f":0.7287753569
1390
  },
1391
  "advmod:mode":{
1392
+ "p":0.5764705882,
1393
+ "r":0.6004901961,
1394
+ "f":0.5882352941
1395
  },
1396
  "nmod:att":{
1397
+ "p":0.7673267327,
1398
+ "r":0.7881355932,
1399
+ "f":0.7775919732
1400
  },
1401
  "obl":{
1402
+ "p":0.7942583732,
1403
+ "r":0.7470747075,
1404
+ "f":0.7699443414
1405
  },
1406
  "obj":{
1407
+ "p":0.8322295806,
1408
+ "r":0.8471910112,
1409
+ "f":0.8396436526
1410
  },
1411
  "root":{
1412
+ "p":0.7991071429,
1413
+ "r":0.7973273942,
1414
+ "f":0.7982162765
1415
  },
1416
  "cc":{
1417
+ "p":0.7133479212,
1418
+ "r":0.6863157895,
1419
+ "f":0.6995708155
1420
  },
1421
  "conj":{
1422
+ "p":0.4870775348,
1423
+ "r":0.5104166667,
1424
+ "f":0.498474059
1425
  },
1426
  "advmod":{
1427
+ "p":0.8235294118,
1428
+ "r":0.8842105263,
1429
+ "f":0.8527918782
1430
  },
1431
  "flat:name":{
1432
+ "p":0.9103773585,
1433
+ "r":0.9018691589,
1434
+ "f":0.9061032864
1435
  },
1436
  "appos":{
1437
+ "p":0.45,
1438
  "r":0.3829787234,
1439
+ "f":0.4137931034
1440
  },
1441
  "advcl":{
1442
+ "p":0.297029703,
1443
+ "r":0.306122449,
1444
+ "f":0.3015075377
1445
  },
1446
  "advmod:tlocy":{
1447
+ "p":0.7222222222,
1448
+ "r":0.6782608696,
1449
+ "f":0.6995515695
1450
  },
1451
  "ccomp:obj":{
1452
+ "p":0.3111111111,
1453
+ "r":0.4242424242,
1454
+ "f":0.358974359
1455
  },
1456
  "mark":{
1457
+ "p":0.8246753247,
1458
+ "r":0.8037974684,
1459
+ "f":0.8141025641
1460
  },
1461
  "compound:preverb":{
1462
+ "p":0.9439252336,
1463
  "r":0.9266055046,
1464
+ "f":0.9351851852
1465
  },
1466
  "advmod:locy":{
1467
+ "p":0.72,
1468
+ "r":0.5625,
1469
+ "f":0.6315789474
1470
  },
1471
  "cop":{
1472
+ "p":0.8636363636,
1473
+ "r":0.4634146341,
1474
+ "f":0.6031746032
1475
  },
1476
  "nmod:obl":{
1477
+ "p":0.3125,
1478
+ "r":0.25,
1479
+ "f":0.2777777778
1480
  },
1481
  "advmod:to":{
1482
  "p":0.0,
1484
  "f":0.0
1485
  },
1486
  "obj:lvc":{
1487
+ "p":0.0,
1488
+ "r":0.0,
1489
+ "f":0.0
1490
  },
1491
  "ccomp:obl":{
1492
+ "p":0.6470588235,
1493
+ "r":0.34375,
1494
+ "f":0.4489795918
1495
  },
1496
  "iobj":{
1497
+ "p":0.1818181818,
1498
+ "r":0.2666666667,
1499
+ "f":0.2162162162
1500
  },
1501
  "csubj":{
1502
+ "p":0.6428571429,
1503
+ "r":0.2432432432,
1504
+ "f":0.3529411765
1505
+ },
1506
+ "case":{
1507
+ "p":0.9059405941,
1508
+ "r":0.9336734694,
1509
+ "f":0.9195979899
1510
  },
1511
  "parataxis":{
1512
+ "p":0.1666666667,
1513
+ "r":0.0410958904,
1514
+ "f":0.0659340659
1515
  },
1516
  "xcomp":{
1517
+ "p":0.8378378378,
1518
+ "r":0.8378378378,
1519
+ "f":0.8378378378
1520
  },
1521
  "nummod":{
1522
+ "p":0.6071428571,
1523
+ "r":0.5483870968,
1524
+ "f":0.5762711864
1525
  },
1526
  "dep":{
1527
  "p":0.0,
1528
  "r":0.0,
1529
  "f":0.0
1530
  },
1531
+ "acl":{
1532
+ "p":0.3783783784,
1533
+ "r":0.3888888889,
1534
+ "f":0.3835616438
1535
+ },
1536
  "advmod:tto":{
1537
+ "p":0.2,
1538
+ "r":0.1,
1539
+ "f":0.1333333333
1540
  },
1541
  "nmod":{
1542
+ "p":0.2,
1543
+ "r":0.0909090909,
1544
+ "f":0.125
1545
  },
1546
  "aux":{
1547
+ "p":0.875,
1548
+ "r":0.5833333333,
1549
+ "f":0.7
1550
  },
1551
  "advmod:tfrom":{
1552
  "p":0.0,
1559
  "f":0.0
1560
  },
1561
  "compound":{
1562
+ "p":0.9285714286,
1563
+ "r":0.975,
1564
+ "f":0.9512195122
1565
  },
1566
  "obl:lvc":{
1567
  "p":0.0,
1579
  "f":0.0
1580
  },
1581
  "list":{
1582
+ "p":1.0,
1583
+ "r":0.1666666667,
1584
+ "f":0.2857142857
1585
  },
1586
  "ccomp":{
1587
  "p":0.0,
1599
  "f":0.0
1600
  }
1601
  },
1602
+ "ents_p":0.861328125,
1603
+ "ents_r":0.8528481013,
1604
+ "ents_f":0.8570671378,
1605
  "ents_per_type":{
1606
  "ORG":{
1607
+ "p":0.8911439114,
1608
+ "r":0.8956884562,
1609
+ "f":0.8934104046
1610
  },
1611
  "PER":{
1612
+ "p":0.8787346221,
1613
+ "r":0.8960573477,
1614
+ "f":0.8873114463
1615
  },
1616
  "LOC":{
1617
+ "p":0.8728888889,
1618
+ "r":0.8524305556,
1619
+ "f":0.8625384278
1620
  },
1621
  "MISC":{
1622
+ "p":0.6914556962,
1623
+ "r":0.619858156,
1624
+ "f":0.6537023186
1625
  }
1626
  },
1627
+ "speed":877.9572815434
1628
  },
1629
  "sources":[
1630
  {
morphologizer/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a57eb9dae96e9bf8e54a9752ba55e6e3912d979a7dcedd0688262c08e9e29fc4
3
  size 1379030
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d5f93099945740c800708e5c5ed5f7b9acefa3122363f4a4e9d09f89ea3e7bc9
3
  size 1379030
ner/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:349a93f95cb97ee8646a8f145ae82ee106815ea50e02b19ce909155e05f3ac81
3
  size 56989063
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:94bb30ee14bb5ccf6ba9239a594775221b296dc02f803b7828ff46721ecfa749
3
  size 56989063
parser/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:93ef0b7782b1abcd56ae9bbcfe04056d69c40592f38b586e4a00930411b475b4
3
  size 26010735
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:76fa906db716c185d6ff1b495b845ff804bf1b924407117e2a856e0e6df00a51
3
  size 26010735
senter/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:09c4e947afb6e7ce0585e255fdb58e3aa725acefdd375ae2886539ee44578908
3
  size 2845
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b53fc92e5cb9031751bbba9c32bbabe686e72181cdbaaeec17a3235b3923e315
3
  size 2845
tagger/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:49dcfc46103a35c67051f7e925655f3cac46e67ae9d49b3965d8667c28f51911
3
  size 20905
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bfeeaf5292f94e27f75c696e8ded3a8dcd5f4b5747e0cf85eabfb6b88c5b8ce9
3
  size 20905
tok2vec/model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ac424f6dac3b36b4b913cc44e0f43004ec61fbe122ac27d5971915e155c71816
3
  size 56806299
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:78afb26fb2e04038f881d9ed816644ac1fda5dac73dbc3dfeac9ec37be3869e3
3
  size 56806299
vocab/strings.json CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:fddb9576a688eb5303f9d3eec78385396083f8bd525fa342302245ccea6da82d
3
- size 6402729
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:61f851567cea49829a0db0015d50da8cfab49a4c614fc210d339334ca3a99f34
3
+ size 6404011