Tanor committed on
Commit 43afb9f
1 Parent(s): 7181c50

Update spaCy pipeline

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ sr_pner_tesla_j355-any-py3-none-any.whl filter=lfs diff=lfs merge=lfs -text
+ transformer/model filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,69 @@
+ ---
+ tags:
+ - spacy
+ - token-classification
+ language:
+ - sr
+ license: cc-by-sa-3.0
+ model-index:
+ - name: sr_pner_tesla_j355
+   results:
+   - task:
+       name: NER
+       type: token-classification
+     metrics:
+     - name: NER Precision
+       type: precision
+       value: 0.9516940624
+     - name: NER Recall
+       type: recall
+       value: 0.9596130429
+     - name: NER F Score
+       type: f_score
+       value: 0.9556371476
+   - task:
+       name: TAG
+       type: token-classification
+     metrics:
+     - name: TAG (XPOS) Accuracy
+       type: accuracy
+       value: 0.9841723761
+ ---
+ sr_pner_tesla_j355 is a spaCy model meticulously fine-tuned for Part-of-Speech Tagging, Lemmatization, and Named Entity Recognition in Serbian language texts. This advanced model incorporates a transformer layer based on Jerteh-355, enhancing its analytical capabilities. It is proficient in identifying 7 distinct categories of entities: PERS (persons), ROLE (professions), DEMO (demonyms), ORG (organizations), LOC (locations), WORK (artworks), and EVENT (events). Detailed information about these categories is available in the accompanying table. The development of this model has been made possible through the support of the Science Fund of the Republic of Serbia, under grant #7276, for the project 'Text Embeddings - Serbian Language Applications - TESLA'.
+
+ | Feature | Description |
+ | --- | --- |
+ | **Name** | `sr_pner_tesla_j355` |
+ | **Version** | `1.0.0` |
+ | **spaCy** | `>=3.7.2,<3.8.0` |
+ | **Default Pipeline** | `transformer`, `tagger`, `ner` |
+ | **Components** | `transformer`, `tagger`, `ner` |
+ | **Vectors** | 0 keys, 0 unique vectors (0 dimensions) |
+ | **Sources** | n/a |
+ | **License** | `CC BY-SA 3.0` |
+ | **Author** | [Milica Ikonić Nešić, Saša Petalinkar, Mihailo Škorić, Ranka Stanković](https://tesla.rgf.bg.ac.rs/) |
+
+ ### Label Scheme
+
+ <details>
+
+ <summary>View label scheme (23 labels for 2 components)</summary>
+
+ | Component | Labels |
+ | --- | --- |
+ | **`tagger`** | `ADJ`, `ADP`, `ADV`, `AUX`, `CCONJ`, `DET`, `INTJ`, `NOUN`, `NUM`, `PART`, `PRON`, `PROPN`, `PUNCT`, `SCONJ`, `VERB`, `X` |
+ | **`ner`** | `DEMO`, `EVENT`, `LOC`, `ORG`, `PERS`, `ROLE`, `WORK` |
+
+ </details>
+
+ ### Accuracy
+
+ | Type | Score |
+ | --- | --- |
+ | `TAG_ACC` | 98.42 |
+ | `ENTS_F` | 95.56 |
+ | `ENTS_P` | 95.17 |
+ | `ENTS_R` | 95.96 |
+ | `TRANSFORMER_LOSS` | 151439.86 |
+ | `TAGGER_LOSS` | 141230.81 |
+ | `NER_LOSS` | 84043.38 |
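For reference, a minimal usage sketch (not part of the generated card above), assuming the packaged wheel from this commit has been installed so the pipeline is loadable by its package name; the example sentence and the printed predictions are illustrative only:

```python
import spacy

# Assumes the wheel shipped in this repository has been installed,
# making the pipeline importable as a regular Python package.
nlp = spacy.load("sr_pner_tesla_j355")

doc = nlp("Nikola Tesla je rođen u Smiljanu 1856. godine.")

# Named entities from the `ner` component (PERS, ROLE, DEMO, ORG, LOC, WORK, EVENT)
for ent in doc.ents:
    print(ent.text, ent.label_)

# Part-of-speech tags from the `tagger` component (stored on token.tag_)
for token in doc:
    print(token.text, token.tag_)
```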
config.cfg ADDED
@@ -0,0 +1,167 @@
+ [paths]
+ train = "./train.spacy"
+ dev = "./dev.spacy"
+ vectors = null
+ bert = "E:\\scratch2lm\\bert modeli\\jerteh-355"
+ init_tok2vec = null
+
+ [system]
+ gpu_allocator = "pytorch"
+ seed = 0
+
+ [nlp]
+ lang = "sr"
+ pipeline = ["transformer","tagger","ner"]
+ batch_size = 128
+ disabled = []
+ before_creation = null
+ after_creation = null
+ after_pipeline_creation = null
+ tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
+ vectors = {"@vectors":"spacy.Vectors.v1"}
+
+ [components]
+
+ [components.ner]
+ factory = "ner"
+ incorrect_spans_key = null
+ moves = null
+ scorer = {"@scorers":"spacy.ner_scorer.v1"}
+ update_with_oracle_cut_size = 100
+
+ [components.ner.model]
+ @architectures = "spacy.TransitionBasedParser.v2"
+ state_type = "ner"
+ extra_state_tokens = false
+ hidden_width = 64
+ maxout_pieces = 2
+ use_upper = false
+ nO = null
+
+ [components.ner.model.tok2vec]
+ @architectures = "spacy-transformers.TransformerListener.v1"
+ grad_factor = 1.0
+ pooling = {"@layers":"reduce_mean.v1"}
+ upstream = "*"
+
+ [components.tagger]
+ factory = "tagger"
+ label_smoothing = 0.0
+ neg_prefix = "!"
+ overwrite = false
+ scorer = {"@scorers":"spacy.tagger_scorer.v1"}
+
+ [components.tagger.model]
+ @architectures = "spacy.Tagger.v2"
+ nO = null
+ normalize = false
+
+ [components.tagger.model.tok2vec]
+ @architectures = "spacy-transformers.TransformerListener.v1"
+ grad_factor = 1.0
+ pooling = {"@layers":"reduce_mean.v1"}
+ upstream = "*"
+
+ [components.transformer]
+ factory = "transformer"
+ max_batch_items = 4096
+ set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}
+
+ [components.transformer.model]
+ @architectures = "spacy-transformers.TransformerModel.v3"
+ name = ${paths.bert}
+ mixed_precision = false
+
+ [components.transformer.model.get_spans]
+ @span_getters = "spacy-transformers.strided_spans.v1"
+ window = 128
+ stride = 96
+
+ [components.transformer.model.grad_scaler_config]
+
+ [components.transformer.model.tokenizer_config]
+ use_fast = true
+
+ [components.transformer.model.transformer_config]
+
+ [corpora]
+
+ [corpora.dev]
+ @readers = "spacy.Corpus.v1"
+ path = ${paths.dev}
+ max_length = 0
+ gold_preproc = false
+ limit = 0
+ augmenter = null
+
+ [corpora.train]
+ @readers = "spacy.Corpus.v1"
+ path = ${paths.train}
+ max_length = 0
+ gold_preproc = false
+ limit = 0
+ augmenter = null
+
+ [training]
+ accumulate_gradient = 3
+ dev_corpus = "corpora.dev"
+ train_corpus = "corpora.train"
+ annotating_components = ["tagger"]
+ seed = ${system.seed}
+ gpu_allocator = ${system.gpu_allocator}
+ dropout = 0.1
+ patience = 1600
+ max_epochs = 0
+ max_steps = 20000
+ eval_frequency = 200
+ frozen_components = []
+ before_to_disk = null
+ before_update = null
+
+ [training.batcher]
+ @batchers = "spacy.batch_by_padded.v1"
+ discard_oversize = true
+ size = 2000
+ buffer = 256
+ get_length = null
+
+ [training.logger]
+ @loggers = "spacy.ConsoleLogger.v1"
+ progress_bar = false
+
+ [training.optimizer]
+ @optimizers = "Adam.v1"
+ beta1 = 0.9
+ beta2 = 0.999
+ L2_is_weight_decay = true
+ L2 = 0.01
+ grad_clip = 1.0
+ use_averages = false
+ eps = 0.00000001
+
+ [training.optimizer.learn_rate]
+ @schedules = "warmup_linear.v1"
+ warmup_steps = 250
+ total_steps = 20000
+ initial_rate = 0.00005
+
+ [training.score_weights]
+ tag_acc = 0.5
+ ents_f = 0.5
+ ents_p = 0.0
+ ents_r = 0.0
+ ents_per_type = null
+
+ [pretraining]
+
+ [initialize]
+ vectors = ${paths.vectors}
+ init_tok2vec = ${paths.init_tok2vec}
+ vocab_data = null
+ lookups = null
+ before_init = null
+ after_init = null
+
+ [initialize.components]
+
+ [initialize.tokenizer]
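For orientation, a sketch of how a config like this is typically consumed with spaCy 3's programmatic training API. The paths below are hypothetical, and `paths.bert` is overridden because the value stored in the config is a Windows path local to the authors' machine:

```python
from spacy.cli.train import train

# Hypothetical local paths; the config's own defaults (including
# [paths] bert) refer to the authors' environment, so override them here.
train(
    "config.cfg",
    output_path="./output",
    use_gpu=0,  # the config uses the pytorch GPU allocator
    overrides={
        "paths.train": "./train.spacy",
        "paths.dev": "./dev.spacy",
        "paths.bert": "./jerteh-355",  # local copy of the Jerteh-355 transformer
    },
)
```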
meta.json ADDED
@@ -0,0 +1,112 @@
+ {
+   "lang":"sr",
+   "name":"pner_tesla_j355",
+   "version":"1.0.0",
+   "description":"sr_pner_tesla_j355 is a spaCy model meticulously fine-tuned for Part-of-Speech Tagging, Lemmatization, and Named Entity Recognition in Serbian language texts. This advanced model incorporates a transformer layer based on Jerteh-355, enhancing its analytical capabilities. It is proficient in identifying 7 distinct categories of entities: PERS (persons), ROLE (professions), DEMO (demonyms), ORG (organizations), LOC (locations), WORK (artworks), and EVENT (events). Detailed information about these categories is available in the accompanying table. The development of this model has been made possible through the support of the Science Fund of the Republic of Serbia, under grant #7276, for the project 'Text Embeddings - Serbian Language Applications - TESLA'.",
+   "author":"Milica Ikoni\u0107 Ne\u0161i\u0107, Sa\u0161a Petalinkar, Mihailo \u0160kori\u0107, Ranka Stankovi\u0107",
+   "email":"",
+   "url":"https://tesla.rgf.bg.ac.rs/",
+   "license":"CC BY-SA 3.0",
+   "spacy_version":">=3.7.2,<3.8.0",
+   "spacy_git_version":"a89eae928",
+   "vectors":{
+     "width":0,
+     "vectors":0,
+     "keys":0,
+     "name":null
+   },
+   "labels":{
+     "transformer":[
+
+     ],
+     "tagger":[
+       "ADJ",
+       "ADP",
+       "ADV",
+       "AUX",
+       "CCONJ",
+       "DET",
+       "INTJ",
+       "NOUN",
+       "NUM",
+       "PART",
+       "PRON",
+       "PROPN",
+       "PUNCT",
+       "SCONJ",
+       "VERB",
+       "X"
+     ],
+     "ner":[
+       "DEMO",
+       "EVENT",
+       "LOC",
+       "ORG",
+       "PERS",
+       "ROLE",
+       "WORK"
+     ]
+   },
+   "pipeline":[
+     "transformer",
+     "tagger",
+     "ner"
+   ],
+   "components":[
+     "transformer",
+     "tagger",
+     "ner"
+   ],
+   "disabled":[
+
+   ],
+   "performance":{
+     "tag_acc":0.9841723761,
+     "ents_f":0.9556371476,
+     "ents_p":0.9516940624,
+     "ents_r":0.9596130429,
+     "ents_per_type":{
+       "ROLE":{
+         "p":0.8498759305,
+         "r":0.8995403808,
+         "f":0.8740031898
+       },
+       "PERS":{
+         "p":0.9859102244,
+         "r":0.9889931207,
+         "f":0.9874492663
+       },
+       "LOC":{
+         "p":0.9557385486,
+         "r":0.9789140749,
+         "f":0.9671875
+       },
+       "DEMO":{
+         "p":0.9150227618,
+         "r":0.9496062992,
+         "f":0.9319938176
+       },
+       "ORG":{
+         "p":0.835443038,
+         "r":0.7213114754,
+         "f":0.7741935484
+       },
+       "WORK":{
+         "p":0.6461538462,
+         "r":0.5915492958,
+         "f":0.6176470588
+       },
+       "EVENT":{
+         "p":0.6774193548,
+         "r":0.65625,
+         "f":0.6666666667
+       }
+     },
+     "transformer_loss":1514.3986259683,
+     "tagger_loss":1412.3081215593,
+     "ner_loss":840.4338100267
+   },
+   "requirements":[
+     "spacy-transformers>=1.3.4,<1.4.0"
+   ]
+ }
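The scores stored above can be read back from this file directly (or from `nlp.meta` on a loaded pipeline); a small sketch, assuming `meta.json` is in the working directory:

```python
import json

with open("meta.json", encoding="utf-8") as f:
    meta = json.load(f)

# Overall and per-entity-type NER scores recorded at training time
perf = meta["performance"]
print("ENTS_F:", perf["ents_f"])
for label, scores in perf["ents_per_type"].items():
    print(f"{label}: P={scores['p']:.3f} R={scores['r']:.3f} F={scores['f']:.3f}")
```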
ner/cfg ADDED
@@ -0,0 +1,13 @@
+ {
+   "moves":null,
+   "update_with_oracle_cut_size":100,
+   "multitasks":[
+
+   ],
+   "min_action_freq":1,
+   "learn_tokens":false,
+   "beam_width":1,
+   "beam_density":0.0,
+   "beam_update_prob":0.0,
+   "incorrect_spans_key":null
+ }
ner/model ADDED
Binary file (310 kB).
 
ner/moves ADDED
@@ -0,0 +1 @@
+ ��moves��{"0":{},"1":{"PERS":66081,"LOC":35152,"ROLE":14259,"ORG":10504,"DEMO":5087,"WORK":973,"EVENT":546},"2":{"PERS":66081,"LOC":35152,"ROLE":14259,"ORG":10504,"DEMO":5087,"WORK":973,"EVENT":546},"3":{"PERS":66081,"LOC":35152,"ROLE":14259,"ORG":10504,"DEMO":5087,"WORK":973,"EVENT":546},"4":{"PERS":66081,"LOC":35152,"ROLE":14259,"ORG":10504,"DEMO":5087,"WORK":973,"EVENT":546,"":1},"5":{"":1}}�cfg��neg_key�
sr_pner_tesla_j355-any-py3-none-any.whl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:396741d112805677bcd33da5746601a25c979f1352826aca98a052479903e4c6
+ size 1309084368
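This LFS pointer tracks the built wheel of the pipeline (presumably produced with `spacy package`). A minimal sketch of verifying an installation, assuming the wheel has first been installed with pip into the current environment:

```python
# Assumed prior step (shell): pip install sr_pner_tesla_j355-any-py3-none-any.whl
import spacy

# After installation the pipeline should appear among the installed model
# packages and be loadable by its package name.
print(spacy.util.get_installed_models())

nlp = spacy.load("sr_pner_tesla_j355")
print(nlp.pipe_names)  # expected: ['transformer', 'tagger', 'ner']
```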
tagger/cfg ADDED
@@ -0,0 +1,23 @@
+ {
+   "label_smoothing":0.0,
+   "labels":[
+     "ADJ",
+     "ADP",
+     "ADV",
+     "AUX",
+     "CCONJ",
+     "DET",
+     "INTJ",
+     "NOUN",
+     "NUM",
+     "PART",
+     "PRON",
+     "PROPN",
+     "PUNCT",
+     "SCONJ",
+     "VERB",
+     "X"
+   ],
+   "neg_prefix":"!",
+   "overwrite":false
+ }
tagger/model ADDED
Binary file (66.2 kB).
 
tokenizer ADDED
Binary file (32.6 kB).
 
transformer/cfg ADDED
@@ -0,0 +1,3 @@
+ {
+   "max_batch_items":4096
+ }
transformer/model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e33ff19e8a8543d01c7158b35ac9a6bb518f562ff087d88f51cc6160fcfbfa31
+ size 1420537659
vocab/key2row ADDED
@@ -0,0 +1 @@
+
vocab/lookups.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:76be8b528d0075f7aae98d6fa57a6d3c83ae480a8469e668d7b0af968995ac71
+ size 1
vocab/strings.json ADDED
The diff for this file is too large to render.
 
vocab/vectors ADDED
Binary file (128 Bytes).
 
vocab/vectors.cfg ADDED
@@ -0,0 +1,3 @@
+ {
+   "mode":"default"
+ }