Aaron-96 committed on
Commit
c298231
1 Parent(s): 7f4260c

Upload 10 files

README.md ADDED
@@ -0,0 +1,220 @@
+ ---
+ library_name: span-marker
+ tags:
+ - span-marker
+ - token-classification
+ - ner
+ - named-entity-recognition
+ - generated_from_span_marker_trainer
+ metrics:
+ - precision
+ - recall
+ - f1
+ widget:
+ - text: The Bengal tiger is the most common subspecies of tiger, constituting approximately
+     80% of the entire tiger population, and is found in Bangladesh, Bhutan, Myanmar,
+     Nepal, and India.
+ - text: In other countries, it is a non-commissioned rank (e.g. Spain, Italy, France,
+     the Netherlands and the Indonesian Police ranks).
+ - text: The filling consists of fish, pork and bacon, and is seasoned with salt (unless
+     the pork is already salted).
+ - text: This stood until August 20, 1993 when it was beaten by one 1 / 100th of a
+     second by Colin Jackson of Great Britain in Stuttgart, Germany, a subsequent record
+     that stood for 13 years.
+ - text: Ann Patchett ’s novel " Bel Canto ", was another creative influence that helped
+     her manage a plentiful cast of characters.
+ pipeline_tag: token-classification
+ model-index:
+ - name: SpanMarker
+   results:
+   - task:
+       type: token-classification
+       name: Named Entity Recognition
+     dataset:
+       name: Unknown
+       type: unknown
+       split: eval
+     metrics:
+     - type: f1
+       value: 0.9130661114003124
+       name: F1
+     - type: precision
+       value: 0.9148758606300855
+       name: Precision
+     - type: recall
+       value: 0.9112635078969243
+       name: Recall
+ ---
+
+ # SpanMarker
+
+ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for Named Entity Recognition.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** SpanMarker
+ <!-- - **Encoder:** [Unknown](https://huggingface.co/unknown) -->
+ - **Maximum Sequence Length:** 256 tokens
+ - **Maximum Entity Length:** 6 words
+ <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
+ - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
+
+ ### Model Labels
+ | Label | Examples |
+ |:------|:-------------------------------------------------------------------------|
+ | ANIM | "vertebrate", "moth", "G. firmus" |
+ | BIO | "Aspergillus", "Cladophora", "Zythiostroma" |
+ | CEL | "pulsar", "celestial bodies", "neutron star" |
+ | DIS | "social anxiety disorder", "insulin resistance", "Asperger syndrome" |
+ | EVE | "Spanish Civil War", "National Junior Angus Show", "French Revolution" |
+ | FOOD | "Neera", "Bellini ( cocktail )", "soju" |
+ | INST | "Apple II", "Encyclopaedia of Chess Openings", "Android" |
+ | LOC | "Kīlauea", "Hungary", "Vienna" |
+ | MEDIA | "CSI : Crime Scene Investigation", "Big Comic Spirits", "American Idol" |
+ | MYTH | "Priam", "Oźwiena", "Odysseus" |
+ | ORG | "San Francisco Giants", "Arm Holdings", "RTÉ One" |
+ | PER | "Amelia Bence", "Tito Lusiardo", "James Cameron" |
+ | PLANT | "vernal squill", "Sarracenia purpurea", "Drosera rotundifolia" |
+ | TIME | "prehistory", "Age of Enlightenment", "annual paid holiday" |
+ | VEHI | "Short 360", "Ferrari 355 Challenge", "Solution F / Chretien Helicopter" |
+
+ ## Uses
+
+ ### Direct Use for Inference
+
+ ```python
+ from span_marker import SpanMarkerModel
+
+ # Download from the 🤗 Hub
+ model = SpanMarkerModel.from_pretrained("span_marker_model_id")
+ # Run inference
+ entities = model.predict("Ann Patchett ’s novel \" Bel Canto \", was another creative influence that helped her manage a plentiful cast of characters.")
+ ```
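The inference snippet above stops at `entities`. A minimal sketch of one way to consume that result follows, assuming each prediction is a dict with `"span"`, `"label"` and `"score"` keys (check this against the SpanMarker documentation for the installed version). The helper `group_by_label` and the sample data are hypothetical; the scores are hand-written for illustration and were not produced by this model.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(entities: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Group predicted spans by label, dropping low-confidence predictions."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hand-written stand-in for the list returned by `model.predict(...)`.
# The key names are an assumption; the scores are illustrative only.
sample_entities = [
    {"span": "Ann Patchett", "label": "PER", "score": 0.99},
    {"span": "Bel Canto", "label": "MEDIA", "score": 0.95},
    {"span": "characters", "label": "PER", "score": 0.12},
]

print(group_by_label(sample_entities))
```

Raising `min_score` trades recall for precision, so the threshold is best chosen on held-out data, not fixed in advance.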
+
+ ### Downstream Use
+ You can fine-tune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ ```python
+ from datasets import load_dataset
+ from span_marker import SpanMarkerModel, Trainer
+
+ # Download from the 🤗 Hub
+ model = SpanMarkerModel.from_pretrained("span_marker_model_id")
+
+ # Specify a Dataset with "tokens" and "ner_tags" columns
+ dataset = load_dataset("conll2003") # For example CoNLL2003
+
+ # Initialize a Trainer using the pretrained model & dataset
+ trainer = Trainer(
+     model=model,
+     train_dataset=dataset["train"],
+     eval_dataset=dataset["validation"],
+ )
+ trainer.train()
+ trainer.save_model("span_marker_model_id-finetuned")
+ ```
+ </details>
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Set Metrics
+ | Training set | Min | Median | Max |
+ |:----------------------|:----|:--------|:----|
+ | Sentence length | 2 | 21.6493 | 237 |
+ | Entities per sentence | 0 | 1.5369 | 36 |
+
+ ### Training Hyperparameters
+ - learning_rate: 1e-05
+ - train_batch_size: 16
+ - eval_batch_size: 16
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 1
+ - mixed_precision_training: Native AMP
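The `linear` scheduler with a warmup ratio of 0.1 listed above can be sketched in plain Python. This is a simplified illustration of linear warmup followed by linear decay, not the trainer's actual implementation. The function name `linear_lr` and the total of 17,000 steps are assumptions made for the example (the card does not state the exact step count), and the rounding of the warmup length may differ from the library's.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 1e-05,
              warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step`: linear warmup to `base_lr`, then linear decay to 0."""
    # Assumption: warmup length is the rounded fraction of the total steps.
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical total step count, chosen only to make the example concrete.
TOTAL_STEPS = 17_000

for step in (0, 850, 1_700, 9_350, 17_000):
    print(step, linear_lr(step, TOTAL_STEPS))
```

Under these assumptions the rate peaks at `learning_rate` once 10% of the steps have elapsed and reaches zero at the final step.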
+
+ ### Training Results
+ | Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
+ |:------:|:-----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
+ | 0.0576 | 1000 | 0.0142 | 0.8714 | 0.7729 | 0.8192 | 0.9698 |
+ | 0.1153 | 2000 | 0.0107 | 0.8316 | 0.8815 | 0.8558 | 0.9744 |
+ | 0.1729 | 3000 | 0.0092 | 0.8717 | 0.8797 | 0.8757 | 0.9780 |
+ | 0.2306 | 4000 | 0.0082 | 0.8811 | 0.8886 | 0.8848 | 0.9798 |
+ | 0.2882 | 5000 | 0.0084 | 0.8523 | 0.9163 | 0.8831 | 0.9790 |
+ | 0.3459 | 6000 | 0.0079 | 0.8700 | 0.9113 | 0.8902 | 0.9802 |
+ | 0.4035 | 7000 | 0.0070 | 0.9107 | 0.8859 | 0.8981 | 0.9822 |
+ | 0.4611 | 8000 | 0.0069 | 0.9259 | 0.8797 | 0.9022 | 0.9827 |
+ | 0.5188 | 9000 | 0.0067 | 0.9061 | 0.8965 | 0.9013 | 0.9829 |
+ | 0.5764 | 10000 | 0.0066 | 0.9034 | 0.8996 | 0.9015 | 0.9829 |
+ | 0.6341 | 11000 | 0.0064 | 0.9160 | 0.8996 | 0.9077 | 0.9839 |
+ | 0.6917 | 12000 | 0.0066 | 0.8952 | 0.9121 | 0.9036 | 0.9832 |
+ | 0.7494 | 13000 | 0.0062 | 0.9165 | 0.9009 | 0.9086 | 0.9841 |
+ | 0.8070 | 14000 | 0.0062 | 0.9010 | 0.9121 | 0.9065 | 0.9835 |
+ | 0.8647 | 15000 | 0.0062 | 0.9084 | 0.9127 | 0.9105 | 0.9842 |
+ | 0.9223 | 16000 | 0.0060 | 0.9151 | 0.9098 | 0.9125 | 0.9846 |
+ | 0.9799 | 17000 | 0.0060 | 0.9149 | 0.9113 | 0.9131 | 0.9848 |
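As a quick sanity check on the reported metrics, F1 can be recomputed as the harmonic mean of precision and recall. The sketch below uses the full-precision precision and recall from the README front matter; the helper name `f1_score` is hypothetical, and the check assumes the card's F1 was derived from those same two numbers.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the `model-index` metrics in the README front matter.
precision = 0.9148758606300855
recall = 0.9112635078969243

print(f1_score(precision, recall))
```

The same function can be applied to any row of the table, though the four-decimal rounding there means the last digit may differ.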
+
+ ### Framework Versions
+ - Python: 3.8.16
+ - SpanMarker: 1.5.0
+ - Transformers: 4.29.0.dev0
+ - PyTorch: 1.10.1
+ - Datasets: 2.15.0
+ - Tokenizers: 0.13.2
+
+ ## Citation
+
+ ### BibTeX
+ ```
+ @software{Aarsen_SpanMarker,
+     author = {Aarsen, Tom},
+     license = {Apache-2.0},
+     title = {{SpanMarker for Named Entity Recognition}},
+     url = {https://github.com/tomaarsen/SpanMarkerNER}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
added_tokens.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "<end>": 50266,
+   "<start>": 50265
+ }
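The two marker tokens take the ids directly after the encoder's original vocabulary, which would explain the `vocab_size` of 50267 in `config.json`. A minimal sketch of that arithmetic follows; the helper `base_vocab_size` is hypothetical, and the idea that added tokens always fill the last ids is an assumption that holds for the values shown here.

```python
import json
from typing import Dict

# Contents of added_tokens.json as shown in the diff above.
added_tokens: Dict[str, int] = json.loads('{"<end>": 50266, "<start>": 50265}')

# "vocab_size" as reported in config.json.
TOTAL_VOCAB_SIZE = 50267


def base_vocab_size(added: Dict[str, int], total: int) -> int:
    """Return the vocabulary size before the added tokens were appended.

    Raises ValueError if the added ids are not the last `len(added)` ids.
    """
    expected_ids = list(range(total - len(added), total))
    if sorted(added.values()) != expected_ids:
        raise ValueError("added tokens do not sit at the end of the vocabulary")
    return total - len(added)


print(base_vocab_size(added_tokens, TOTAL_VOCAB_SIZE))
```

The result corresponds to the vocabulary of the `roberta-base` encoder named in `config.json`.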
config.json ADDED
@@ -0,0 +1,225 @@
+ {
+   "architectures": [
+     "SpanMarkerModel"
+   ],
+   "encoder": {
+     "_name_or_path": "roberta-base",
+     "add_cross_attention": false,
+     "architectures": [
+       "RobertaForMaskedLM"
+     ],
+     "attention_probs_dropout_prob": 0.1,
+     "bad_words_ids": null,
+     "begin_suppress_tokens": null,
+     "bos_token_id": 0,
+     "chunk_size_feed_forward": 0,
+     "classifier_dropout": null,
+     "cross_attention_hidden_size": null,
+     "decoder_start_token_id": null,
+     "diversity_penalty": 0.0,
+     "do_sample": false,
+     "early_stopping": false,
+     "encoder_no_repeat_ngram_size": 0,
+     "eos_token_id": 2,
+     "exponential_decay_length_penalty": null,
+     "finetuning_task": null,
+     "forced_bos_token_id": null,
+     "forced_eos_token_id": null,
+     "hidden_act": "gelu",
+     "hidden_dropout_prob": 0.1,
+     "hidden_size": 768,
+     "id2label": {
+       "0": "O",
+       "1": "B-PER",
+       "2": "I-PER",
+       "3": "B-ORG",
+       "4": "I-ORG",
+       "5": "B-LOC",
+       "6": "I-LOC",
+       "7": "B-ANIM",
+       "8": "I-ANIM",
+       "9": "B-BIO",
+       "10": "I-BIO",
+       "11": "B-CEL",
+       "12": "I-CEL",
+       "13": "B-DIS",
+       "14": "I-DIS",
+       "15": "B-EVE",
+       "16": "I-EVE",
+       "17": "B-FOOD",
+       "18": "I-FOOD",
+       "19": "B-INST",
+       "20": "I-INST",
+       "21": "B-MEDIA",
+       "22": "I-MEDIA",
+       "23": "B-MYTH",
+       "24": "I-MYTH",
+       "25": "B-PLANT",
+       "26": "I-PLANT",
+       "27": "B-TIME",
+       "28": "I-TIME",
+       "29": "B-VEHI",
+       "30": "I-VEHI"
+     },
+     "initializer_range": 0.02,
+     "intermediate_size": 3072,
+     "is_decoder": false,
+     "is_encoder_decoder": false,
+     "label2id": {
+       "B-ANIM": 7,
+       "B-BIO": 9,
+       "B-CEL": 11,
+       "B-DIS": 13,
+       "B-EVE": 15,
+       "B-FOOD": 17,
+       "B-INST": 19,
+       "B-LOC": 5,
+       "B-MEDIA": 21,
+       "B-MYTH": 23,
+       "B-ORG": 3,
+       "B-PER": 1,
+       "B-PLANT": 25,
+       "B-TIME": 27,
+       "B-VEHI": 29,
+       "I-ANIM": 8,
+       "I-BIO": 10,
+       "I-CEL": 12,
+       "I-DIS": 14,
+       "I-EVE": 16,
+       "I-FOOD": 18,
+       "I-INST": 20,
+       "I-LOC": 6,
+       "I-MEDIA": 22,
+       "I-MYTH": 24,
+       "I-ORG": 4,
+       "I-PER": 2,
+       "I-PLANT": 26,
+       "I-TIME": 28,
+       "I-VEHI": 30,
+       "O": 0
+     },
+     "layer_norm_eps": 1e-05,
+     "length_penalty": 1.0,
+     "max_length": 20,
+     "max_position_embeddings": 514,
+     "min_length": 0,
+     "model_type": "roberta",
+     "no_repeat_ngram_size": 0,
+     "num_attention_heads": 12,
+     "num_beam_groups": 1,
+     "num_beams": 1,
+     "num_hidden_layers": 12,
+     "num_return_sequences": 1,
+     "output_attentions": false,
+     "output_hidden_states": false,
+     "output_scores": false,
+     "pad_token_id": 1,
+     "position_embedding_type": "absolute",
+     "prefix": null,
+     "problem_type": null,
+     "pruned_heads": {},
+     "remove_invalid_values": false,
+     "repetition_penalty": 1.0,
+     "return_dict": true,
+     "return_dict_in_generate": false,
+     "sep_token_id": null,
+     "suppress_tokens": null,
+     "task_specific_params": null,
+     "temperature": 1.0,
+     "tf_legacy_loss": false,
+     "tie_encoder_decoder": false,
+     "tie_word_embeddings": true,
+     "tokenizer_class": null,
+     "top_k": 50,
+     "top_p": 1.0,
+     "torch_dtype": null,
+     "torchscript": false,
+     "transformers_version": "4.29.0.dev0",
+     "type_vocab_size": 1,
+     "typical_p": 1.0,
+     "use_bfloat16": false,
+     "use_cache": true,
+     "vocab_size": 50267
+   },
+   "entity_max_length": 6,
+   "id2label": {
+     "0": "O",
+     "1": "ANIM",
+     "2": "BIO",
+     "3": "CEL",
+     "4": "DIS",
+     "5": "EVE",
+     "6": "FOOD",
+     "7": "INST",
+     "8": "LOC",
+     "9": "MEDIA",
+     "10": "MYTH",
+     "11": "ORG",
+     "12": "PER",
+     "13": "PLANT",
+     "14": "TIME",
+     "15": "VEHI"
+   },
+   "id2reduced_id": {
+     "0": 0,
+     "1": 12,
+     "2": 12,
+     "3": 11,
+     "4": 11,
+     "5": 8,
+     "6": 8,
+     "7": 1,
+     "8": 1,
+     "9": 2,
+     "10": 2,
+     "11": 3,
+     "12": 3,
+     "13": 4,
+     "14": 4,
+     "15": 5,
+     "16": 5,
+     "17": 6,
+     "18": 6,
+     "19": 7,
+     "20": 7,
+     "21": 9,
+     "22": 9,
+     "23": 10,
+     "24": 10,
+     "25": 13,
+     "26": 13,
+     "27": 14,
+     "28": 14,
+     "29": 15,
+     "30": 15
+   },
+   "label2id": {
+     "ANIM": 1,
+     "BIO": 2,
+     "CEL": 3,
+     "DIS": 4,
+     "EVE": 5,
+     "FOOD": 6,
+     "INST": 7,
+     "LOC": 8,
+     "MEDIA": 9,
+     "MYTH": 10,
+     "O": 0,
+     "ORG": 11,
+     "PER": 12,
+     "PLANT": 13,
+     "TIME": 14,
+     "VEHI": 15
+   },
+   "marker_max_length": 128,
+   "max_next_context": null,
+   "max_prev_context": null,
+   "model_max_length": 256,
+   "model_max_length_default": 512,
+   "model_type": "span-marker",
+   "span_marker_version": "1.5.0",
+   "torch_dtype": "float32",
+   "trained_with_document_context": false,
+   "transformers_version": "4.29.0.dev0",
+   "vocab_size": 50267
+ }
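The `id2reduced_id` table in this config appears to map each of the encoder's 31 BIO tag ids onto the 16 span-level label ids by dropping the `B-`/`I-` prefix. The sketch below rebuilds that mapping from the two label sets to show how they relate. The names `ENTITY_TYPES` and `strip_prefix` are hypothetical, and the construction (reduced labels sorted alphabetically after `O`) is inferred from the values in the config, not taken from SpanMarker's source.

```python
from typing import Dict

# Entity types in the order they appear in the encoder's id2label.
ENTITY_TYPES = ["PER", "ORG", "LOC", "ANIM", "BIO", "CEL", "DIS", "EVE",
                "FOOD", "INST", "MEDIA", "MYTH", "PLANT", "TIME", "VEHI"]

# Encoder-level scheme: 0 -> "O", then a B-/I- pair per entity type.
encoder_id2label: Dict[int, str] = {0: "O"}
for i, entity_type in enumerate(ENTITY_TYPES):
    encoder_id2label[2 * i + 1] = f"B-{entity_type}"
    encoder_id2label[2 * i + 2] = f"I-{entity_type}"

# Span-level scheme: "O" is 0, the remaining labels are numbered alphabetically.
reduced_label2id: Dict[str, int] = {"O": 0}
for idx, entity_type in enumerate(sorted(ENTITY_TYPES), start=1):
    reduced_label2id[entity_type] = idx


def strip_prefix(tag: str) -> str:
    """Drop the B-/I- prefix from a BIO tag; "O" is returned unchanged."""
    return tag if tag == "O" else tag.split("-", 1)[1]


id2reduced_id = {
    tag_id: reduced_label2id[strip_prefix(tag)]
    for tag_id, tag in encoder_id2label.items()
}

print(id2reduced_id)
```

Comparing the printed dictionary with the `id2reduced_id` block above is a cheap way to confirm that a hand-edited config still has consistent label tables.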
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9dc93bdd5f259d2792df17836946c0030d24d300a992b2ddd7325cbef92bb381
+ size 498758701
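The three lines above are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 of the real file, and its size in bytes. A minimal sketch of reading such a pointer follows; the function name `parse_lfs_pointer` is hypothetical, and the parser handles only the three keys shown here.

```python
from typing import Dict, Union

POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:9dc93bdd5f259d2792df17836946c0030d24d300a992b2ddd7325cbef92bb381
size 498758701
"""


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Parse the `key value` lines of a Git LFS pointer file."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid is stored as "<hash algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algorithm"], pointer["size_bytes"])
```

After downloading `pytorch_model.bin`, comparing its byte length and SHA-256 digest against these fields is one way to confirm the download is complete.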
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "add_prefix_space": true,
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "entity_max_length": 6,
+   "eos_token": "</s>",
+   "errors": "replace",
+   "marker_max_length": 128,
+   "mask_token": "<mask>",
+   "model_max_length": 256,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": "<unk>"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2df4ccd3adf73377163a8dab4acb4eb632991616891f734eab96f2aed9cb50b1
+ size 3887
vocab.json ADDED
The diff for this file is too large to render. See raw diff