tomaarsen (HF staff) committed
Commit
66ddb3f
1 Parent(s): 2cee37f

Use updated, generated README

Files changed (1)
  1. README.md +108 -11
README.md CHANGED
@@ -1,4 +1,7 @@
 ---
+language:
+- en
+license: apache-2.0
 library_name: span-marker
 tags:
 - span-marker
@@ -6,35 +9,92 @@ tags:
 - ner
 - named-entity-recognition
 - generated_from_span_marker_trainer
+datasets:
+- acronym_identification
 metrics:
 - precision
 - recall
 - f1
-widget: []
+widget:
+- text: "here, da = direct assessment, rr = relative ranking, ds = discrete scale and cs = continuous scale."
+  example_title: "Uncased 1"
+- text: "modifying or replacing the erasable programmable read only memory (eprom) in a phone would allow the configuration of any esn and min via software for cellular devices."
+  example_title: "Uncased 2"
+- text: "we propose a technique called aggressive stochastic weight averaging (aswa) and an extension called norm-filtered aggressive stochastic weight averaging (naswa) which improves the stability of models over random seeds."
+  example_title: "Uncased 3"
+- text: "the choice of the encoder and decoder modules of dnpg can be quite flexible, for instance long-short term memory networks (lstm) or convolutional neural network (cnn)."
+  example_title: "Uncased 4"
 pipeline_tag: token-classification
+co2_eq_emissions:
+  emissions: 31.203903222402037
+  source: codecarbon
+  training_type: fine-tuning
+  on_cloud: false
+  cpu_model: 13th Gen Intel(R) Core(TM) i7-13700K
+  ram_total_size: 31.777088165283203
+  hours_used: 0.272
+  hardware_used: 1 x NVIDIA GeForce RTX 3090
+base_model: bert-base-uncased
+model-index:
+- name: SpanMarker with bert-base-uncased on Acronym Identification
+  results:
+  - task:
+      type: token-classification
+      name: Named Entity Recognition
+    dataset:
+      name: Acronym Identification
+      type: acronym_identification
+      split: validation
+    metrics:
+    - type: f1
+      value: 0.9198933333333332
+      name: F1
+    - type: precision
+      value: 0.9339397877409573
+      name: Precision
+    - type: recall
+      value: 0.9062631357713324
+      name: Recall
 ---
 
-# SpanMarker
+# SpanMarker with bert-base-uncased on Acronym Identification
 
-This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for Named Entity Recognition.
+This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [Acronym Identification](https://huggingface.co/datasets/acronym_identification) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [bert-base-uncased](https://huggingface.co/bert-base-uncased) as the underlying encoder. See [train.py](train.py) for the training script.
+
+Is your data always capitalized correctly? Then consider using the cased variant of this model instead for better performance: tomaarsen/span-marker-bert-base-acronyms.
 
 ## Model Details
 
 ### Model Description
 
 - **Model Type:** SpanMarker
-<!-- - **Encoder:** [Unknown](https://huggingface.co/unknown) -->
+- **Encoder:** [bert-base-uncased](https://huggingface.co/bert-base-uncased)
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
-<!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
+- **Training Dataset:** [Acronym Identification](https://huggingface.co/datasets/acronym_identification)
+- **Language:** en
+- **License:** apache-2.0
 
 ### Model Sources
 
 - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
 - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
 
+### Model Labels
+| Label | Examples                                                                                              |
+|:------|:------------------------------------------------------------------------------------------------------|
+| long  | "successive convex approximation", "controlled natural language", "Conversational Question Answering" |
+| short | "SODA", "CNL", "CoQA"                                                                                 |
+
+## Evaluation
+
+### Metrics
+| Label   | Precision | Recall | F1     |
+|:--------|:----------|:-------|:-------|
+| **all** | 0.9339    | 0.9063 | 0.9199 |
+| long    | 0.9314    | 0.8845 | 0.9074 |
+| short   | 0.9352    | 0.9174 | 0.9262 |
+
 ## Uses
 
 ### Direct Use for Inference
@@ -43,9 +103,9 @@ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that ca
 from span_marker import SpanMarkerModel
 
 # Download from the 🤗 Hub
-model = SpanMarkerModel.from_pretrained("span_marker_model_id")
+model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-acronyms")
 # Run inference
-entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
+entities = model.predict("compression algorithms like principal component analysis (pca) can reduce noise and complexity.")
 ```
 
 ### Downstream Use
@@ -57,7 +117,7 @@ You can finetune this model on your own dataset.
 from span_marker import SpanMarkerModel, Trainer
 
 # Download from the 🤗 Hub
-model = SpanMarkerModel.from_pretrained("span_marker_model_id")
+model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-acronyms")
 
 # Specify a Dataset with "tokens" and "ner_tag" columns
 dataset = load_dataset("conll2003") # For example CoNLL2003
@@ -69,7 +129,7 @@ trainer = Trainer(
     eval_dataset=dataset["validation"],
 )
 trainer.train()
-trainer.save_model("span_marker_model_id-finetuned")
+trainer.save_model("tomaarsen/span-marker-bert-base-uncased-acronyms-finetuned")
 ```
 </details>
 
@@ -93,6 +153,43 @@ trainer.save_model("span_marker_model_id-finetuned")
 
 ## Training Details
 
+### Training Set Metrics
+| Training set          | Min | Median  | Max |
+|:----------------------|:----|:--------|:----|
+| Sentence length       | 4   | 32.3372 | 170 |
+| Entities per sentence | 0   | 2.6775  | 24  |
+
+### Training Hyperparameters
+- learning_rate: 5e-05
+- train_batch_size: 32
+- eval_batch_size: 32
+- seed: 42
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_ratio: 0.1
+- num_epochs: 2
+
+### Training Results
+| Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
+|:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
+| 0.3120 | 200  | 0.0097          | 0.8999               | 0.8731            | 0.8863        | 0.9718              |
+| 0.6240 | 400  | 0.0075          | 0.9163               | 0.8995            | 0.9078        | 0.9769              |
+| 0.9360 | 600  | 0.0076          | 0.9079               | 0.9153            | 0.9116        | 0.9773              |
+| 1.2480 | 800  | 0.0069          | 0.9267               | 0.9006            | 0.9135        | 0.9778              |
+| 1.5601 | 1000 | 0.0065          | 0.9268               | 0.9044            | 0.9154        | 0.9782              |
+| 1.8721 | 1200 | 0.0065          | 0.9279               | 0.9061            | 0.9168        | 0.9787              |
+
+### Environmental Impact
+Carbon emissions were measured using [CodeCarbon](https://github.com/mlco2/codecarbon).
+- **Carbon Emitted**: 0.031 kg of CO2
+- **Hours Used**: 0.272 hours
+
+### Training Hardware
+- **On Cloud**: No
+- **GPU Model**: 1 x NVIDIA GeForce RTX 3090
+- **CPU Model**: 13th Gen Intel(R) Core(TM) i7-13700K
+- **RAM Size**: 31.78 GB
+
 ### Framework Versions
 
 - Python: 3.9.16
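A quick consistency check on the evaluation numbers in the updated card: F1 is the harmonic mean of precision and recall, and the aggregate values reported in the `model-index` block do agree with one another. A minimal sketch in Python (values copied verbatim from the card):

```python
import math

# Aggregate metrics from the model card's model-index block.
precision = 0.9339397877409573
recall = 0.9062631357713324
f1_reported = 0.9198933333333332

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

assert math.isclose(f1, f1_reported, abs_tol=1e-6)
print(round(f1, 4))  # 0.9199, the "all" row of the metrics table
```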
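The hyperparameters in the new card specify `lr_scheduler_type: linear` with `lr_scheduler_warmup_ratio: 0.1`. As a sketch of the shape of that schedule (an illustration, not the exact `transformers` implementation): the learning rate ramps linearly from 0 to 5e-05 over the first 10% of optimizer steps, then decays linearly back to 0. The total step count below is illustrative; the card does not state it.

```python
def linear_schedule_lr(step: int, total_steps: int,
                       base_lr: float = 5e-05,
                       warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given optimizer step: linear warmup to
    base_lr over the first warmup_ratio of training, then linear
    decay to zero. A sketch of the card's scheduler settings."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    frac = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr * frac

# Illustrative total of 1000 steps:
assert linear_schedule_lr(0, 1000) == 0.0       # cold start
assert linear_schedule_lr(100, 1000) == 5e-05   # peak at end of warmup
assert linear_schedule_lr(550, 1000) == 2.5e-05 # halfway through decay
assert linear_schedule_lr(1000, 1000) == 0.0    # decayed to zero
```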
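The per-label table added under Evaluation reports span-level scores, where a predicted span typically counts as a true positive only if both its boundaries and its label (`long` or `short`) exactly match a gold span. A minimal sketch of that style of evaluation on toy spans (hypothetical data, not the model's real output):

```python
def span_prf(pred, gold):
    """Exact-match span evaluation: spans are (start, end, label)
    triples; precision/recall/F1 are computed over the two sets."""
    pred_set, gold_set = set(pred), set(gold)
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: the short form is found exactly, but the long form's
# predicted boundary is off by one word, so it counts as an error on
# both the precision and the recall side.
gold = [(0, 4, "long"), (6, 7, "short")]
pred = [(0, 3, "long"), (6, 7, "short")]
print(span_prf(pred, gold))  # (0.5, 0.5, 0.5)
```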