dmozzherin committed
Commit 70bd5df (verified)
1 Parent(s): 7f0c902

Add model card

Files changed (1)
  1. README.md +100 -3
README.md CHANGED
@@ -1,3 +1,100 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ tags:
6
+ - token-classification
7
+ - ner
8
+ - biology
9
+ - entomology
10
+ - natural-history
11
+ - deberta
12
+ base_model:
13
+ - microsoft/deberta-v3-small
14
+ - microsoft/deberta-v3-base
15
+ - microsoft/deberta-v3-large
16
+ pipeline_tag: token-classification
17
+ ---
18
+
19
+ # ento-label-deberta
20
+
21
+ DeBERTa-v3 models fine-tuned for NER on insect collection labels. Given a raw
22
+ label string, the model extracts semantic fields as verbatim character spans.
23
+
24
+ Three sizes are included in this repo: `small`, `base`, and `large`
25
+ (subdirectories of the same name). ONNX exports are in `onnx/small`,
26
+ `onnx/base`, and `onnx/large`.
27
+
28
+ ## Entity types
29
+
30
+ | Label | Description |
31
+ |---|---|
32
+ | `country` | Country name |
33
+ | `state` | State, province, or region |
34
+ | `verbatim_locality` | Locality description |
35
+ | `verbatim_date` | Collection date as written |
36
+ | `verbatim_elevation` | Elevation as written |
37
+ | `verbatim_collectors` | Collector name(s) |
38
+ | `verbatim_habitat` | Habitat description |
39
+ | `verbatim_method` | Collection method |
40
+ | `verbatim_latitude` | Latitude as written |
41
+ | `verbatim_longitude` | Longitude as written |
42
+
43
+ ## Evaluation results (macro F1 per entity)
44
+
45
+ | Entity | small | base | large |
46
+ |---|---|---|---|
47
+ | country | 0.9695 | 0.9749 | 0.9751 |
48
+ | state | 0.9046 | 0.9220 | 0.9212 |
49
+ | verbatim_locality | 0.8282 | 0.8499 | 0.8573 |
50
+ | verbatim_date | 0.9673 | 0.9700 | 0.9693 |
51
+ | verbatim_elevation | 0.9722 | 0.9742 | 0.9739 |
52
+ | verbatim_collectors | 0.4867 | 0.5393 | 0.5311 |
53
+ | verbatim_habitat | 0.7485 | 0.7751 | 0.7930 |
54
+ | verbatim_method | 0.9123 | 0.9205 | 0.9080 |
55
+ | verbatim_latitude | 0.7154 | 0.7145 | 0.6512 |
56
+ | verbatim_longitude | 0.8552 | 0.8528 | 0.7969 |
57
+ | **macro avg** | **0.8360** | **0.8493** | **0.8377** |
58
+
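The **macro avg** row appears to be the unweighted mean of the ten per-entity scores. The sketch below recomputes it from the table values and also reports which size scores highest for each entity. The names `F1`, `macro_average`, and `best_size_per_entity` are illustrative and not part of the repository.

```python
# Per-entity F1 scores copied from the table above: (small, base, large).
F1 = {
    "country": (0.9695, 0.9749, 0.9751),
    "state": (0.9046, 0.9220, 0.9212),
    "verbatim_locality": (0.8282, 0.8499, 0.8573),
    "verbatim_date": (0.9673, 0.9700, 0.9693),
    "verbatim_elevation": (0.9722, 0.9742, 0.9739),
    "verbatim_collectors": (0.4867, 0.5393, 0.5311),
    "verbatim_habitat": (0.7485, 0.7751, 0.7930),
    "verbatim_method": (0.9123, 0.9205, 0.9080),
    "verbatim_latitude": (0.7154, 0.7145, 0.6512),
    "verbatim_longitude": (0.8552, 0.8528, 0.7969),
}
SIZES = ("small", "base", "large")


def macro_average(scores):
    """Unweighted mean of the per-entity scores, one value per model size."""
    n = len(scores)
    return {
        size: sum(row[i] for row in scores.values()) / n
        for i, size in enumerate(SIZES)
    }


def best_size_per_entity(scores):
    """Name of the highest-scoring size for each entity type."""
    return {
        entity: SIZES[max(range(len(SIZES)), key=lambda i: row[i])]
        for entity, row in scores.items()
    }


macro = macro_average(F1)
for size in SIZES:
    print(f"{size}: {macro[size]:.4f}")
# small: 0.8360
# base: 0.8493
# large: 0.8377
```

By this breakdown, `large` is not uniformly best: `small` has the highest latitude and longitude scores in the table, so the choice of size may depend on which fields matter most.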
59
+ ## Usage (PyTorch)
60
+
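+ ```python
+ from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
+
+ # Each size lives in its own subdirectory of the repo, so it is passed as
+ # `subfolder` rather than appended to the repo id (assumes the subdirectory
+ # holds both the model and the tokenizer files).
+ repo_id = "SpeciesFileGroup/ento-label-deberta"
+ size = "base"  # or "small" / "large"
+
+ ner = pipeline(
+     "token-classification",
+     model=AutoModelForTokenClassification.from_pretrained(repo_id, subfolder=size),
+     tokenizer=AutoTokenizer.from_pretrained(repo_id, subfolder=size),
+     aggregation_strategy="simple",
+ )
+
+ results = ner("Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori")
+ for r in results:
+     print(r["entity_group"], repr(r["word"]))
+ # country 'Sudan'
+ # state 'Blue Nile'
+ # verbatim_locality 'Abu Hashim'
+ # verbatim_date '23-24.XI.1962'
+ # verbatim_collectors 'Linnavuori'
+ ```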
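Because the model is described as returning verbatim character spans, one way to build a structured record is to slice the original string with the `start` and `end` offsets that the aggregated pipeline output carries, rather than relying on the detokenized `word`. The following is a minimal sketch: `to_record` and `fake_entity` are hypothetical helpers, and the entity list is a hand-written stand-in for real pipeline output with placeholder scores.

```python
from collections import defaultdict


def to_record(text, entities, min_score=0.0):
    """Group pipeline entities by type, slicing the verbatim text by offset."""
    record = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue  # drop low-confidence spans
        record[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(record)


def fake_entity(text, group, substring, score=1.0):
    """Stand-in for one pipeline result; the score is a placeholder."""
    start = text.index(substring)
    return {
        "entity_group": group,
        "score": score,
        "word": substring,
        "start": start,
        "end": start + len(substring),
    }


text = "Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori"
entities = [
    fake_entity(text, "country", "Sudan"),
    fake_entity(text, "state", "Blue Nile"),
    fake_entity(text, "verbatim_locality", "Abu Hashim"),
    fake_entity(text, "verbatim_date", "23-24.XI.1962"),
    fake_entity(text, "verbatim_collectors", "Linnavuori"),
]

record = to_record(text, entities)
print(record["verbatim_date"])  # ['23-24.XI.1962']
```

Values are kept as lists because a label may contain several spans of the same type, for example two collectors.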
79
+
80
+ ## Usage (ONNX / hugot)
81
+
82
+ ONNX models are compatible with
83
+ [hugot](https://github.com/knights-analytics/hugot) and ONNX Runtime. Load
84
+ from `onnx/small`, `onnx/base`, or `onnx/large`.
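When running the ONNX export directly, there is no pipeline to aggregate tokens, so per-token predictions have to be merged into spans by hand. The sketch below shows one simple merge rule, similar in spirit to the `simple` aggregation used in the PyTorch example. It assumes that per-token labels have already been obtained (for instance by taking the argmax of the logits and mapping ids through `id2label`), that `O` marks tokens outside any entity, that labels may carry `B-`/`I-` prefixes, and that character offsets come from the tokenizer's offset mapping. None of these details are confirmed by the model card, and `merge_tokens` is a hypothetical helper.

```python
def split_tag(label):
    """'B-country' -> ('B', 'country'); an unprefixed 'country' -> ('I', 'country')."""
    if label[:2] in ("B-", "I-"):
        return label[0], label[2:]
    return "I", label


def merge_tokens(text, labels, offsets, outside="O"):
    """Merge adjacent tokens of the same entity type into character spans."""
    spans = []
    current = None  # span currently being extended, if any
    for label, (start, end) in zip(labels, offsets):
        if start == end:
            continue  # special tokens usually carry an empty offset
        if label == outside:
            current = None  # an outside token closes the open span
            continue
        prefix, entity = split_tag(label)
        if current is not None and current["entity_group"] == entity and prefix != "B":
            current["end"] = end  # extend the open span
        else:
            current = {"entity_group": entity, "start": start, "end": end}
            spans.append(current)
    for span in spans:
        span["word"] = text[span["start"]:span["end"]]
    return spans


# Hand-written stand-in for tokenizer offsets and predicted labels.
text = "Sudan, Blue Nile"
labels = ["O", "B-country", "O", "B-state", "I-state", "O"]
offsets = [(0, 0), (0, 5), (5, 6), (7, 11), (12, 16), (0, 0)]

for span in merge_tokens(text, labels, offsets):
    print(span["entity_group"], repr(span["word"]))
# country 'Sudan'
# state 'Blue Nile'
```

Depending on the tokenizer, offsets may include a leading space, in which case the sliced `word` would need trimming and the offsets adjusting.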
85
+
86
+ ## Training
87
+
88
+ Fine-tuned for 5 epochs with the Hugging Face `Trainer`. Hyperparameters:
89
+
90
+ | Parameter | small / base | large |
91
+ |---|---|---|
92
+ | Learning rate | 5e-6 | 2e-6 |
93
+ | Batch size | 16 | 16 |
94
+ | LR scheduler | linear | linear |
95
+ | Warmup ratio | 0.06 | 0.06 |
96
+ | Weight decay | 0.01 | 0.01 |
97
+ | Max seq length | 128 | 128 |
98
+
99
+ Training data: ~22 000 insect collection label strings with character-span
100
+ annotations for the 10 entity types above.
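To give a feel for what these hyperparameters imply, the sketch below works out the approximate number of optimizer steps and warmup steps, and evaluates a linear warmup followed by linear decay to zero. It assumes that all ~22 000 strings are used for training (the train/eval split is not stated), that the batch size of 16 is the effective batch size, and that there is no gradient accumulation. `schedule_summary` and `lr_at` are illustrative names.

```python
import math


def schedule_summary(num_examples, batch_size, epochs, warmup_ratio):
    """Approximate step counts for a run with the given settings."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


def lr_at(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


summary = schedule_summary(num_examples=22_000, batch_size=16, epochs=5, warmup_ratio=0.06)
print(summary)
# {'steps_per_epoch': 1375, 'total_steps': 6875, 'warmup_steps': 413}

# Learning rate for the small / base setting (peak 5e-6) at the end of warmup.
print(lr_at(summary["warmup_steps"], 5e-6, summary["warmup_steps"], summary["total_steps"]))
```

Under these assumptions, warmup covers roughly the first 413 of about 6 875 steps, that is, less than a third of the first epoch.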