Add model card
README.md
CHANGED
|
@@ -1,3 +1,100 @@
|
|
| 1 |
-
---
|
| 2 |
-
|
| 3 |
-
-
---
language:
- en
license: apache-2.0
tags:
- token-classification
- ner
- biology
- entomology
- natural-history
- deberta
base_model:
- microsoft/deberta-v3-small
- microsoft/deberta-v3-base
- microsoft/deberta-v3-large
pipeline_tag: token-classification
---

# ento-label-deberta

DeBERTa-v3 models fine-tuned for NER on insect collection labels. Given a raw
label string, the model extracts semantic fields as verbatim character spans.

Three sizes are included in this repo: `small`, `base`, and `large`
(subdirectories of the same name). ONNX exports are in `onnx/small`,
`onnx/base`, and `onnx/large`.

## Entity types

| Label | Description |
|---|---|
| `country` | Country name |
| `state` | State, province, or region |
| `verbatim_locality` | Locality description |
| `verbatim_date` | Collection date as written |
| `verbatim_elevation` | Elevation as written |
| `verbatim_collectors` | Collector name(s) |
| `verbatim_habitat` | Habitat description |
| `verbatim_method` | Collection method |
| `verbatim_latitude` | Latitude as written |
| `verbatim_longitude` | Longitude as written |

## Evaluation results (macro F1 per entity)

| Entity | small | base | large |
|---|---|---|---|
| country | 0.9695 | 0.9749 | 0.9751 |
| state | 0.9046 | 0.9220 | 0.9212 |
| verbatim_locality | 0.8282 | 0.8499 | 0.8573 |
| verbatim_date | 0.9673 | 0.9700 | 0.9693 |
| verbatim_elevation | 0.9722 | 0.9742 | 0.9739 |
| verbatim_collectors | 0.4867 | 0.5393 | 0.5311 |
| verbatim_habitat | 0.7485 | 0.7751 | 0.7930 |
| verbatim_method | 0.9123 | 0.9205 | 0.9080 |
| verbatim_latitude | 0.7154 | 0.7145 | 0.6512 |
| verbatim_longitude | 0.8552 | 0.8528 | 0.7969 |
| **macro avg** | **0.8360** | **0.8493** | **0.8377** |

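
The **macro avg** row appears to be the unweighted mean of the ten per-entity scores. A minimal sketch that recomputes it from the table values (the `scores` dict and the `macro_average` helper are illustrative names, not part of this repo):

```python
# Per-entity F1 scores copied from the table above: (small, base, large).
scores = {
    "country": (0.9695, 0.9749, 0.9751),
    "state": (0.9046, 0.9220, 0.9212),
    "verbatim_locality": (0.8282, 0.8499, 0.8573),
    "verbatim_date": (0.9673, 0.9700, 0.9693),
    "verbatim_elevation": (0.9722, 0.9742, 0.9739),
    "verbatim_collectors": (0.4867, 0.5393, 0.5311),
    "verbatim_habitat": (0.7485, 0.7751, 0.7930),
    "verbatim_method": (0.9123, 0.9205, 0.9080),
    "verbatim_latitude": (0.7154, 0.7145, 0.6512),
    "verbatim_longitude": (0.8552, 0.8528, 0.7969),
}


def macro_average(scores, column):
    """Unweighted mean of one model column across all entity types."""
    values = [row[column] for row in scores.values()]
    return sum(values) / len(values)


for column, size in enumerate(("small", "base", "large")):
    print(size, f"{macro_average(scores, column):.4f}")
# small 0.8360
# base 0.8493
# large 0.8377
```

Because every entity type counts equally, the low `verbatim_collectors` score weighs on the average as much as any of the high-scoring fields.
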
## Usage (PyTorch)

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    pipeline,
)

# Each size lives in a subdirectory of the repo, so select it with
# `subfolder` instead of appending it to the repo id.
repo_id = "SpeciesFileGroup/ento-label-deberta"
size = "base"  # or "small" / "large"

tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=size)
model = AutoModelForTokenClassification.from_pretrained(repo_id, subfolder=size)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge sub-word tokens into entity spans
)

results = ner("Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori")
for r in results:
    print(r["entity_group"], repr(r["word"]))
# country 'Sudan'
# state 'Blue Nile'
# verbatim_locality 'Abu Hashim'
# verbatim_date '23-24.XI.1962'
# verbatim_collectors 'Linnavuori'
```

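
With `aggregation_strategy="simple"`, each pipeline result also carries `start` and `end` character offsets. The following sketch, under the assumption that downstream code wants one record per label, groups results by entity type and slices the original string so the values stay verbatim. The `spans_to_record` helper and the hand-written `mock_results` are hypothetical; the mock stands in for `ner(text)` so the example runs without downloading a model.

```python
from collections import defaultdict


def spans_to_record(text, results):
    """Group aggregated pipeline results into {label: [verbatim spans]}."""
    record = defaultdict(list)
    for r in sorted(results, key=lambda r: r["start"]):
        # Slice the input rather than trusting `word`, which the tokenizer
        # may reconstruct with different spacing.
        record[r["entity_group"]].append(text[r["start"]:r["end"]].strip())
    return dict(record)


text = "Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori"


def mock_span(label, value):
    """Build a result dict shaped like the pipeline's aggregated output."""
    start = text.index(value)
    return {"entity_group": label, "start": start, "end": start + len(value)}


mock_results = [
    mock_span("country", "Sudan"),
    mock_span("state", "Blue Nile"),
    mock_span("verbatim_locality", "Abu Hashim"),
    mock_span("verbatim_date", "23-24.XI.1962"),
    mock_span("verbatim_collectors", "Linnavuori"),
]

record = spans_to_record(text, mock_results)
print(record["verbatim_date"])
```

Values are kept as lists because a label can contain more than one span of the same type, for example two collectors.
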
## Usage (ONNX / hugot)

ONNX models are compatible with
[hugot](https://github.com/knights-analytics/hugot) and ONNX Runtime. Load
from `onnx/small`, `onnx/base`, or `onnx/large`.

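
When calling ONNX Runtime directly, a token-classification model returns raw per-token logits, so the span merging that the `transformers` pipeline performs has to be done by the caller. The sketch below shows one way to do that in plain Python. It rests on several assumptions that this card does not confirm: that the label names come from an `id2label` mapping shipped alongside the export, that labels may carry `B-`/`I-` prefixes, and that token offsets come from a fast tokenizer. The `decode_spans` function and the toy inputs are hypothetical.

```python
def decode_spans(text, logits, offsets, id2label):
    """Turn per-token logits into (label, verbatim text) spans.

    logits:  one list of class scores per token
    offsets: one (start, end) character pair per token; start == end
             marks a special token
    """
    spans = []   # each item: [label, start, end]
    prev = None  # entity label of the previous token, None when outside
    for scores, (start, end) in zip(logits, offsets):
        if start == end:  # special token, carries no text
            prev = None
            continue
        tag = id2label[max(range(len(scores)), key=scores.__getitem__)]
        # Assumption: tags may use BIO prefixes; strip them if present.
        begins = tag.startswith("B-")
        name = tag[2:] if tag[:2] in ("B-", "I-") else tag
        if name == "O":
            prev = None
            continue
        if name == prev and not begins:
            spans[-1][2] = end  # same entity continues: extend the span
        else:
            spans.append([name, start, end])
        prev = name
    return [(label, text[s:e].strip()) for label, s, e in spans]


# Toy inputs standing in for tokenizer offsets and model output.
text = "Sudan, Blue Nile"
id2label = {0: "O", 1: "B-country", 2: "B-state", 3: "I-state"}
offsets = [(0, 0), (0, 5), (5, 6), (6, 11), (11, 16), (0, 0)]
logits = [
    [9, 0, 0, 0],  # special token
    [0, 9, 0, 0],  # "Sudan"
    [9, 0, 0, 0],  # ","
    [0, 0, 9, 0],  # " Blue"
    [0, 0, 0, 9],  # " Nile"
    [9, 0, 0, 0],  # special token
]
spans = decode_spans(text, logits, offsets, id2label)
print(spans)
```

Slicing `text` by character offsets keeps the extracted values verbatim, which matches how the fields are described above.
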
## Training

Fine-tuned for 5 epochs with the Hugging Face `Trainer`. Hyperparameters:

| Parameter | small / base | large |
|---|---|---|
| Learning rate | 5e-6 | 2e-6 |
| Batch size | 16 | 16 |
| LR scheduler | linear | linear |
| Warmup ratio | 0.06 | 0.06 |
| Weight decay | 0.01 | 0.01 |
| Max seq length | 128 | 128 |

Training data: ~22 000 insect collection label strings with character-span
annotations for the 10 entity types above.

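
To give a feel for what these hyperparameters imply, here is a rough sketch of the step count and the linear schedule with warmup. It assumes all ~22 000 examples are used for training, a single device, and no gradient accumulation, none of which this card states, so the numbers are estimates only. The helper names are illustrative.

```python
import math


def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps, assuming one device and no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_lr(step, peak_lr, total, warmup_ratio):
    """Linear warmup to peak_lr, then linear decay to zero."""
    warmup = math.ceil(total * warmup_ratio)
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    return peak_lr * max(0.0, (total - step) / max(1, total - warmup))


total = total_steps(22_000, 16, 5)
warmup = math.ceil(total * 0.06)
print(total)   # 6875
print(warmup)  # 413

# Learning rate for the small / base configuration at a few points.
for step in (0, warmup, total // 2, total):
    print(step, linear_lr(step, 5e-6, total, 0.06))
```

Under these assumptions the learning rate peaks after roughly 400 steps and then decays linearly to zero by the end of the fifth epoch.
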