brabus61 committed on
Commit 7153100
1 Parent(s): c5ae176

Customize model card.

Files changed (1):
  1. README.md +153 -39
README.md CHANGED
@@ -11,46 +11,160 @@ model-index:
  name: NER
  type: token-classification
  metrics:
- - name: NER Precision
+ - name: Token Precision
  type: precision
- value: 0.5179593961
+ value: 0.62
- - name: NER Recall
+ - name: Token Recall
  type: recall
- value: 0.7236363636
+ value: 0.84
- - name: NER F Score
+ - name: Token F1 Score
  type: f_score
- value: 0.6037621359
+ value: 0.72
+ license: apache-2.0
+ metrics:
+ - recall
+ library_name: transformers
+ pipeline_tag: token-classification
  ---
- | Feature | Description |
- | --- | --- |
- | **Name** | `en_finding_fossils_transformer` |
- | **Version** | `0.0.0` |
- | **spaCy** | `>=3.5.3,<3.6.0` |
- | **Default Pipeline** | `transformer`, `ner` |
- | **Components** | `transformer`, `ner` |
- | **Vectors** | 514157 keys, 20000 unique vectors (300 dimensions) |
- | **Sources** | n/a |
- | **License** | n/a |
- | **Author** | [n/a]() |
-
- ### Label Scheme
-
- <details>
-
- <summary>View label scheme (7 labels for 1 components)</summary>
-
- | Component | Labels |
- | --- | --- |
- | **`ner`** | `AGE`, `ALTI`, `EMAIL`, `GEOG`, `REGION`, `SITE`, `TAXA` |
-
- </details>
-
- ### Accuracy
-
- | Type | Score |
- | --- | --- |
- | `ENTS_F` | 60.38 |
- | `ENTS_P` | 51.80 |
- | `ENTS_R` | 72.36 |
- | `TRANSFORMER_LOSS` | 31032.58 |
- | `NER_LOSS` | 37103.45 |
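As a sanity check, the removed card's `ENTS_F` value is the harmonic mean of its `ENTS_P` and `ENTS_R` values; a quick verification sketch:

```python
# Precision and recall reported in the removed model card.
p = 0.5179593961
r = 0.7236363636

# The F-score is the harmonic mean of precision and recall.
f1 = 2 * p * r / (p + r)
print(round(f1, 4))  # → 0.6038
```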
+
+ <img src="https://huggingface.co/finding-fossils/metaextractor/resolve/main/ffossils-logo-text.png" width="400">
+
+ # Finding Fossils - spaCy Transformer
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+ This model extracts metadata from research articles related to paleoecology.
+
+ The entities detected by this model are:
+ - **AGE**: mentions of historical ages, such as 1234 AD or 4567 BP (before present)
+ - **TAXA**: plant or animal taxa names indicating what the samples contained
+ - **GEOG**: geographic coordinates indicating where samples were excavated, e.g. 12'34"N 34'23"W
+ - **SITE**: names of the sites from which samples were excavated
+ - **REGION**: more general regions providing context for where sites are located
+ - **EMAIL**: researcher email addresses in the articles, usable for follow-up contact
+ - **ALTI**: altitudes of the sites from which samples were excavated, e.g. 123 m a.s.l. (above sea level)
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ - **Developed by:** Ty Andrews, Jenit Jain, Shaun Hutchinson, Kelly Wu, and Simon Goring
+ - **Shared by:** Neotoma Paleoecology Database
+ - **Model type:** Token Classification
+ - **Language(s) (NLP):** English
+ - **License:** MIT
+ - **Text Embeddings:** roberta-base
+ - **Named Entity Recognition:** spaCy transition-based S-LSTMs
+
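A two-component pipeline like the one described above is typically declared in a spaCy `config.cfg`; the fragment below is an illustrative sketch of that format, not the configuration shipped with this model (the component settings and architecture version are assumptions):

```ini
[nlp]
lang = "en"
pipeline = ["transformer","ner"]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"

[components.ner]
factory = "ner"
```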
+ ### Model Sources
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** https://github.com/NeotomaDB/MetaExtractor
+ - **Paper:** https://arxiv.org/pdf/1603.01360.pdf
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ This model can be used to extract entities from any text that is paleoecology-related, or tangential to it. Potential uses include identifying unique SITE names in research papers from other domains.
+
+ ### Direct Use
+
+ This model is deployed on the xDD (formerly GeoDeepDive) servers, where it is fed new research articles relevant to Neotoma and returns the extracted data.
+
+ This approach could be adapted to other domains by using the training and development code found at [github.com/NeotomaDB/MetaExtractor](https://github.com/NeotomaDB/MetaExtractor) to run similar data extraction for other research domains.
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ This model was trained entirely on English research articles and will likely not perform well on research in other languages. In addition, the training articles were chosen because they were already present in the Neotoma database, which introduces selection bias: they represent what is already known to be relevant to Neotoma, so the model may not handle new, previously missed articles correctly.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ ```bash
+ pip install https://huggingface.co/brabus61/en_finding_fossils_transformer/resolve/main/en_finding_fossils_transformer-any-py3-none-any.whl
+ ```
+
+ ```python
+ # Option 1: load via spacy.load().
+ import spacy
+ nlp = spacy.load("en_finding_fossils_transformer")
+
+ # Option 2: import the installed package as a module.
+ import en_finding_fossils_transformer
+ ner_pipe = en_finding_fossils_transformer.load()
+ doc = ner_pipe("In Northern Canada, the BGC site core was primarily made up of Pinus pollen.")
+
+ entities = []
+ for ent in doc.ents:
+     entities.append({
+         "start": ent.start_char,
+         "end": ent.end_char,
+         "labels": [ent.label_],
+         "text": ent.text
+     })
+
+ print(entities)
+
+ # Output:
+ # [
+ #     {"start": 3, "end": 19, "labels": ["REGION"], "text": "Northern Canada,"},
+ #     {"start": 24, "end": 27, "labels": ["SITE"], "text": "BGC"},
+ #     {"start": 63, "end": 68, "labels": ["TAXA"], "text": "Pinus"}
+ # ]
+ ```
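Downstream consumers often need the extracted spans grouped by entity type. A minimal post-processing sketch using only the standard library (`group_by_label` is an illustrative helper name, not part of the package; the `entities` list mirrors the example output above):

```python
from collections import defaultdict

# Example spans in the format produced by the snippet above.
entities = [
    {"start": 3, "end": 19, "labels": ["REGION"], "text": "Northern Canada,"},
    {"start": 24, "end": 27, "labels": ["SITE"], "text": "BGC"},
    {"start": 63, "end": 68, "labels": ["TAXA"], "text": "Pinus"},
]

def group_by_label(ents):
    """Map each entity label to the list of span texts that carry it."""
    grouped = defaultdict(list)
    for ent in ents:
        for label in ent["labels"]:
            grouped[label].append(ent["text"])
    return dict(grouped)

print(group_by_label(entities))
# → {'REGION': ['Northern Canada,'], 'SITE': ['BGC'], 'TAXA': ['Pinus']}
```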
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ The model was trained using a set of 39 research articles deemed relevant to the Neotoma Database, all written in English. The entities were labelled by the project team, with pre-labelling by early models used to speed up the labelling process.
+
+ A 70/15/15 train/validation/test split was used, with the following breakdown of words and entities:
+
+ | | Train | Validation | Test |
+ |---|:---:|:---:|:---:|
+ | Articles | 28 | 6 | 6 |
+ | Words | 220857 | 37809 | 36098 |
+ | TAXA Entities | 3352 | 650 | 570 |
+ | SITE Entities | 1228 | 177 | 219 |
+ | REGION Entities | 2314 | 318 | 258 |
+ | GEOG Entities | 188 | 37 | 8 |
+ | AGE Entities | 919 | 206 | 153 |
+ | ALTI Entities | 99 | 24 | 14 |
+ | EMAIL Entities | 14 | 4 | 11 |
+
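The article counts in the table above are consistent with the stated 70/15/15 split; a quick check using only the table's numbers:

```python
# Article counts per split, taken from the table above.
splits = {"train": 28, "validation": 6, "test": 6}
total = sum(splits.values())

# Fraction of articles in each split.
fractions = {name: count / total for name, count in splits.items()}
print(fractions)  # → {'train': 0.7, 'validation': 0.15, 'test': 0.15}
```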
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ For full training details, please see the GitHub repository and wiki: [github.com/NeotomaDB/MetaExtractor](https://github.com/NeotomaDB/MetaExtractor)
+
+ ## Results & Metrics
+
+ For full model results, see the report here: [Final Project Report](https://github.com/NeotomaDB/MetaExtractor/blob/main/reports/final/finding-fossils-final.pdf)