tomaarsen (HF staff) committed
Commit 5977e24
1 Parent(s): 1a51878

Update README significantly

Files changed (1):
  1. README.md +147 -9
README.md CHANGED
@@ -6,6 +6,7 @@ tags:
 - token-classification
 - ner
 - named-entity-recognition
+ - generated_from_span_marker_trainer
 pipeline_tag: token-classification
 widget:
 - text: "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."
@@ -46,19 +47,100 @@ metrics:
 - precision
 ---

- # SpanMarker for Named Entity Recognition

- This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for Named Entity Recognition. In particular, this SpanMarker model uses [bert-base-cased](https://huggingface.co/bert-base-cased) as the underlying encoder.

- ## Usage

- To use this model for inference, first install the `span_marker` library:

- ```bash
- pip install span_marker
- ```

- You can then run inference with this model like so:
+ # SpanMarker with bert-base-cased on FewNERD

+ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [bert-base-cased](https://huggingface.co/bert-base-cased) as the underlying encoder.

+ ## Model Details

+ ### Model Description

+ - **Model Type:** SpanMarker
+ - **Encoder:** [bert-base-cased](https://huggingface.co/bert-base-cased)
+ - **Maximum Sequence Length:** 256 tokens
+ - **Maximum Entity Length:** 8 words
+ - **Training Dataset:** [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd)
+ - **Language:** en
+ - **License:** cc-by-sa-4.0
+
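As an illustrative sketch of how the settings above map onto the SpanMarker API (not something this commit adds): a comparable model could be initialized from the encoder roughly as follows. The label list is a placeholder and the keyword names are assumptions based on the SpanMarker documentation, not this repository's actual training script.

```python
from span_marker import SpanMarkerModel

# Placeholder label subset; the released model uses the 66 fine-grained FewNERD labels.
labels = ["O", "art-film", "location-GPE", "person-actor", "product-software"]

# Assumed mapping of the listed settings onto SpanMarkerModel.from_pretrained:
model = SpanMarkerModel.from_pretrained(
    "bert-base-cased",     # underlying encoder
    labels=labels,
    model_max_length=256,  # "Maximum Sequence Length: 256 tokens"
    entity_max_length=8,   # "Maximum Entity Length: 8 words"
)
```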
+ ### Model Sources
+
+ - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
+ - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
+
+ ### Model Labels
+ | Label | Examples |
+ |:------|:---------|
+ | art-broadcastprogram | "Street Cents", "Corazones", "The Gale Storm Show : Oh , Susanna" |
+ | art-film | "Bosch", "L'Atlantide", "Shawshank Redemption" |
+ | art-music | "Atkinson , Danko and Ford ( with Brockie and Hilton )", "Champion Lover", "Hollywood Studio Symphony" |
+ | art-other | "Aphrodite of Milos", "Venus de Milo", "The Today Show" |
+ | art-painting | "Production/Reproduction", "Touit", "Cofiwch Dryweryn" |
+ | art-writtenart | "Imelda de ' Lambertazzi", "Time", "The Seven Year Itch" |
+ | building-airport | "Luton Airport", "Newark Liberty International Airport", "Sheremetyevo International Airport" |
+ | building-hospital | "Hokkaido University Hospital", "Yeungnam University Hospital", "Memorial Sloan-Kettering Cancer Center" |
+ | building-hotel | "The Standard Hotel", "Radisson Blu Sea Plaza Hotel", "Flamingo Hotel" |
+ | building-library | "British Library", "Berlin State Library", "Bayerische Staatsbibliothek" |
+ | building-other | "Communiplex", "Alpha Recording Studios", "Henry Ford Museum" |
+ | building-restaurant | "Fatburger", "Carnegie Deli", "Trumbull" |
+ | building-sportsfacility | "Glenn Warner Soccer Facility", "Boston Garden", "Sports Center" |
+ | building-theater | "Pittsburgh Civic Light Opera", "Sanders Theatre", "National Paris Opera" |
+ | event-attack/battle/war/militaryconflict | "Easter Offensive", "Vietnam War", "Jurist" |
+ | event-disaster | "the 1912 North Mount Lyell Disaster", "1693 Sicily earthquake", "1990s North Korean famine" |
+ | event-election | "March 1898 elections", "1982 Mitcham and Morden by-election", "Elections to the European Parliament" |
+ | event-other | "Eastwood Scoring Stage", "Union for a Popular Movement", "Masaryk Democratic Movement" |
+ | event-protest | "French Revolution", "Russian Revolution", "Iranian Constitutional Revolution" |
+ | event-sportsevent | "National Champions", "World Cup", "Stanley Cup" |
+ | location-GPE | "Mediterranean Basin", "the Republic of Croatia", "Croatian" |
+ | location-bodiesofwater | "Atatürk Dam Lake", "Norfolk coast", "Arthur Kill" |
+ | location-island | "Laccadives", "Staten Island", "new Samsat district" |
+ | location-mountain | "Salamander Glacier", "Miteirya Ridge", "Ruweisat Ridge" |
+ | location-other | "Northern City Line", "Victoria line", "Cartuther" |
+ | location-park | "Gramercy Park", "Painted Desert Community Complex Historic District", "Shenandoah National Park" |
+ | location-road/railway/highway/transit | "Friern Barnet Road", "Newark-Elizabeth Rail Link", "NJT" |
+ | organization-company | "Dixy Chicken", "Texas Chicken", "Church 's Chicken" |
+ | organization-education | "MIT", "Belfast Royal Academy and the Ulster College of Physical Education", "Barnard College" |
+ | organization-government/governmentagency | "Congregazione dei Nobili", "Diet", "Supreme Court" |
+ | organization-media/newspaper | "TimeOut Melbourne", "Clash", "Al Jazeera" |
+ | organization-other | "Defence Sector C", "IAEA", "4th Army" |
+ | organization-politicalparty | "Shimpotō", "Al Wafa ' Islamic", "Kenseitō" |
+ | organization-religion | "Jewish", "Christian", "UPCUSA" |
+ | organization-showorganization | "Lizzy", "Bochumer Symphoniker", "Mr. Mister" |
+ | organization-sportsleague | "China League One", "First Division", "NHL" |
+ | organization-sportsteam | "Tottenham", "Arsenal", "Luc Alphand Aventures" |
+ | other-astronomything | "Zodiac", "Algol", "`` Caput Larvae ''" |
+ | other-award | "GCON", "Order of the Republic of Guinea and Nigeria", "Grand Commander of the Order of the Niger" |
+ | other-biologything | "N-terminal lipid", "BAR", "Amphiphysin" |
+ | other-chemicalthing | "uranium", "carbon dioxide", "sulfur" |
+ | other-currency | "$", "Travancore Rupee", "lac crore" |
+ | other-disease | "French Dysentery Epidemic of 1779", "hypothyroidism", "bladder cancer" |
+ | other-educationaldegree | "Master", "Bachelor", "BSc ( Hons ) in physics" |
+ | other-god | "El", "Fujin", "Raijin" |
+ | other-language | "Breton-speaking", "English", "Latin" |
+ | other-law | "Thirty Years ' Peace", "Leahy–Smith America Invents Act ( AIA", "United States Freedom Support Act" |
+ | other-livingthing | "insects", "monkeys", "patchouli" |
+ | other-medical | "Pediatrics", "amitriptyline", "pediatrician" |
+ | person-actor | "Ellaline Terriss", "Tchéky Karyo", "Edmund Payne" |
+ | person-artist/author | "George Axelrod", "Gaetano Donizett", "Hicks" |
+ | person-athlete | "Jaguar", "Neville", "Tozawa" |
+ | person-director | "Bob Swaim", "Richard Quine", "Frank Darabont" |
+ | person-other | "Richard Benson", "Holden", "Campbell" |
+ | person-politician | "William", "Rivière", "Emeric" |
+ | person-scholar | "Stedman", "Wurdack", "Stalmine" |
+ | person-soldier | "Helmuth Weidling", "Krukenberg", "Joachim Ziegler" |
+ | product-airplane | "Luton", "Spey-equipped FGR.2s", "EC135T2 CPDS" |
+ | product-car | "100EX", "Corvettes - GT1 C6R", "Phantom" |
+ | product-food | "red grape", "yakiniku", "V. labrusca" |
+ | product-game | "Airforce Delta", "Hardcore RPG", "Splinter Cell" |
+ | product-other | "Fairbottom Bobs", "X11", "PDP-1" |
+ | product-ship | "Congress", "Essex", "HMS `` Chinkara ''" |
+ | product-software | "AmiPDF", "Apdf", "Wikipedia" |
+ | product-train | "High Speed Trains", "55022", "Royal Scots Grey" |
+ | product-weapon | "AR-15 's", "ZU-23-2M Wróbel", "ZU-23-2MR Wróbel II" |
+
+ ## Uses

+ ### Direct Use

 ```python
 from span_marker import SpanMarkerModel
@@ -69,4 +151,60 @@ model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd
 entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
 ```

- See the [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) repository for documentation and additional information on this library.
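As a small usage sketch building on the Direct Use snippet above (not part of this commit), the list returned by `model.predict` can be inspected entity by entity. The dictionary keys `span`, `label`, and `score` are assumptions based on the SpanMarker documentation and should be verified against the installed version.

```python
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")

# Each prediction is a dict; the key names below are assumed from the SpanMarker docs.
for entity in entities:
    print(f'{entity["span"]!r:45} {entity["label"]:25} {entity["score"]:.2%}')
```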
+ ### Downstream Use
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ ```python
+ from datasets import load_dataset
+ from span_marker import SpanMarkerModel, Trainer
+
+ # Download from the 🤗 Hub
+ model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
+
+ # Specify a Dataset with "tokens" and "ner_tags" columns
+ dataset = load_dataset("conll2003")  # For example CoNLL2003
+
+ # Initialize a Trainer using the pretrained model & dataset
+ trainer = Trainer(
+     model=model,
+     train_dataset=dataset["train"],
+     eval_dataset=dataset["validation"],
+ )
+ trainer.train()
+ trainer.save_model("tomaarsen/span-marker-bert-base-fewnerd-fine-super-finetuned")
+ ```
+ </details>
+
+ ## Training Details
+
+ ### Training Set Metrics
+ | Training set          | Min | Median  | Max |
+ |:----------------------|:----|:--------|:----|
+ | Sentence length       | 1   | 24.4945 | 267 |
+ | Entities per sentence | 0   | 2.5832  | 88  |
+
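These figures are per-sentence token counts and entity-annotation counts over the training split. The sketch below shows roughly how the sentence-length row could be recomputed; the `supervised` configuration name and the `tokens` column are assumptions about the FewNERD dataset on the Hub, the entity counts would additionally require grouping the tag columns into spans, and the aggregates may not match the table exactly depending on how the card was generated.

```python
from statistics import median

from datasets import load_dataset

# Assumed config ("supervised") and column ("tokens") for DFKI-SLT/few-nerd.
train = load_dataset("DFKI-SLT/few-nerd", "supervised", split="train")
lengths = [len(tokens) for tokens in train["tokens"]]
print(min(lengths), median(lengths), max(lengths))
```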
+ ### Training Hyperparameters
+ - learning_rate: 5e-05
+ - train_batch_size: 32
+ - eval_batch_size: 32
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 3
+
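The optimizer entry corresponds to the default Adam settings in 🤗 Transformers, and the remaining values map directly onto `TrainingArguments`. Below is a hedged sketch of an equivalent configuration; the `output_dir` is a placeholder, and passing these to the SpanMarker `Trainer` through its `args` parameter is an assumption based on the SpanMarker and Transformers documentation rather than the exact script used for this model.

```python
from transformers import TrainingArguments

# The hyperparameters listed above expressed as TrainingArguments; adam_beta1,
# adam_beta2 and adam_epsilon keep their defaults of 0.9, 0.999 and 1e-8.
args = TrainingArguments(
    output_dir="models/span-marker-bert-base-fewnerd-fine-super",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=3,
)
# A span_marker.Trainer would then receive this via its `args` parameter.
```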
+ ### Training Hardware
+ - **On Cloud**: No
+ - **GPU Model**: 1 x NVIDIA GeForce RTX 3090
+ - **CPU Model**: 13th Gen Intel(R) Core(TM) i7-13700K
+ - **RAM Size**: 31.78 GB
+
+ ### Framework Versions
+
+ - Python: 3.9.16
+ - SpanMarker: 1.3.1.dev
+ - Transformers: 4.29.2
+ - PyTorch: 2.0.1+cu118
+ - Datasets: 2.14.3
+ - Tokenizers: 0.13.2
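To check that a local environment roughly matches the versions listed above, the installed versions can be read at runtime with the standard library. The distribution names below are assumptions about how these packages are published on PyPI.

```python
from importlib.metadata import version

# Assumed PyPI distribution names for the frameworks listed above.
for dist in ["span_marker", "transformers", "torch", "datasets", "tokenizers"]:
    print(dist, version(dist))
```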