---
base_model: roberta-base
datasets:
- YurtsAI/named_entity_recognition_document_context
language:
- en
library_name: span-marker
metrics:
- precision
- recall
- f1
pipeline_tag: token-classification
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
widget:
- text: We have Kanye West, Beyoncé, and Taylor Swift performing at the beachside
    park on the island of Maui.
- text: This book, published by Epic Games and sponsored by the University of Hawaii,
    features recipes inspired by the popular game League of Legends and a foreword
    by renowned food scholar, Dr. Thomas Johnson, a professor at Harvard University.
- text: The National Institute of Technology has partnered with CafeCorp to provide
    a menu planning template for businesses in the downtown area.
- text: The marketing efforts for the Chicago Bulls basketball team in Wrigley Park
    were a huge success, with 80% of attendees speaking Spanish.
- text: The most important thing was to try using the coconut oil from a tiny store
    near the river, and a sprinkle of Japanese spices I learned from my friend who
    speaks fluent Japanese.
model-index:
- name: SpanMarker with roberta-base on YurtsAI/named_entity_recognition_document_context
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: Unknown
      type: YurtsAI/named_entity_recognition_document_context
      split: eval
    metrics:
    - type: f1
      value: 0.3902777777777778
      name: F1
    - type: precision
      value: 0.6189427312775331
      name: Precision
    - type: recall
      value: 0.28498985801217036
      name: Recall
---

# SpanMarker with roberta-base on YurtsAI/named_entity_recognition_document_context

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [YurtsAI/named_entity_recognition_document_context](https://huggingface.co/datasets/YurtsAI/named_entity_recognition_document_context) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [roberta-base](https://huggingface.co/roberta-base) as the underlying encoder.

## Model Details

### Model Description
- **Model Type:** SpanMarker
- **Encoder:** [roberta-base](https://huggingface.co/roberta-base)
- **Maximum Sequence Length:** 256 tokens
- **Maximum Entity Length:** 11 words
- **Training Dataset:** [YurtsAI/named_entity_recognition_document_context](https://huggingface.co/datasets/YurtsAI/named_entity_recognition_document_context)
- **Language:** en
<!-- - **License:** Unknown -->

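To make the maximum entity length concrete: a span-based tagger scores candidate spans of words rather than single tokens, and the 11-word cap bounds how long a candidate can be. The sketch below counts candidates for one sentence under that cap. It is an illustration in plain Python, not SpanMarker's internal code; `enumerate_spans` is a hypothetical helper, and treating every contiguous run of up to 11 words as a candidate is a simplifying assumption.

```python
from typing import List, Tuple

MAX_ENTITY_LENGTH = 11  # "Maximum Entity Length" from the model description, in words


def enumerate_spans(words: List[str], max_len: int = MAX_ENTITY_LENGTH) -> List[Tuple[int, int]]:
    """Return (start, end) word indices, end exclusive, of every span of at most max_len words."""
    spans = []
    for start in range(len(words)):
        # The span may not run past the sentence end or exceed max_len words.
        for end in range(start + 1, min(start + max_len, len(words)) + 1):
            spans.append((start, end))
    return spans


# Assumption: a pre-tokenized, shortened variant of the first widget sentence.
sentence = "We have Kanye West , Beyoncé , and Taylor Swift performing on Maui .".split()
spans = enumerate_spans(sentence)
print(len(sentence), "words ->", len(spans), "candidate spans")
```

The count grows roughly linearly with sentence length once sentences are longer than the cap, which is one reason such a cap is useful.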
### Model Sources

- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels
| Label | Examples |
|:------|:---------|
| art-broadcastprogram | "television program", "Origin of the Gods", "reality show" |
| art-film | "a video of a successful grant proposal", "'The Matrix '", "film crew" |
| art-music | "a new album by Beyoncé", "Yesterday by The Beatles", "favorite music CD" |
| art-other | "art therapy", "play", "Mona Lisa" |
| art-painting | "vibrant street art scene", "through art", "painting" |
| art-writtenart | "'The Lost Gods '", "Book 1", "environmental science book" |
| building-airport | "airport", "major airport", "an airport" |
| building-hospital | "New York hospital", "local hospital", "hospital" |
| building-hotel | "hotel", "new hotel in Austin", "a giant hotel" |
| building-library | "new library", "library", "new , state-of-the-art library" |
| building-other | "10-story building", "headquarters building", "factory building" |
| building-restaurant | "new restaurant", "our upscale restaurant", "restaurant" |
| building-sportsfacility | "sports facility", "Union Park Sports Complex", "city 's sports center" |
| building-theater | "the local theater", "theater in downtown", "theater" |
| datetime-absolute | "January 10 , 2020", "January 17 , 2025 at 14:00", "March 25th" |
| datetime-authored | "2023-02-22", "2019-04-15", "2020-02-15" |
| datetime-range | "2010-2015", "Q4 2019", "Friday to Sunday" |
| datetime-relative | "next week 's appointment", "last Saturday", "next week" |
| event-attack/battle/war/militaryconflict | "attacks/wars", "The", "A" |
| event-disaster | "My", "To", "disaster" |
| event-election | "the election for the mayor", "upcoming election", "election season" |
| event-other | "conference", "annual 4th of july BBQ", "charity gala" |
| event-protest | "protest", "protest last saturday", "protest rally" |
| event-sportsevent | "sports event", "annual tennis tournament", "biggest sports event of the year" |
| location-bodiesofwater | "ocean", "Lake Como", "Lake Michigan" |
| location-gpe | "Italy", "Texas", "city" |
| location-island | "Island Radio", "Caribbean island", "island" |
| location-mountain | "mountain terrain", "the mountain", "mountain" |
| location-other | "low-lying areas of the city", "advertising hub", "backyard" |
| location-park | "park", "location-park", "the park" |
| location-road/railway/highway/transit | "Greyhound network", "road", "train journey" |
| organization-company | "local company", "Verizon", "a company" |
| organization-education | "Harvard University", "UW", "University of Arizona" |
| organization-government/governmentagency | "Red Cross", "local government", "SEC" |
| organization-media/newspaper | "The New York Times", "media organizations", "Army Times" |
| organization-other | "Cognizant", "Better World Foundation", "conservation organization" |
| organization-politicalparty | "Spaceship of Progress Party", "Libertarian Party", "Green Party" |
| organization-religion | "local church", "the power of prayer", "diamatists" |
| organization-showorganization | "Royal Shakespeare Company", "Earth 's Edge Theater Company", "Cosmic Theater group" |
| organization-sportsleague | "International Swimming Federation", "NBA league", "NFL" |
| organization-sportsteam | "soccer team", "Syracuse Orange football team", "Seattle Seahawks" |
| other-astronomything | "latest discoveries in the field of astronomy", "Galactic Conference Best Recipe Award-winning recipe book", "astronomy camp" |
| other-award | "other-award", "annual tech show awards", "Nobel Peace Prize" |
| other-biologything | "salmon 's gene for cold adaptation", "terrain", "the forces that drive you" |
| other-chemicalthing | "Overall", "The", "In" |
| other-currency | "US dollars", "Japanese Yen", "$ 500,000" |
| other-disease | "malaria", "type 1 diabetes", "the common cold" |
| other-educationaldegree | "master 's degree", "thesis", "Ph.D in food science" |
| other-god | "Peter Pan", "divine", "Zeus the god" |
| other-language | "English", "Amharic", "Sanskrit" |
| other-law | "legislation", "professorial separation laws", "Clean Air Act" |
| other-livingthing | "We", "To", "flowers" |
| other-medical | "antibiotics", "medical treatment", "necessary testing protocols" |
| person-actor | "Emma Stone", "Dr. Steven Spielberg", "Jennifer Lawrence" |
| person-artist/author | "Chuck Close", "artist 's new album", "Jane Smith" |
| person-athlete | "athlete friend", "LeBron James", "John and Sally" |
| person-director | "John Oliver", "favorite director", "Dr. Johnson" |
| person-other | "your", "HR representative", "therapist or counselor" |
| person-politician | "To", "At", "Secretary of State" |
| person-scholar | "Dr. John Smith", "Dr. Johnson", "a scholar of comparative religion" |
| person-soldier | "veterans", "the brave soldiers", "a soldier" |
| product-airplane | "Cessna 172", "company 's fleet of private airplanes", "airline" |
| product-car | "leased car", "your car", "car" |
| product-food | "StarBites", "food truck business", "ice cream" |
| product-game | "the 'Train to Nowhere ' game", "board game", "screen protector" |
| product-other | "new medicine", "acting software", "table" |
| product-ship | "research ship", "ship", "a ship" |
| product-software | "software", "instruction manual", "pizza ordering app" |
| product-train | "Universal Sonicator", "train", "the train" |
| product-weapon | "Flip Flops", "Sno Blaster", "SecurityFirst" |

## Evaluation

### Metrics
| Label | Precision | Recall | F1 |
|:------|:----------|:-------|:---|
| **all** | 0.6189 | 0.2850 | 0.3903 |
| art-broadcastprogram | 0.0 | 0.0 | 0.0 |
| art-film | 0.0 | 0.0 | 0.0 |
| art-music | 0.6667 | 0.2 | 0.3077 |
| art-other | 0.0 | 0.0 | 0.0 |
| art-painting | 0.0 | 0.0 | 0.0 |
| art-writtenart | 0.0 | 0.0 | 0.0 |
| building-airport | 0.7143 | 0.7692 | 0.7407 |
| building-hospital | 0.6667 | 0.7778 | 0.7179 |
| building-hotel | 0.7857 | 0.6875 | 0.7333 |
| building-library | 0.8182 | 0.75 | 0.7826 |
| building-other | 0.0 | 0.0 | 0.0 |
| building-restaurant | 0.8571 | 0.375 | 0.5217 |
| building-sportsfacility | 0.6667 | 0.5 | 0.5714 |
| building-theater | 0.9 | 0.5625 | 0.6923 |
| datetime-absolute | 0.3333 | 0.0769 | 0.125 |
| datetime-authored | 0.55 | 0.8462 | 0.6667 |
| datetime-range | 0.75 | 0.5 | 0.6 |
| datetime-relative | 0.0 | 0.0 | 0.0 |
| event-attack/battle/war/militaryconflict | 0.8 | 0.2857 | 0.4211 |
| event-disaster | 0.5385 | 0.5 | 0.5185 |
| event-election | 0.75 | 0.5 | 0.6 |
| event-other | 0.0 | 0.0 | 0.0 |
| event-protest | 0.5455 | 0.4615 | 0.5000 |
| event-sportsevent | 0.625 | 0.3846 | 0.4762 |
| location-bodiesofwater | 0.8333 | 0.3571 | 0.5 |
| location-gpe | 0.375 | 0.2143 | 0.2727 |
| location-island | 0.7143 | 0.3333 | 0.4545 |
| location-mountain | 0.5882 | 0.625 | 0.6061 |
| location-other | 0.0 | 0.0 | 0.0 |
| location-park | 0.6667 | 0.5 | 0.5714 |
| location-road/railway/highway/transit | 0.8 | 0.5333 | 0.64 |
| organization-company | 0.0 | 0.0 | 0.0 |
| organization-education | 0.3077 | 0.2857 | 0.2963 |
| organization-government/governmentagency | 0.25 | 0.0909 | 0.1333 |
| organization-media/newspaper | 0.5833 | 0.4667 | 0.5185 |
| organization-other | 1.0 | 0.0769 | 0.1429 |
| organization-politicalparty | 0.75 | 0.2727 | 0.4000 |
| organization-religion | 1.0 | 0.3077 | 0.4706 |
| organization-showorganization | 0.75 | 0.25 | 0.375 |
| organization-sportsleague | 0.8571 | 0.4286 | 0.5714 |
| organization-sportsteam | 0.4286 | 0.5 | 0.4615 |
| other-astronomything | 0.0 | 0.0 | 0.0 |
| other-award | 1.0 | 0.2143 | 0.3529 |
| other-biologything | 0.0 | 0.0 | 0.0 |
| other-chemicalthing | 0.4 | 0.3077 | 0.3478 |
| other-currency | 1.0 | 0.2143 | 0.3529 |
| other-disease | 0.5714 | 0.3077 | 0.4 |
| other-educationaldegree | 0.5833 | 0.5833 | 0.5833 |
| other-god | 0.8 | 0.2222 | 0.3478 |
| other-language | 0.8 | 0.2857 | 0.4211 |
| other-law | 0.6667 | 0.5 | 0.5714 |
| other-livingthing | 0.0 | 0.0 | 0.0 |
| other-medical | 0.0 | 0.0 | 0.0 |
| person-actor | 0.3448 | 0.5 | 0.4082 |
| person-artist/author | 0.6667 | 0.1429 | 0.2353 |
| person-athlete | 0.6667 | 0.2353 | 0.3478 |
| person-director | 0.2 | 0.0714 | 0.1053 |
| person-other | 0.0 | 0.0 | 0.0 |
| person-politician | 0.6667 | 0.0952 | 0.1667 |
| person-scholar | 0.4118 | 0.4667 | 0.4375 |
| person-soldier | 0.0 | 0.0 | 0.0 |
| product-airplane | 0.75 | 0.3333 | 0.4615 |
| product-car | 1.0 | 0.2143 | 0.3529 |
| product-food | 0.0 | 0.0 | 0.0 |
| product-game | 1.0 | 0.1333 | 0.2353 |
| product-other | 0.5 | 0.0909 | 0.1538 |
| product-ship | 0.75 | 0.3 | 0.4286 |
| product-software | 1.0 | 0.4167 | 0.5882 |
| product-train | 0.5556 | 0.3571 | 0.4348 |
| product-weapon | 0.3333 | 0.0625 | 0.1053 |

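The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick check, the sketch below recomputes the **all** row from the unrounded precision and recall given in the model card metadata. `f1_score` is a helper written for this example, not part of SpanMarker.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Unrounded values from the model-index metadata of this card.
precision_all = 0.6189427312775331
recall_all = 0.28498985801217036

f1_all = f1_score(precision_all, recall_all)
print(f"{f1_all:.4f}")  # 0.3903
```

Because the harmonic mean is pulled toward the smaller of the two values, the overall F1 sits much closer to the recall (0.2850) than to the precision (0.6189).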
## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("YurtsAI/named_entity_recognition_document_context")
# Run inference
entities = model.predict("We have Kanye West, Beyoncé, and Taylor Swift performing at the beachside park on the island of Maui.")
```

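`model.predict` returns the predicted entities for the input text. The sketch below assumes each prediction is a dict with `span`, `label` and `score` keys (check the output of your installed SpanMarker version) and uses hand-written sample predictions so that it runs without downloading the model. `filter_entities` and the 0.5 threshold are illustrative choices, not part of the library.

```python
from typing import Dict, List


def filter_entities(entities: List[Dict], min_score: float = 0.5) -> List[Dict]:
    """Keep predictions whose confidence score is at least min_score."""
    return [entity for entity in entities if entity["score"] >= min_score]


# Hand-written stand-in for the list returned by model.predict(...); the scores are made up.
sample_entities = [
    {"span": "Kanye West", "label": "person-artist/author", "score": 0.91},
    {"span": "beachside park", "label": "location-park", "score": 0.34},
    {"span": "Maui", "label": "location-island", "score": 0.78},
]

for entity in filter_entities(sample_entities):
    print(f'{entity["label"]:<22} {entity["span"]} ({entity["score"]:.2f})')
```

Raising the threshold trades recall for precision, so it is worth tuning on held-out data rather than fixing it at an arbitrary value.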
### Downstream Use
You can fine-tune this model on your own dataset.

<details><summary>Click to expand</summary>

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("YurtsAI/named_entity_recognition_document_context")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003") # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("YurtsAI/named_entity_recognition_document_context-finetuned")
```
</details>

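For fine-tuning, each training example pairs a list of words (`tokens`) with one label id per word (`ner_tags`). The sketch below builds such a record in plain Python from word-level span annotations. `LABELS`, `to_record` and the example sentence are assumptions made for illustration; replace the label list with the labels of your own dataset. Plain integer tags without a B-/I- prefix scheme are used for brevity, so consult the SpanMarker documentation for the labeling schemes it accepts.

```python
from typing import Dict, List, Tuple

# Hypothetical label list; index 0 is the "outside" label.
LABELS = ["O", "person-artist/author", "location-island"]


def to_record(tokens: List[str], spans: List[Tuple[int, int, str]]) -> Dict[str, list]:
    """Build a {"tokens", "ner_tags"} record from (start, end, label) word spans, end exclusive."""
    ner_tags = [LABELS.index("O")] * len(tokens)
    for start, end, label in spans:
        for i in range(start, end):
            ner_tags[i] = LABELS.index(label)
    return {"tokens": tokens, "ner_tags": ner_tags}


record = to_record(
    ["Taylor", "Swift", "performs", "on", "Maui", "."],
    [(0, 2, "person-artist/author"), (4, 5, "location-island")],
)
print(record["ner_tags"])  # [1, 1, 0, 0, 2, 0]
```

A list of such records can be converted with `datasets.Dataset.from_list(records)` before it is passed to the `Trainer`.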
## Training Details

### Training Set Metrics
| Training set          | Min | Median  | Max |
|:----------------------|:----|:--------|:----|
| Sentence length       | 1   | 18.4126 | 309 |
| Entities per sentence | 0   | 0.9794  | 5   |

### Training Hyperparameters
- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3

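Two of these values follow from the others: `total_train_batch_size` is `train_batch_size` times `gradient_accumulation_steps` (4 × 2 = 8), and a linear scheduler with a warmup ratio of 0.1 raises the learning rate from 0 to its peak over the first 10% of optimizer steps and then decays it linearly to 0. The sketch below reproduces that shape in plain Python. `TOTAL_STEPS = 3471` is an estimate inferred from the step and epoch columns of the training results (about 1157 steps per epoch over 3 epochs), not a logged value, and `linear_lr` is a helper written for this example.

```python
import math

LEARNING_RATE = 1e-05
TRAIN_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 2
WARMUP_RATIO = 0.1
TOTAL_STEPS = 3471  # assumption: estimated from the training results, not logged


def linear_lr(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at an optimizer step: linear warmup from 0, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return LEARNING_RATE * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * remaining / max(1, total_steps - warmup_steps)


print(TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS)  # 8
for step in (0, 348, 1000, 2000, 3471):
    print(step, f"{linear_lr(step):.2e}")
```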
### Training Results
| Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
| 0.4322 | 500  | 0.0503          | 0.0                  | 0.0               | 0.0           | 0.8898              |
| 0.8643 | 1000 | 0.0435          | 1.0                  | 0.0010            | 0.0020        | 0.8900              |
| 1.2965 | 1500 | 0.0383          | 0.2841               | 0.0254            | 0.0466        | 0.8908              |
| 1.7286 | 2000 | 0.0326          | 0.5556               | 0.0710            | 0.1259        | 0.8951              |
| 2.1608 | 2500 | 0.0294          | 0.5806               | 0.1826            | 0.2778        | 0.9032              |
| 2.5929 | 3000 | 0.0278          | 0.6259               | 0.2698            | 0.3770        | 0.9109              |

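One way to read the table is to look at how validation F1 changes between logged steps. The short sketch below copies the step and F1 columns and reports the best checkpoint and the gain per 500 steps; the variable names are chosen for this example only.

```python
# (step, validation F1) pairs copied from the training results table
history = [(500, 0.0), (1000, 0.0020), (1500, 0.0466), (2000, 0.1259), (2500, 0.2778), (3000, 0.3770)]

# Checkpoint with the highest validation F1.
best_step, best_f1 = max(history, key=lambda row: row[1])

# F1 gain between consecutive evaluations.
gains = [round(later[1] - earlier[1], 4) for earlier, later in zip(history, history[1:])]

print(best_step, best_f1)
print(gains)
```

F1 improves at every logged step and peaks at the last one, which hints that training had not yet plateaued after 3 epochs. This is a reading of the table, not a tested claim.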
### Framework Versions
- Python: 3.12.2
- SpanMarker: 1.5.0
- Transformers: 4.41.2
- PyTorch: 2.3.1
- Datasets: 2.20.0
- Tokenizers: 0.19.1

## Citation

### BibTeX
```
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```