YurtsAI/ner-document-context

	---
	base_model: roberta-base
	datasets:
	- YurtsAI/named_entity_recognition_document_context
	language:
	- en
	library_name: span-marker
	metrics:
	- precision
	- recall
	- f1
	pipeline_tag: token-classification
	tags:
	- span-marker
	- token-classification
	- ner
	- named-entity-recognition
	- generated_from_span_marker_trainer
	widget:
	- text: We have Kanye West, Beyoncé, and Taylor Swift performing at the beachside
	    park on the island of Maui.
	- text: This book, published by Epic Games and sponsored by the University of Hawaii,
	    features recipes inspired by the popular game League of Legends and a foreword
	    by renowned food scholar, Dr. Thomas Johnson, a professor at Harvard University.
	- text: The National Institute of Technology has partnered with CafeCorp to provide
	    a menu planning template for businesses in the downtown area.
	- text: The marketing efforts for the Chicago Bulls basketball team in Wrigley Park
	    were a huge success, with 80% of attendees speaking Spanish.
	- text: The most important thing was to try using the coconut oil from a tiny store
	    near the river, and a sprinkle of Japanese spices I learned from my friend who
	    speaks fluent Japanese.
	model-index:
	- name: SpanMarker with roberta-base on YurtsAI/named_entity_recognition_document_context
	  results:
	  - task:
	      type: token-classification
	      name: Named Entity Recognition
	    dataset:
	      name: Unknown
	      type: YurtsAI/named_entity_recognition_document_context
	      split: eval
	    metrics:
	    - type: f1
	      value: 0.3902777777777778
	      name: F1
	    - type: precision
	      value: 0.6189427312775331
	      name: Precision
	    - type: recall
	      value: 0.28498985801217036
	      name: Recall
	---

	# SpanMarker with roberta-base on YurtsAI/named_entity_recognition_document_context

	This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model for Named Entity Recognition, trained on the [YurtsAI/named_entity_recognition_document_context](https://huggingface.co/datasets/YurtsAI/named_entity_recognition_document_context) dataset. It uses [roberta-base](https://huggingface.co/roberta-base) as the underlying encoder.

	## Model Details

	### Model Description
	- Model Type: SpanMarker
	- Encoder: [roberta-base](https://huggingface.co/roberta-base)
	- Maximum Sequence Length: 256 tokens
	- Maximum Entity Length: 11 words
	- Training Dataset: [YurtsAI/named_entity_recognition_document_context](https://huggingface.co/datasets/YurtsAI/named_entity_recognition_document_context)
	- Language: en
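
To make the maximum entity length concrete: a span-based tagger only needs to consider candidate spans of at most that many words. The sketch below is a simplified illustration and not SpanMarker's internal code; `candidate_spans` is a hypothetical helper, and whitespace splitting stands in for real tokenization.

```python
def candidate_spans(words, max_entity_length=11):
    """Return (start, end) word-index pairs, end exclusive, for every span
    of at most max_entity_length words."""
    spans = []
    for start in range(len(words)):
        # A span may not run past the sentence end or exceed the length cap.
        longest_end = min(start + max_entity_length, len(words))
        for end in range(start + 1, longest_end + 1):
            spans.append((start, end))
    return spans


sentence = (
    "We have Kanye West, Beyoncé, and Taylor Swift performing "
    "at the beachside park on the island of Maui."
)
words = sentence.split()
spans = candidate_spans(words)
print(len(words), "words ->", len(spans), "candidate spans")
```

Under these assumptions, the cap keeps the number of candidates roughly linear in sentence length for long inputs, rather than quadratic.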
	<!-- - License: Unknown -->

	### Model Sources

	- Repository: [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
	- Thesis: [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

	### Model Labels
	| Label | Examples |
	|:-----------------------------------------|:------------------------------------------------------------------------------------------------------------------------------|
	| art-broadcastprogram | "television program", "Origin of the Gods", "reality show" |
	| art-film | "a video of a successful grant proposal", "'The Matrix '", "film crew" |
	| art-music | "a new album by Beyoncé", "Yesterday by The Beatles", "favorite music CD" |
	| art-other | "art therapy", "play", "Mona Lisa" |
	| art-painting | "vibrant street art scene", "through art", "painting" |
	| art-writtenart | "'The Lost Gods '", "Book 1", "environmental science book" |
	| building-airport | "airport", "major airport", "an airport" |
	| building-hospital | "New York hospital", "local hospital", "hospital" |
	| building-hotel | "hotel", "new hotel in Austin", "a giant hotel" |
	| building-library | "new library", "library", "new , state-of-the-art library" |
	| building-other | "10-story building", "headquarters building", "factory building" |
	| building-restaurant | "new restaurant", "our upscale restaurant", "restaurant" |
	| building-sportsfacility | "sports facility", "Union Park Sports Complex", "city 's sports center" |
	| building-theater | "the local theater", "theater in downtown", "theater" |
	| datetime-absolute | "January 10 , 2020", "January 17 , 2025 at 14:00", "March 25th" |
	| datetime-authored | "2023-02-22", "2019-04-15", "2020-02-15" |
	| datetime-range | "2010-2015", "Q4 2019", "Friday to Sunday" |
	| datetime-relative | "next week 's appointment", "last Saturday", "next week" |
	| event-attack/battle/war/militaryconflict | "attacks/wars", "The", "A" |
	| event-disaster | "My", "To", "disaster" |
	| event-election | "the election for the mayor", "upcoming election", "election season" |
	| event-other | "conference", "annual 4th of july BBQ", "charity gala" |
	| event-protest | "protest", "protest last saturday", "protest rally" |
	| event-sportsevent | "sports event", "annual tennis tournament", "biggest sports event of the year" |
	| location-bodiesofwater | "ocean", "Lake Como", "Lake Michigan" |
	| location-gpe | "Italy", "Texas", "city" |
	| location-island | "Island Radio", "Caribbean island", "island" |
	| location-mountain | "mountain terrain", "the mountain", "mountain" |
	| location-other | "low-lying areas of the city", "advertising hub", "backyard" |
	| location-park | "park", "location-park", "the park" |
	| location-road/railway/highway/transit | "Greyhound network", "road", "train journey" |
	| organization-company | "local company", "Verizon", "a company" |
	| organization-education | "Harvard University", "UW", "University of Arizona" |
	| organization-government/governmentagency | "Red Cross", "local government", "SEC" |
	| organization-media/newspaper | "The New York Times", "media organizations", "Army Times" |
	| organization-other | "Cognizant", "Better World Foundation", "conservation organization" |
	| organization-politicalparty | "Spaceship of Progress Party", "Libertarian Party", "Green Party" |
	| organization-religion | "local church", "the power of prayer", "diamatists" |
	| organization-showorganization | "Royal Shakespeare Company", "Earth 's Edge Theater Company", "Cosmic Theater group" |
	| organization-sportsleague | "International Swimming Federation", "NBA league", "NFL" |
	| organization-sportsteam | "soccer team", "Syracuse Orange football team", "Seattle Seahawks" |
	| other-astronomything | "latest discoveries in the field of astronomy", "Galactic Conference Best Recipe Award-winning recipe book", "astronomy camp" |
	| other-award | "other-award", "annual tech show awards", "Nobel Peace Prize" |
	| other-biologything | "salmon 's gene for cold adaptation", "terrain", "the forces that drive you" |
	| other-chemicalthing | "Overall", "The", "In" |
	| other-currency | "US dollars", "Japanese Yen", "$ 500,000" |
	| other-disease | "malaria", "type 1 diabetes", "the common cold" |
	| other-educationaldegree | "master 's degree", "thesis", "Ph.D in food science" |
	| other-god | "Peter Pan", "divine", "Zeus the god" |
	| other-language | "English", "Amharic", "Sanskrit" |
	| other-law | "legislation", "professorial separation laws", "Clean Air Act" |
	| other-livingthing | "We", "To", "flowers" |
	| other-medical | "antibiotics", "medical treatment", "necessary testing protocols" |
	| person-actor | "Emma Stone", "Dr. Steven Spielberg", "Jennifer Lawrence" |
	| person-artist/author | "Chuck Close", "artist 's new album", "Jane Smith" |
	| person-athlete | "athlete friend", "LeBron James", "John and Sally" |
	| person-director | "John Oliver", "favorite director", "Dr. Johnson" |
	| person-other | "your", "HR representative", "therapist or counselor" |
	| person-politician | "To", "At", "Secretary of State" |
	| person-scholar | "Dr. John Smith", "Dr. Johnson", "a scholar of comparative religion" |
	| person-soldier | "veterans", "the brave soldiers", "a soldier" |
	| product-airplane | "Cessna 172", "company 's fleet of private airplanes", "airline" |
	| product-car | "leased car", "your car", "car" |
	| product-food | "StarBites", "food truck business", "ice cream" |
	| product-game | "the 'Train to Nowhere ' game", "board game", "screen protector" |
	| product-other | "new medicine", "acting software", "table" |
	| product-ship | "research ship", "ship", "a ship" |
	| product-software | "software", "instruction manual", "pizza ordering app" |
	| product-train | "Universal Sonicator", "train", "the train" |
	| product-weapon | "Flip Flops", "Sno Blaster", "SecurityFirst" |
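
The label names in the table appear to follow a `coarse-fine` pattern, with nine coarse groups (art, building, datetime, event, location, organization, other, person, product). A minimal sketch for splitting a label into its two parts, which can help when aggregating predictions by coarse type; `split_label` is a hypothetical helper and not part of SpanMarker.

```python
from collections import defaultdict


def split_label(label):
    """Split 'coarse-fine' into ('coarse', 'fine') on the first hyphen only."""
    coarse, _, fine = label.partition("-")
    return coarse, fine


# A handful of labels copied from the table above.
labels = [
    "person-actor",
    "person-artist/author",
    "location-gpe",
    "organization-government/governmentagency",
    "event-attack/battle/war/militaryconflict",
]

by_coarse = defaultdict(list)
for label in labels:
    coarse, fine = split_label(label)
    by_coarse[coarse].append(fine)

print(dict(by_coarse))
```

Splitting on the first hyphen only keeps fine-grained names that contain slashes, such as `artist/author`, intact.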

	## Evaluation

	### Metrics
	| Label | Precision | Recall | F1 |
	|:-----------------------------------------|:----------|:-------|:-------|
	| all | 0.6189 | 0.2850 | 0.3903 |
	| art-broadcastprogram | 0.0 | 0.0 | 0.0 |
	| art-film | 0.0 | 0.0 | 0.0 |
	| art-music | 0.6667 | 0.2 | 0.3077 |
	| art-other | 0.0 | 0.0 | 0.0 |
	| art-painting | 0.0 | 0.0 | 0.0 |
	| art-writtenart | 0.0 | 0.0 | 0.0 |
	| building-airport | 0.7143 | 0.7692 | 0.7407 |
	| building-hospital | 0.6667 | 0.7778 | 0.7179 |
	| building-hotel | 0.7857 | 0.6875 | 0.7333 |
	| building-library | 0.8182 | 0.75 | 0.7826 |
	| building-other | 0.0 | 0.0 | 0.0 |
	| building-restaurant | 0.8571 | 0.375 | 0.5217 |
	| building-sportsfacility | 0.6667 | 0.5 | 0.5714 |
	| building-theater | 0.9 | 0.5625 | 0.6923 |
	| datetime-absolute | 0.3333 | 0.0769 | 0.125 |
	| datetime-authored | 0.55 | 0.8462 | 0.6667 |
	| datetime-range | 0.75 | 0.5 | 0.6 |
	| datetime-relative | 0.0 | 0.0 | 0.0 |
	| event-attack/battle/war/militaryconflict | 0.8 | 0.2857 | 0.4211 |
	| event-disaster | 0.5385 | 0.5 | 0.5185 |
	| event-election | 0.75 | 0.5 | 0.6 |
	| event-other | 0.0 | 0.0 | 0.0 |
	| event-protest | 0.5455 | 0.4615 | 0.5000 |
	| event-sportsevent | 0.625 | 0.3846 | 0.4762 |
	| location-bodiesofwater | 0.8333 | 0.3571 | 0.5 |
	| location-gpe | 0.375 | 0.2143 | 0.2727 |
	| location-island | 0.7143 | 0.3333 | 0.4545 |
	| location-mountain | 0.5882 | 0.625 | 0.6061 |
	| location-other | 0.0 | 0.0 | 0.0 |
	| location-park | 0.6667 | 0.5 | 0.5714 |
	| location-road/railway/highway/transit | 0.8 | 0.5333 | 0.64 |
	| organization-company | 0.0 | 0.0 | 0.0 |
	| organization-education | 0.3077 | 0.2857 | 0.2963 |
	| organization-government/governmentagency | 0.25 | 0.0909 | 0.1333 |
	| organization-media/newspaper | 0.5833 | 0.4667 | 0.5185 |
	| organization-other | 1.0 | 0.0769 | 0.1429 |
	| organization-politicalparty | 0.75 | 0.2727 | 0.4000 |
	| organization-religion | 1.0 | 0.3077 | 0.4706 |
	| organization-showorganization | 0.75 | 0.25 | 0.375 |
	| organization-sportsleague | 0.8571 | 0.4286 | 0.5714 |
	| organization-sportsteam | 0.4286 | 0.5 | 0.4615 |
	| other-astronomything | 0.0 | 0.0 | 0.0 |
	| other-award | 1.0 | 0.2143 | 0.3529 |
	| other-biologything | 0.0 | 0.0 | 0.0 |
	| other-chemicalthing | 0.4 | 0.3077 | 0.3478 |
	| other-currency | 1.0 | 0.2143 | 0.3529 |
	| other-disease | 0.5714 | 0.3077 | 0.4 |
	| other-educationaldegree | 0.5833 | 0.5833 | 0.5833 |
	| other-god | 0.8 | 0.2222 | 0.3478 |
	| other-language | 0.8 | 0.2857 | 0.4211 |
	| other-law | 0.6667 | 0.5 | 0.5714 |
	| other-livingthing | 0.0 | 0.0 | 0.0 |
	| other-medical | 0.0 | 0.0 | 0.0 |
	| person-actor | 0.3448 | 0.5 | 0.4082 |
	| person-artist/author | 0.6667 | 0.1429 | 0.2353 |
	| person-athlete | 0.6667 | 0.2353 | 0.3478 |
	| person-director | 0.2 | 0.0714 | 0.1053 |
	| person-other | 0.0 | 0.0 | 0.0 |
	| person-politician | 0.6667 | 0.0952 | 0.1667 |
	| person-scholar | 0.4118 | 0.4667 | 0.4375 |
	| person-soldier | 0.0 | 0.0 | 0.0 |
	| product-airplane | 0.75 | 0.3333 | 0.4615 |
	| product-car | 1.0 | 0.2143 | 0.3529 |
	| product-food | 0.0 | 0.0 | 0.0 |
	| product-game | 1.0 | 0.1333 | 0.2353 |
	| product-other | 0.5 | 0.0909 | 0.1538 |
	| product-ship | 0.75 | 0.3 | 0.4286 |
	| product-software | 1.0 | 0.4167 | 0.5882 |
	| product-train | 0.5556 | 0.3571 | 0.4348 |
	| product-weapon | 0.3333 | 0.0625 | 0.1053 |
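
Each F1 value in the table is the harmonic mean of the precision and recall in the same row. A minimal sketch of that calculation, using the overall precision and recall from this card's metadata; `f1_score` is a hypothetical helper written for illustration, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall ("all") precision and recall as reported in the card's metadata.
overall = f1_score(0.6189427312775331, 0.28498985801217036)
# A label for which nothing was predicted correctly.
zero_case = f1_score(0.0, 0.0)

print(round(overall, 4), zero_case)
```

Because the harmonic mean penalizes imbalance, the overall F1 sits closer to the lower recall value than to the higher precision value.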

	## Uses

	### Direct Use for Inference

	```python
	from span_marker import SpanMarkerModel

	# Download from the 🤗 Hub (model repository ID, not the dataset ID)
	model = SpanMarkerModel.from_pretrained("YurtsAI/ner-document-context")
	# Run inference; the result is a list of predicted entities
	entities = model.predict("We have Kanye West, Beyoncé, and Taylor Swift performing at the beachside park on the island of Maui.")
	```
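
For a single input string, `model.predict` returns a list of dictionaries. The sketch below assumes each entry carries `span`, `label` and `score` keys, as described in SpanMarker's documentation. The sample predictions are made up for illustration and are not real output from this model, and `filter_entities` is a hypothetical helper.

```python
def filter_entities(entities, min_score=0.5):
    """Keep predictions at or above min_score, highest score first."""
    kept = [entity for entity in entities if entity["score"] >= min_score]
    return sorted(kept, key=lambda entity: entity["score"], reverse=True)


# Made-up predictions in the assumed output shape; not real model output.
sample_entities = [
    {"span": "Maui", "label": "location-island", "score": 0.84},
    {"span": "Kanye West", "label": "person-artist/author", "score": 0.91},
    {"span": "beachside park", "label": "location-park", "score": 0.32},
]

for entity in filter_entities(sample_entities):
    print(f'{entity["span"]!r} -> {entity["label"]} ({entity["score"]:.2f})')
```

Given the gap between precision and recall reported for this model, a score threshold like this trades even more recall for precision, so the value of `min_score` is worth tuning on held-out data.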

	### Downstream Use
	You can fine-tune this model on your own dataset.

	<details><summary>Click to expand</summary>

	```python
	from datasets import load_dataset
	from span_marker import SpanMarkerModel, Trainer

	# Download from the 🤗 Hub (model repository ID, not the dataset ID)
	model = SpanMarkerModel.from_pretrained("YurtsAI/ner-document-context")

	# Specify a Dataset with "tokens" and "ner_tags" columns
	dataset = load_dataset("conll2003")  # For example CoNLL2003

	# Initialize a Trainer using the pretrained model & dataset
	trainer = Trainer(
	    model=model,
	    train_dataset=dataset["train"],
	    eval_dataset=dataset["validation"],
	)
	trainer.train()
	# Save the fine-tuned model to a local directory
	trainer.save_model("YurtsAI/ner-document-context-finetuned")
	```
	</details>
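
For fine-tuning, the training data is expected as word-level tokens with one tag per token, in columns named `tokens` and `ner_tags` (the column names used by CoNLL2003). A minimal sketch of that shape with a sanity check; the label inventory and sentences are hypothetical, and `check_alignment` is an illustrative helper rather than part of SpanMarker.

```python
# Hypothetical label inventory; a real dataset defines its own ids.
label_names = ["O", "person-actor", "location-island"]

# Column-oriented records: one list of tokens and one list of tag ids per sentence.
records = {
    "tokens": [
        ["Emma", "Stone", "visited", "Maui", "."],
        ["No", "entities", "here", "."],
    ],
    "ner_tags": [
        [1, 1, 0, 2, 0],
        [0, 0, 0, 0],
    ],
}


def check_alignment(records, num_labels):
    """Raise ValueError unless every sentence has one valid tag id per token."""
    pairs = zip(records["tokens"], records["ner_tags"])
    for index, (tokens, tags) in enumerate(pairs):
        if len(tokens) != len(tags):
            raise ValueError(f"sentence {index}: {len(tokens)} tokens but {len(tags)} tags")
        if any(tag < 0 or tag >= num_labels for tag in tags):
            raise ValueError(f"sentence {index}: tag id out of range")
    return True


print(check_alignment(records, len(label_names)))
```

A dictionary of this shape can be wrapped with `datasets.Dataset.from_dict(records)`; the label names would still need to be supplied alongside it, for example through the dataset's features.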

	## Training Details

	### Training Set Metrics
	| Training set | Min | Median | Max |
	|:----------------------|:----|:--------|:----|
	| Sentence length | 1 | 18.4126 | 309 |
	| Entities per sentence | 0 | 0.9794 | 5 |

	### Training Hyperparameters
	- learning_rate: 1e-05
	- train_batch_size: 4
	- eval_batch_size: 4
	- seed: 42
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 8
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 3
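
The hyperparameters above combine as follows: the effective batch size is the per-device batch size times the gradient accumulation steps, and a linear scheduler with a warmup ratio ramps the learning rate up and then decays it to zero. The sketch below is a plain-Python approximation, not the trainer's own code. The total step count (about 1157 optimizer steps per epoch, estimated from the step and epoch values in the training log) and the ceiling rounding of the warmup steps are assumptions.

```python
import math


def linear_schedule_lr(step, peak_lr, total_steps, warmup_ratio):
    """Learning rate under linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


train_batch_size = 4
gradient_accumulation_steps = 2
# Matches total_train_batch_size in the list above.
effective_batch_size = train_batch_size * gradient_accumulation_steps

# Assumption: roughly 1157 optimizer steps per epoch over 3 epochs.
total_steps = 3 * 1157

for step in (0, 174, 348, 1000, 2000, 3000, total_steps):
    lr = linear_schedule_lr(step, 1e-05, total_steps, 0.1)
    print(step, f"{lr:.2e}")
```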

	### Training Results
	| Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
	|:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
	| 0.4322 | 500 | 0.0503 | 0.0 | 0.0 | 0.0 | 0.8898 |
	| 0.8643 | 1000 | 0.0435 | 1.0 | 0.0010 | 0.0020 | 0.8900 |
	| 1.2965 | 1500 | 0.0383 | 0.2841 | 0.0254 | 0.0466 | 0.8908 |
	| 1.7286 | 2000 | 0.0326 | 0.5556 | 0.0710 | 0.1259 | 0.8951 |
	| 2.1608 | 2500 | 0.0294 | 0.5806 | 0.1826 | 0.2778 | 0.9032 |
	| 2.5929 | 3000 | 0.0278 | 0.6259 | 0.2698 | 0.3770 | 0.9109 |
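
The epoch and step columns make it possible to estimate how many optimizer steps one epoch took, and therefore roughly how long the full three-epoch run was. The sketch below does that arithmetic; the result is an estimate inferred from the logged values, not a figure reported by the authors.

```python
from statistics import median

# (epoch, step) pairs copied from the table above.
checkpoints = [
    (0.4322, 500),
    (0.8643, 1000),
    (1.2965, 1500),
    (1.7286, 2000),
    (2.1608, 2500),
    (2.5929, 3000),
]

# Each row gives one estimate of steps per epoch; take the median and round.
steps_per_epoch = round(median(step / epoch for epoch, step in checkpoints))
num_epochs = 3
estimated_total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, estimated_total_steps)
```

If this estimate holds, training continued for a few hundred steps past the last logged checkpoint, which would be consistent with the final evaluation F1 (0.3903) being slightly above the value in the last row (0.3770).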

	### Framework Versions
	- Python: 3.12.2
	- SpanMarker: 1.5.0
	- Transformers: 4.41.2
	- PyTorch: 2.3.1
	- Datasets: 2.20.0
	- Tokenizers: 0.19.1

	## Citation

	### BibTeX
	```
	@software{Aarsen_SpanMarker,
	author = {Aarsen, Tom},
	license = {Apache-2.0},
	title = {{SpanMarker for Named Entity Recognition}},
	url = {https://github.com/tomaarsen/SpanMarkerNER}
	}
	```
