	---
	language:
	- tl
	license: gpl-3.0
	library_name: span-marker
	tags:
	- span-marker
	- token-classification
	- ner
	- named-entity-recognition
	- generated_from_span_marker_trainer
	datasets:
	- ljvmiranda921/tlunified-ner
	metrics:
	- precision
	- recall
	- f1
	widget:
	- text: MANILA - Binalewala ng Philippine National Police (PNP) nitong Sabado ang
	    posibleng paglulunsad ng tinatawag na " sympathy attacks " ng Moro National Liberation
	    Front (MNLF) at Abu Sayyaf matapos arestuhin si Indanan, Sulu Mayor Alvarez Isnaji.
	- text: Pinatawan din ng apat na buwang suspensyon si Herma Gonzales - Escudero, chief
	    revenue officer III ng BIR - Cotabato City, dahil sa kasong dishonesty at limang
	    kaso ng perjury sa Municipal Trial Court ng Cotabato City . Bunga ito ng kanyang
	    kabiguan na ideklara sa kanyang SALN noong 2002 - 2004 ang 200 metro kwadradong
	    lote sa South Cotabato at Toyota Revo noong 2001 SALN at undervaluation ng kanyang
	    mga ari - arian sa lalawigan noong 2000 - 2004 SALN.
	- text: Sa tila pagpapabaya sa mga magsasaka, sinabi ni Escudero na hindi mangyayari
	    ang pangarap ng Department of Agriculture (DA) na maging self - sufficient ang
	    Pilipinas sa bigas.
	- text: MANILA - Tiniyak ng pinuno ng Government Service Insurance System (GSIS) na
	    tatapatan nito ang pro - Meralco advertisement ni Judy Ann Santos upang isulong
	    ang kanyang posisyon na dapat ibaba ang singil sa kuryente.
	- text: Idinagdag ni South Cotabato Rep Darlene Antonino - Custodio, na illegal na
	    ipagpaliban ang halalan sa ARMM kung ang gagamitin lamang basehan ay ang ipapasang
	    panukala ng Kongreso.
	pipeline_tag: token-classification
	co2_eq_emissions:
	  emissions: 17.80725395240375
	  source: codecarbon
	  training_type: fine-tuning
	  on_cloud: false
	  cpu_model: 13th Gen Intel(R) Core(TM) i7-13700K
	  ram_total_size: 31.777088165283203
	  hours_used: 0.142
	  hardware_used: 1 x NVIDIA GeForce RTX 3090
	base_model: jcblaise/roberta-tagalog-base
	model-index:
	- name: SpanMarker with jcblaise/roberta-tagalog-base on TLUnified
	  results:
	  - task:
	      type: token-classification
	      name: Named Entity Recognition
	    dataset:
	      name: TLUnified
	      type: ljvmiranda921/tlunified-ner
	      split: test
	    metrics:
	    - type: f1
	      value: 0.8962499999999999
	      name: F1
	    - type: precision
	      value: 0.8830049261083743
	      name: Precision
	    - type: recall
	      value: 0.9098984771573604
	      name: Recall
	---

	# SpanMarker with jcblaise/roberta-tagalog-base on TLUnified

	This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [TLUnified](https://huggingface.co/datasets/ljvmiranda921/tlunified-ner) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [jcblaise/roberta-tagalog-base](https://huggingface.co/jcblaise/roberta-tagalog-base) as the underlying encoder.

	## Model Details

	### Model Description
	- Model Type: SpanMarker
	- Encoder: [jcblaise/roberta-tagalog-base](https://huggingface.co/jcblaise/roberta-tagalog-base)
	- Maximum Sequence Length: 256 tokens
	- Maximum Entity Length: 8 words
	- Training Dataset: [TLUnified](https://huggingface.co/datasets/ljvmiranda921/tlunified-ner)
	- Language: tl
	- License: gpl-3.0
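
The maximum entity length limits which spans can be labeled at all: an entity longer than 8 words cannot be predicted as a single span. The snippet below is a simplified illustration of that limit, assuming a plain span-enumeration view of the input. `candidate_spans` is a hypothetical helper written for this example and is not part of the SpanMarker API, and the word list is a fragment adapted from one of the widget sentences.

```python
def candidate_spans(words, max_entity_length=8):
    """Return (start, end) word-index pairs, end exclusive, for every
    span that contains at most `max_entity_length` words."""
    spans = []
    for start in range(len(words)):
        # The span may not run past the sentence or exceed the length limit.
        last_end = min(start + max_entity_length, len(words))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end))
    return spans


# Illustrative fragment only; real inputs are full sentences.
words = "sinabi ni Escudero na hindi mangyayari ang pangarap ng Department of Agriculture".split()
spans = candidate_spans(words)

# "Department of Agriculture" covers words 9..11, well within the 8-word limit.
print((9, 12) in spans, len(spans))
```

With the default limit, a 12-word input yields far fewer candidates than the 78 spans an unrestricted enumeration would produce, which is the practical reason for capping the length.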

	### Model Sources

	- Repository: [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
	- Thesis: [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

	### Model Labels
	| Label | Examples                                                                                           |
	|:------|:---------------------------------------------------------------------------------------------------|
	| LOC   | "Batasan", "United States", "Israel"                                                               |
	| ORG   | "MMDA", "International Monitoring Team", "Coordinating Committees for the Cessation of Hostilities" |
	| PER   | "Villavicencio", "Puno", "Fernando"                                                                |

	## Evaluation

	### Metrics
	| Label   | Precision | Recall | F1     |
	|:--------|:----------|:-------|:-------|
	| all     | 0.8830    | 0.9099 | 0.8962 |
	| LOC     | 0.8831    | 0.9293 | 0.9056 |
	| ORG     | 0.7948    | 0.8476 | 0.8204 |
	| PER     | 0.9235    | 0.9280 | 0.9257 |
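
The F1 column is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes it from the other two columns. `f1_score` is a local helper defined for this example, not a SpanMarker function, and the inputs are the rounded values from the table, so small differences in the last digit are expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the metrics table.
rows = {
    "all": (0.8830, 0.9099, 0.8962),
    "LOC": (0.8831, 0.9293, 0.9056),
    "ORG": (0.7948, 0.8476, 0.8204),
    "PER": (0.9235, 0.9280, 0.9257),
}

for label, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{label}: computed {computed:.4f}, reported {reported:.4f}")
```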

	## Uses

	### Direct Use for Inference

	```python
	from span_marker import SpanMarkerModel

	# Download from the 🤗 Hub
	model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-tagalog-base-tlunified")
	# Run inference
	entities = model.predict("Idinagdag ni South Cotabato Rep Darlene Antonino - Custodio, na illegal na ipagpaliban ang halalan sa ARMM kung ang gagamitin lamang basehan ay ang ipapasang panukala ng Kongreso.")
	```
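
`model.predict` returns one dictionary per detected entity. The sketch below shows one way to post-process such a list, assuming the keys `span`, `label`, `score`, `char_start_index` and `char_end_index` used by SpanMarker. The entities in `sample_entities` are hand-written stand-ins with invented labels and scores, not actual output of this model, and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict

text = (
    "Idinagdag ni South Cotabato Rep Darlene Antonino - Custodio, na illegal na "
    "ipagpaliban ang halalan sa ARMM kung ang gagamitin lamang basehan ay ang "
    "ipapasang panukala ng Kongreso."
)


def make_entity(span, label, score):
    """Build a prediction-shaped dict; offsets are derived from `text`."""
    start = text.index(span)
    return {
        "span": span,
        "label": label,
        "score": score,
        "char_start_index": start,
        "char_end_index": start + len(span),
    }


# Invented labels and scores, for illustration only.
sample_entities = [
    make_entity("South Cotabato", "LOC", 0.98),
    make_entity("Darlene Antonino - Custodio", "PER", 0.97),
    make_entity("Kongreso", "ORG", 0.62),
]


def group_by_label(entities, min_score=0.0):
    """Collect entity texts per label, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


print(group_by_label(sample_entities, min_score=0.9))
```

To use it on real predictions, pass the `entities` list from the inference snippet in place of `sample_entities`. The 0.9 threshold is arbitrary and should be tuned on held-out data.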

	### Downstream Use
	You can finetune this model on your own dataset.

	<details><summary>Click to expand</summary>

	```python
	from datasets import load_dataset
	from span_marker import SpanMarkerModel, Trainer

	# Download from the 🤗 Hub
	model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-tagalog-base-tlunified")

	# Specify a Dataset with "tokens" and "ner_tags" columns
	dataset = load_dataset("conll2003")  # For example CoNLL2003

	# Initialize a Trainer using the pretrained model & dataset
	trainer = Trainer(
	    model=model,
	    train_dataset=dataset["train"],
	    eval_dataset=dataset["validation"],
	)
	# Fine-tune, then save the resulting model to a local directory
	trainer.train()
	trainer.save_model("tomaarsen/span-marker-roberta-tagalog-base-tlunified-finetuned")
	```
	</details>
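
For a custom dataset, each row needs a list of words in `tokens` and a parallel list of integer labels in `ner_tags`, as in CoNLL2003. The following is a minimal sketch of building such a row from word-level annotations. It assumes an IOB2 tagging scheme and a label list chosen for this example; `LABELS` and `spans_to_iob2` are hypothetical and not part of SpanMarker, and the annotated fragment is illustrative.

```python
# Assumed label inventory in IOB2 form; the order defines the integer ids.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
LABEL_TO_ID = {label: index for index, label in enumerate(LABELS)}


def spans_to_iob2(tokens, spans):
    """Convert (start, end, label) word spans, end exclusive, into a row
    with "tokens" and integer "ner_tags" columns."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first word of the entity
        for index in range(start + 1, end):
            tags[index] = f"I-{label}"      # remaining words of the entity
    return {"tokens": tokens, "ner_tags": [LABEL_TO_ID[tag] for tag in tags]}


row = spans_to_iob2(
    ["pangarap", "ng", "Department", "of", "Agriculture"],
    [(2, 5, "ORG")],
)
print(row)
```

Rows of this shape can then be collected into a dataset and passed to the `Trainer` as `train_dataset` and `eval_dataset`.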

	## Training Details

	### Training Set Metrics
	| Training set          | Min | Median  | Max |
	|:----------------------|:----|:--------|:----|
	| Sentence length       | 1   | 31.7625 | 150 |
	| Entities per sentence | 0   | 2.0661  | 38  |

	### Training Hyperparameters
	- learning_rate: 5e-05
	- train_batch_size: 32
	- eval_batch_size: 32
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 3
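
To make the scheduler settings concrete, here is a minimal sketch of a linear schedule with warmup. It assumes the common convention that the learning rate rises linearly from 0 to its peak during warmup and then decays linearly to 0. The total of 861 steps is an estimate inferred from the Training Results table (step 200 corresponds to epoch 0.6969, so roughly 287 steps per epoch); it is not a value stated in this card. `linear_lr` is a hypothetical helper, not a library function.

```python
import math

PEAK_LR = 5e-05       # learning_rate
WARMUP_RATIO = 0.1    # lr_scheduler_warmup_ratio
TOTAL_STEPS = 861     # assumption: about 287 steps per epoch x 3 epochs


def linear_lr(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: ramp up from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: ramp down from the peak to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 87, 200, 400, 600, 800, 861):
    print(step, f"{linear_lr(step):.2e}")
```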

	### Training Results
	| Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
	|:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
	| 0.6969 | 200  | 0.0083          | 0.8827               | 0.8628            | 0.8726        | 0.9762              |
	| 1.3937 | 400  | 0.0067          | 0.8881               | 0.8959            | 0.8920        | 0.9798              |
	| 2.0906 | 600  | 0.0069          | 0.8820               | 0.9040            | 0.8929        | 0.9800              |
	| 2.7875 | 800  | 0.0070          | 0.8757               | 0.9133            | 0.8941        | 0.9807              |

	### Environmental Impact
	Carbon emissions were measured using [CodeCarbon](https://github.com/mlco2/codecarbon).
	- Carbon Emitted: 0.018 kg of CO2
	- Hours Used: 0.142 hours
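
The rounded figures above follow from the raw values recorded in the card metadata. The sketch below reproduces the conversion, assuming the `emissions` field is expressed in grams of CO2-equivalent, which is consistent with the 0.018 kg reported here. The variable names are chosen for this example.

```python
EMISSIONS_G = 17.80725395240375  # co2_eq_emissions.emissions from the metadata
HOURS_USED = 0.142               # co2_eq_emissions.hours_used from the metadata

emissions_kg = EMISSIONS_G / 1000          # grams to kilograms
minutes_used = HOURS_USED * 60             # hours to minutes
grams_per_hour = EMISSIONS_G / HOURS_USED  # average emission rate during training

print(f"{emissions_kg:.3f} kg CO2eq over {minutes_used:.1f} minutes")
print(f"about {grams_per_hour:.0f} g CO2eq per hour")
```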

	### Training Hardware
	- On Cloud: No
	- GPU Model: 1 x NVIDIA GeForce RTX 3090
	- CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
	- RAM Size: 31.78 GB

	### Framework Versions
	- Python: 3.9.16
	- SpanMarker: 1.5.1.dev
	- Transformers: 4.30.0
	- PyTorch: 2.0.1+cu118
	- Datasets: 2.14.0
	- Tokenizers: 0.13.3

	## Citation

	### BibTeX
	```
	@software{Aarsen_SpanMarker,
	    author = {Aarsen, Tom},
	    license = {Apache-2.0},
	    title = {{SpanMarker for Named Entity Recognition}},
	    url = {https://github.com/tomaarsen/SpanMarkerNER}
	}
	```