SpanMarker with jcblaise/roberta-tagalog-base on TLUnified

This is a SpanMarker model trained on the TLUnified dataset that can be used for Named Entity Recognition. This SpanMarker model uses jcblaise/roberta-tagalog-base as the underlying encoder.

Model Details

Model Description

Model Sources

Model Labels

Label Examples
LOC "Batasan", "United States", "Israel"
ORG "MMDA", "International Monitoring Team", "Coordinating Committees for the Cessation of Hostilities"
PER "Villavicencio", "Puno", "Fernando"

Evaluation

Metrics

Label Precision Recall F1
all 0.8830 0.9099 0.8962
LOC 0.8831 0.9293 0.9056
ORG 0.7948 0.8476 0.8204
PER 0.9235 0.9280 0.9257
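
The F1 column is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes F1 from the precision and recall values reported above. The helper name `f1_score` is illustrative and not part of SpanMarker. Because the inputs are already rounded to four decimals, small differences in the last digit would be expected in general.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the metrics table above
reported = {
    "all": (0.8830, 0.9099, 0.8962),
    "LOC": (0.8831, 0.9293, 0.9056),
    "ORG": (0.7948, 0.8476, 0.8204),
    "PER": (0.9235, 0.9280, 0.9257),
}

# Recompute F1 per label and round to the table's precision
recomputed = {
    label: round(f1_score(p, r), 4) for label, (p, r, _) in reported.items()
}

for label, (_, _, f1) in reported.items():
    print(f"{label}: reported={f1:.4f} recomputed={recomputed[label]:.4f}")
```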

Uses

Direct Use for Inference

from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-tagalog-base-tlunified")
# Run inference
entities = model.predict("Idinagdag ni South Cotabato Rep Darlene Antonino - Custodio, na illegal na ipagpaliban ang halalan sa ARMM kung ang gagamitin lamang basehan ay ang ipapasang panukala ng Kongreso.")
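
The returned `entities` can then be post-processed. The sketch below assumes each prediction is a dictionary with at least "span", "label" and "score" keys, which is how SpanMarker predictions are typically structured. The helper `group_by_label` and the sample list are hypothetical, and the scores are placeholders rather than real model output.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.0):
    """Group predicted spans by label, dropping predictions below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hypothetical placeholder predictions, NOT actual output of this model
example_entities = [
    {"span": "South Cotabato", "label": "LOC", "score": 0.90},
    {"span": "Darlene Antonino - Custodio", "label": "PER", "score": 0.95},
    {"span": "ARMM", "label": "LOC", "score": 0.40},
    {"span": "Kongreso", "label": "ORG", "score": 0.85},
]

# Keep only predictions with a score of at least 0.5
grouped = group_by_label(example_entities, min_score=0.5)
print(grouped)
```

With a real model, `example_entities` would be replaced by the list returned from `model.predict(...)`.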

Downstream Use

You can finetune this model on your own dataset.

from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-tagalog-base-tlunified")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003") # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("tomaarsen/span-marker-roberta-tagalog-base-tlunified-finetuned")

Training Details

Training Set Metrics

Training set Min Median Max
Sentence length 1 31.7625 150
Entities per sentence 0 2.0661 38

Training Hyperparameters

  • learning_rate: 5e-05
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 3
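
To illustrate how `learning_rate`, `lr_scheduler_type: linear` and `lr_scheduler_warmup_ratio: 0.1` interact, here is a minimal sketch of a linear warmup followed by linear decay. The function `linear_warmup_lr` is illustrative, not the trainer's actual implementation. The total of 861 steps is an assumption, estimated from the training results (step 200 at roughly epoch 0.6969, so about 287 steps per epoch over 3 epochs).

```python
import math


def linear_warmup_lr(step, total_steps, base_lr=5e-05, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup from 0, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale up linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale down linearly from base_lr to 0
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 861  # assumption: ~287 steps per epoch x 3 epochs

for step in (0, 43, 87, 400, 861):
    print(step, linear_warmup_lr(step, TOTAL_STEPS))
```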

Training Results

Epoch Step Validation Loss Validation Precision Validation Recall Validation F1 Validation Accuracy
0.6969 200 0.0083 0.8827 0.8628 0.8726 0.9762
1.3937 400 0.0067 0.8881 0.8959 0.8920 0.9798
2.0906 600 0.0069 0.8820 0.9040 0.8929 0.9800
2.7875 800 0.0070 0.8757 0.9133 0.8941 0.9807

Environmental Impact

Carbon emissions were measured using CodeCarbon.

  • Carbon Emitted: 0.018 kg of CO2
  • Hours Used: 0.142 hours

Training Hardware

  • On Cloud: No
  • GPU Model: 1 x NVIDIA GeForce RTX 3090
  • CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
  • RAM Size: 31.78 GB

Framework Versions

  • Python: 3.9.16
  • SpanMarker: 1.5.1.dev
  • Transformers: 4.30.0
  • PyTorch: 2.0.1+cu118
  • Datasets: 2.14.0
  • Tokenizers: 0.13.3

Citation

BibTeX

@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}