
SpanMarker with bert-base-multilingual-cased on TLUnified

This is a SpanMarker model for Named Entity Recognition (NER) in Tagalog, trained on the TLUnified dataset. It uses bert-base-multilingual-cased as the underlying encoder.

Model Details

Model Description

  • Model Type: SpanMarker
  • Encoder: bert-base-multilingual-cased
  • Maximum Sequence Length: 256 tokens
  • Maximum Entity Length: 8 words
  • Training Dataset: TLUnified
  • Language: tl
  • License: gpl-3.0

Model Labels

Label Examples
LOC "Israel", "Batasan", "United States"
ORG "MMDA", "International Monitoring Team", "Coordinating Committees for the Cessation of Hostilities"
PER "Puno", "Fernando", "Villavicencio"

Evaluation

Metrics

Label Precision Recall F1
all 0.8737 0.9042 0.8887
LOC 0.8830 0.9084 0.8955
ORG 0.7579 0.8587 0.8052
PER 0.9264 0.9220 0.9242
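The F1 column is the harmonic mean of precision and recall, so the table can be checked by hand. The snippet below is a small sanity check on the reported numbers, not part of the evaluation pipeline; the helper name `f1_score` is introduced here for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the metrics table above.
reported = {
    "all": (0.8737, 0.9042, 0.8887),
    "LOC": (0.8830, 0.9084, 0.8955),
    "ORG": (0.7579, 0.8587, 0.8052),
    "PER": (0.9264, 0.9220, 0.9242),
}

for label, (p, r, f1) in reported.items():
    print(f"{label}: computed F1 = {f1_score(p, r):.4f}, reported = {f1:.4f}")
```

Because the table values are rounded to four decimals, small differences in the last digit would not indicate an error.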

Uses

Direct Use for Inference

from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-mbert-base-tlunified")
# Run inference
entities = model.predict("Idinagdag ni South Cotabato Rep Darlene Antonino - Custodio, na illegal na ipagpaliban ang halalan sa ARMM kung ang gagamitin lamang basehan ay ang ipapasang panukala ng Kongreso.")
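As far as the SpanMarker documentation describes it, `predict` on a single string returns a list of dictionaries with keys such as `span`, `label`, and `score` (plus character offsets). The sketch below shows one way to post-process such a list. The helper `group_by_label` and the sample entries are hypothetical: the spans are taken from the label examples in this card and the scores are made up, so this is not real model output.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.0):
    """Group predicted spans by label, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hand-written placeholder entries (NOT real model output); the dictionary
# keys follow the assumed shape of SpanMarker predictions.
sample_entities = [
    {"span": "Puno", "label": "PER", "score": 0.98},
    {"span": "MMDA", "label": "ORG", "score": 0.91},
    {"span": "Israel", "label": "LOC", "score": 0.42},
]

print(group_by_label(sample_entities, min_score=0.5))
```

In practice, `sample_entities` would be replaced by the `entities` list returned by `model.predict(...)`.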

Downstream Use

You can finetune this model on your own dataset.

from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-mbert-base-tlunified")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003") # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
# Save the finetuned model to a local directory
trainer.save_model("tomaarsen/span-marker-mbert-base-tlunified-finetuned")
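For a custom dataset, the trainer expects word-level rows with a `tokens` column and a matching `ner_tags` column of integer label ids. The sketch below builds one such row in plain Python. The label list, the helper `to_row`, and the example sentence are all hypothetical and chosen only to illustrate the shape; your own label scheme may differ.

```python
# Hypothetical label scheme, for illustration only.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
LABEL2ID = {label: index for index, label in enumerate(LABELS)}


def to_row(tagged_words):
    """Convert (word, label) pairs into a row with 'tokens' and 'ner_tags'."""
    tokens = [word for word, _ in tagged_words]
    ner_tags = [LABEL2ID[label] for _, label in tagged_words]
    return {"tokens": tokens, "ner_tags": ner_tags}


# Made-up example sentence; each word carries exactly one tag.
rows = [to_row([("Puno", "B-PER"), ("visited", "O"), ("Israel", "B-LOC")])]
print(rows[0])
```

If the `datasets` library is installed, `Dataset.from_list(rows)` should wrap rows like these into a dataset that can be passed as `train_dataset`.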

Training Details

Training Set Metrics

Training set Min Median Max
Sentence length 1 31.7625 150
Entities per sentence 0 2.0661 38

Training Hyperparameters

  • learning_rate: 5e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 3
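To make the `linear` scheduler with a warmup ratio of 0.1 concrete, the sketch below computes the learning rate at a given step: it rises linearly from 0 to 5e-05 during warmup and then decays linearly to 0. This is a plain-Python illustration of that schedule shape, not the training code itself. The total of 1764 steps is an estimate inferred from the Training Results table in this card (step 400 at epoch 0.6803 suggests about 588 steps per epoch), and rounding the warmup length up is also an assumption.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    # Assumption: the warmup length is rounded up to a whole number of steps.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumption: about 588 steps per epoch x 3 epochs, estimated from the
# Training Results table rather than stated in the card.
TOTAL_STEPS = 1764

for step in (0, 100, 177, 800, 1600, 1764):
    print(step, f"{linear_schedule_lr(step, TOTAL_STEPS):.2e}")
```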

Training Results

Epoch Step Validation Loss Validation Precision Validation Recall Validation F1 Validation Accuracy
0.6803 400 0.0074 0.8552 0.8835 0.8691 0.9774
1.3605 800 0.0072 0.8709 0.9034 0.8869 0.9798
2.0408 1200 0.0070 0.8753 0.9053 0.8900 0.9812
2.7211 1600 0.0065 0.8876 0.9003 0.8939 0.9807

Environmental Impact

Carbon emissions were measured using CodeCarbon.

  • Carbon Emitted: 0.022 kg of CO2
  • Hours Used: 0.238 hours

Training Hardware

  • On Cloud: No
  • GPU Model: 1 x NVIDIA GeForce RTX 3090
  • CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
  • RAM Size: 31.78 GB

Framework Versions

  • Python: 3.9.16
  • SpanMarker: 1.5.1.dev
  • Transformers: 4.30.0
  • PyTorch: 2.0.1+cu118
  • Datasets: 2.14.0
  • Tokenizers: 0.13.3

Citation

BibTeX

@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}