SpanMarker with bert-base-cased on conll2002
This is a SpanMarker model for Named Entity Recognition, trained on the conll2002 dataset with bert-base-cased as the underlying encoder.
Model Details
Model Description
- Model Type: SpanMarker
- Encoder: bert-base-cased
- Maximum Sequence Length: 256 tokens
- Maximum Entity Length: 8 words
- Training Dataset: conll2002
- Language: es
- License: cc-by-sa-4.0
Model Labels
| Label | Examples |
|:------|:---------|
| LOC | "Victoria", "Australia", "Melbourne" |
| MISC | "Ley", "Ciudad", "CrimeNet" |
| ORG | "Tribunal Supremo", "EFE", "Commonwealth" |
| PER | "Abogado General del Estado", "Daryl Williams", "Abogado General" |
Evaluation
Metrics
| Label | Precision | Recall | F1 |
|:------|:----------|:-------|:---|
| all | 0.8331 | 0.8074 | 0.8201 |
| LOC | 0.8471 | 0.7759 | 0.8099 |
| MISC | 0.7092 | 0.4264 | 0.5326 |
| ORG | 0.7854 | 0.8558 | 0.8191 |
| PER | 0.9471 | 0.9329 | 0.9400 |
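The F1 column can be sanity-checked against the precision and recall columns. The sketch below assumes F1 here is the usual harmonic mean of precision and recall; the helper name `f1_score` is illustrative and not part of SpanMarker. Because the table values are rounded to four decimals, a recomputed score may differ from the table in the last digit.

```python
# Precision and recall per label, copied from the metrics table above.
metrics = {
    "all": (0.8331, 0.8074),
    "LOC": (0.8471, 0.7759),
    "MISC": (0.7092, 0.4264),
    "ORG": (0.7854, 0.8558),
    "PER": (0.9471, 0.9329),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute F1 for every label from the rounded table values.
f1_by_label = {label: f1_score(p, r) for label, (p, r) in metrics.items()}

for label, f1 in f1_by_label.items():
    print(f"{label:<5} F1 = {f1:.4f}")
```

The low MISC score follows directly from its recall of 0.4264: the harmonic mean is pulled toward the smaller of the two inputs.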
Uses
Direct Use for Inference
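from span_marker import SpanMarkerModel
# "span_marker_model_id" is a placeholder; replace it with this model's Hub ID or a local path
model = SpanMarkerModel.from_pretrained("span_marker_model_id")
# Run inference on a single sentence
entities = model.predict("(SV2147) PP: PROBLEMAS INTERNOS PSOE INTERFIEREN EN POLITICA DE LA JUNTA Córdoba (EFE).")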
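A minimal sketch of post-processing the predictions, assuming `predict` on a single sentence returns a list of dictionaries with at least `span`, `label`, and `score` keys. The helper `group_spans_by_label` and the `example_entities` list are hypothetical: the list is a hand-written stand-in with made-up scores, not actual output of this model.

```python
from collections import defaultdict


def group_spans_by_label(entities, min_score=0.0):
    """Group predicted span texts by label, dropping spans below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hand-written stand-in for the list returned by model.predict(...).
# The spans, labels, and scores are illustrative, not real model output.
example_entities = [
    {"span": "PSOE", "label": "ORG", "score": 0.95},
    {"span": "Córdoba", "label": "LOC", "score": 0.90},
    {"span": "EFE", "label": "ORG", "score": 0.40},
]

print(group_spans_by_label(example_entities, min_score=0.5))
```

Filtering on the score is a design choice: a higher threshold trades recall for precision, which may matter for labels such as MISC where the model is less reliable.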
Downstream Use
You can finetune this model on your own dataset.
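from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Load this model as the starting checkpoint ("span_marker_model_id" is a placeholder)
model = SpanMarkerModel.from_pretrained("span_marker_model_id")

# Example dataset; substitute your own dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")

trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("span_marker_model_id-finetuned")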
Training Details
Training Set Metrics
| Training set | Min | Median | Max |
|:-------------|:----|:-------|:----|
| Sentence length | 0 | 31.8014 | 1238 |
| Entities per sentence | 0 | 2.2583 | 160 |
Training Hyperparameters
- learning_rate: 5e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
- mixed_precision_training: Native AMP
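The relationship between these values can be sketched as follows. The total batch size is the per-device batch size times the gradient accumulation steps, assuming a single device. The total step count is not stated in this card; here it is estimated from the Training Results section (step 200 at epoch 0.1164), and the warmup length follows the common convention of rounding `total_steps * warmup_ratio` up. The function `linear_lr` is an illustrative reimplementation of a linear warmup and decay schedule, not the trainer's own code.

```python
import math

# Values from the hyperparameter list above.
learning_rate = 5e-05
train_batch_size = 4
gradient_accumulation_steps = 2
warmup_ratio = 0.1

# Assumption: training ran on a single device.
effective_batch_size = train_batch_size * gradient_accumulation_steps

# Assumption: total steps estimated from "step 200 at epoch 0.1164" for one epoch.
total_steps = round(200 / 0.1164)
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup then linear decay."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    remaining = max(0, total_steps - step)
    return learning_rate * remaining / (total_steps - warmup_steps)


print(f"effective batch size: {effective_batch_size}")
print(f"estimated total steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {linear_lr(step):.2e}")
```

Under these assumptions the learning rate peaks at 5e-05 once warmup ends and falls linearly to zero at the final step.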
Training Results
| Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:------|:-----|:----------------|:---------------------|:------------------|:--------------|:--------------------|
| 0.1164 | 200 | 0.0260 | 0.6907 | 0.5358 | 0.6035 | 0.9264 |
| 0.2328 | 400 | 0.0199 | 0.7567 | 0.6384 | 0.6925 | 0.9414 |
| 0.3491 | 600 | 0.0176 | 0.7773 | 0.7273 | 0.7515 | 0.9563 |
| 0.4655 | 800 | 0.0157 | 0.8066 | 0.7598 | 0.7825 | 0.9601 |
| 0.5819 | 1000 | 0.0158 | 0.8031 | 0.7413 | 0.7710 | 0.9605 |
| 0.6983 | 1200 | 0.0156 | 0.7975 | 0.7598 | 0.7782 | 0.9609 |
| 0.8147 | 1400 | 0.0139 | 0.8210 | 0.7615 | 0.7901 | 0.9625 |
| 0.9310 | 1600 | 0.0129 | 0.8426 | 0.7848 | 0.8127 | 0.9651 |
Framework Versions
- Python: 3.10.12
- SpanMarker: 1.5.0
- Transformers: 4.38.2
- PyTorch: 2.2.1+cu121
- Datasets: 2.18.0
- Tokenizers: 0.15.2
Citation
BibTeX
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}