Historical NER Baseline

This is a named entity recognition (NER) model for historical text, trained on HIPE-style historical newspaper data.

It is based on dbmdz/bert-base-historic-multilingual-cased and fine-tuned for token classification using the NE-COARSE-LIT annotation column.

The model predicts coarse named entity labels for historical French text.

Model description

This is the baseline model from the historical-ner experiments.

| Model | Base encoder | Additional layers | Temporal encoding | Description |
| --- | --- | --- | --- | --- |
| baseline | dbmdz/bert-base-historic-multilingual-cased | 0 | No | Historical BERT with a token-classification head |

Label normalization

The source data mixed labels from slightly different annotation schemes, so labels were normalized to a common set of coarse types before training.

Examples:

B-PER       -> B-pers
I-PER       -> I-pers
B-LOC       -> B-loc
I-ORG       -> I-org
B-STREET    -> B-loc
B-BUILDING  -> B-loc
B-HumanProd -> B-prod
B-object    -> B-prod
B-work      -> B-prod
B-date      -> B-time
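
The mapping above can be written as a simple lookup. The sketch below is an assumption about how such a normalization might look, not the training code itself: `NORMALIZATION` holds only the example pairs listed above, and `normalize_label` and `normalize_sequence` are hypothetical helpers that leave any unlisted label unchanged.

```python
# Only the example pairs listed in this card; the full mapping used for
# training is not given, so this table is deliberately incomplete.
NORMALIZATION = {
    "B-PER": "B-pers",
    "I-PER": "I-pers",
    "B-LOC": "B-loc",
    "I-ORG": "I-org",
    "B-STREET": "B-loc",
    "B-BUILDING": "B-loc",
    "B-HumanProd": "B-prod",
    "B-object": "B-prod",
    "B-work": "B-prod",
    "B-date": "B-time",
}


def normalize_label(label: str) -> str:
    """Map a raw label to its normalized form; unlisted labels pass through (assumption)."""
    return NORMALIZATION.get(label, label)


def normalize_sequence(labels):
    """Normalize every label of one tagged sentence."""
    return [normalize_label(label) for label in labels]


tags = ["B-PER", "I-PER", "O", "B-STREET", "B-date"]
print(normalize_sequence(tags))
# ['B-pers', 'I-pers', 'O', 'B-loc', 'B-time']
```

Passing unknown labels through unchanged keeps `O` and already-normalized tags intact. A stricter variant could raise an error on unmapped labels instead.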

Evaluation

Evaluation was performed on:

data/hipe2020/fr/HIPE-2022-v2.1-hipe2020-test-fr.tsv
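
To inspect the gold labels in a file like this, the token and NE-COARSE-LIT columns can be read by header name. This is a minimal sketch under stated assumptions: the file is tab-separated with a header row containing `TOKEN` and `NE-COARSE-LIT`, and metadata lines start with `#` and contain no tabs. The sample text and the `read_column` helper are hypothetical and hand-written for illustration, not taken from the test file.

```python
import io

# Tiny hand-written sample in an assumed HIPE-style layout.
SAMPLE = (
    "TOKEN\tNE-COARSE-LIT\tNE-COARSE-METO\n"
    "# document_id = example-doc\n"
    "Joseph\tB-pers\tO\n"
    "Digiez\tI-pers\tO\n"
    "à\tO\tO\n"
    "Lausanne\tB-loc\tO\n"
)


def read_column(handle, column="NE-COARSE-LIT"):
    """Return (token, label) pairs for one annotation column."""
    token_idx = label_idx = None
    pairs = []
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and metadata comments (no tab, leading '#').
        if not line.strip() or (line.startswith("#") and "\t" not in line):
            continue
        fields = line.split("\t")
        if token_idx is None:
            # First data-like line is the header row.
            token_idx = fields.index("TOKEN")
            label_idx = fields.index(column)
            continue
        pairs.append((fields[token_idx], fields[label_idx]))
    return pairs


for token, label in read_column(io.StringIO(SAMPLE)):
    print(token, label)
```

For a real file, `io.StringIO(SAMPLE)` would be replaced with `open(path, encoding="utf-8")`.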

Overall results

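| Model | Overall P | Overall R | Overall F1 | Loss | loc F1 | org F1 | pers F1 | prod F1 | time F1 | Macro F1 | Weighted F1 | Epochs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| baseline | 0.7797 | 0.7543 | 0.7668 | 0.1132 | 0.87 | 0.61 | 0.68 | 0.66 | 0.40 | 0.65 | 0.76 | 5 |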

Per-class results

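| Entity type | Precision | Recall | F1 | Support |
| --- | --- | --- | --- | --- |
| loc | 0.87 | 0.87 | 0.87 | 797 |
| org | 0.66 | 0.57 | 0.61 | 128 |
| pers | 0.68 | 0.69 | 0.68 | 530 |
| prod | 0.77 | 0.58 | 0.66 | 59 |
| time | 0.53 | 0.32 | 0.40 | 53 |
| micro avg | 0.78 | 0.75 | 0.77 | 1567 |
| macro avg | 0.70 | 0.61 | 0.65 | 1567 |
| weighted avg | 0.78 | 0.75 | 0.76 | 1567 |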
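
As a sanity check on the averaged rows, the macro and weighted averages can be recomputed from the per-class figures. The sketch below assumes the usual definitions: macro is the unweighted mean over entity types, and weighted is the support-weighted mean. The helper names are hypothetical. Because the inputs are already rounded to two decimals, some recomputed values can differ from the reported rows in the last digit; macro F1, for instance, comes out near 0.64 rather than 0.65.

```python
# (precision, recall, f1, support) per entity type, copied from the table above.
PER_CLASS = {
    "loc": (0.87, 0.87, 0.87, 797),
    "org": (0.66, 0.57, 0.61, 128),
    "pers": (0.68, 0.69, 0.68, 530),
    "prod": (0.77, 0.58, 0.66, 59),
    "time": (0.53, 0.32, 0.40, 53),
}
PRECISION, RECALL, F1, SUPPORT = range(4)


def macro_avg(metric):
    """Unweighted mean of one metric over all entity types."""
    values = [row[metric] for row in PER_CLASS.values()]
    return sum(values) / len(values)


def weighted_avg(metric):
    """Mean of one metric weighted by each type's support."""
    total = sum(row[SUPPORT] for row in PER_CLASS.values())
    return sum(row[metric] * row[SUPPORT] for row in PER_CLASS.values()) / total


total_support = sum(row[SUPPORT] for row in PER_CLASS.values())
print(total_support)                      # 1567
print(round(macro_avg(PRECISION), 2))     # 0.7
print(round(macro_avg(RECALL), 2))        # 0.61
print(round(weighted_avg(F1), 2))         # 0.76
```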

How to use

Install the required packages:

pip install torch transformers

Then run:

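from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "emanuelaboros/historical-ner-baseline"

# Load the fine-tuned tokenizer and token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# aggregation_strategy="simple" groups adjacent tokens of the same type into one entity span.
ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

text = "Charlotte née Bourgoin, femme de Joseph Digiez, fut admise par le Conseil."

predictions = ner(text)

# Each prediction is a dict with the entity type, score, matched text, and character offsets.
for entity in predictions:
    print(entity)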

Example output format (start and end are character offsets into text):

[
    {
        "entity_group": "pers",
        "score": 0.98,
        "word": "Charlotte née Bourgoin",
        "start": 0,
        "end": 22,
    },
    {
        "entity_group": "pers",
        "score": 0.97,
        "word": "Joseph Digiez",
        "start": 33,
        "end": 46,
    },
    {
        "entity_group": "org",
        "score": 0.91,
        "word": "Conseil",
        "start": 66,
        "end": 73,
    },
]
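
Predictions in this shape are easy to post-process. As a minimal sketch, the hypothetical `group_entities` helper below keeps entities above a confidence threshold and groups their surface forms by type, slicing the input text with the `start` and `end` offsets. The `predictions` list is hand-written in the aggregated output shape with illustrative scores, and the `min_score` default of 0.95 is an arbitrary assumption, not a recommended setting.

```python
from collections import defaultdict

text = "Charlotte née Bourgoin, femme de Joseph Digiez, fut admise par le Conseil."

# Hand-written sample in the aggregated output shape; scores are illustrative.
predictions = [
    {"entity_group": "pers", "score": 0.98, "word": "Charlotte née Bourgoin", "start": 0, "end": 22},
    {"entity_group": "pers", "score": 0.97, "word": "Joseph Digiez", "start": 33, "end": 46},
    {"entity_group": "org", "score": 0.91, "word": "Conseil", "start": 66, "end": 73},
]


def group_entities(text, predictions, min_score=0.95):
    """Group entity surface forms by type, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for entity in predictions:
        if entity["score"] >= min_score:
            # Slice the original text so the surface form keeps its exact spelling.
            grouped[entity["entity_group"]].append(text[entity["start"]:entity["end"]])
    return dict(grouped)


print(group_entities(text, predictions))
# {'pers': ['Charlotte née Bourgoin', 'Joseph Digiez']}
```

Slicing `text` rather than reading `word` avoids any tokenizer-induced changes in spacing or casing.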
