Historical NER Baseline

This is a named entity recognition (NER) model for historical text, trained on HIPE-style historical newspaper data.

It fine-tunes dbmdz/bert-base-historic-multilingual-cased for token classification, using the NE-COARSE-LIT annotation column as the training target.

The model predicts coarse named entity labels (loc, org, pers, prod, time) for historical French text.

Model description

This is the baseline model from the historical-ner experiments.

Model	Base encoder	Additional layers	Temporal encoding	Description
`baseline`	`dbmdz/bert-base-historic-multilingual-cased`	0	No	Historical BERT with a token-classification head

Label normalization

The source data mixed labels from slightly different annotation schemes, so all labels were normalized to a single lowercase coarse tag set (loc, org, pers, prod, time) before training.

Examples:

B-PER       -> B-pers
I-PER       -> I-pers
B-LOC       -> B-loc
I-ORG       -> I-org
B-STREET    -> B-loc
B-BUILDING  -> B-loc
B-HumanProd -> B-prod
B-object    -> B-prod
B-work      -> B-prod
B-date      -> B-time
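
The mapping above can be sketched as a small lookup on the entity type that keeps the B-/I- prefix. This is a minimal illustration, not the project's actual preprocessing code: `TYPE_MAP` and `normalize_label` are hypothetical names, the table holds only the types listed in the examples, and passing unknown types through unchanged is an assumption.

```python
# Type-level mapping inferred from the examples above; the real training
# script may cover more source labels than are listed here.
TYPE_MAP = {
    "PER": "pers",
    "LOC": "loc",
    "ORG": "org",
    "STREET": "loc",
    "BUILDING": "loc",
    "HumanProd": "prod",
    "object": "prod",
    "work": "prod",
    "date": "time",
}


def normalize_label(label: str) -> str:
    """Map one BIO label to the normalized coarse tag set."""
    if "-" not in label:
        return label  # "O" and other unprefixed labels pass through
    prefix, entity_type = label.split("-", 1)
    # Assumption: unknown types are kept unchanged rather than dropped.
    return f"{prefix}-{TYPE_MAP.get(entity_type, entity_type)}"


print([normalize_label(tag) for tag in ["B-PER", "I-ORG", "B-STREET", "O"]])
# ['B-pers', 'I-org', 'B-loc', 'O']
```

Labels that are already normalized, such as B-pers, are returned unchanged, so the function can be applied to mixed input.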

Evaluation

The model was evaluated on the French test split of the hipe2020 dataset (HIPE-2022 data release v2.1):

data/hipe2020/fr/HIPE-2022-v2.1-hipe2020-test-fr.tsv
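
For readers who want to inspect the gold labels, the sketch below pulls (token, label) pairs for the NE-COARSE-LIT column out of a HIPE-style TSV. It assumes a tab-separated header row naming the columns, metadata lines that start with "#" and contain no tab, and blank lines between documents. `read_coarse_labels` and the two-token `SAMPLE` are hypothetical; real files carry more columns than shown here.

```python
import io

# Hypothetical excerpt in the assumed HIPE layout (columns trimmed).
SAMPLE = (
    "TOKEN\tNE-COARSE-LIT\tNE-COARSE-METO\tMISC\n"
    "# document_id = example-doc\n"
    "Joseph\tB-pers\tO\t_\n"
    "Digiez\tI-pers\tO\t_\n"
)


def read_coarse_labels(handle, column="NE-COARSE-LIT"):
    """Return (token, label) pairs from a HIPE-style TSV stream."""
    header = None
    pairs = []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # blank separator line
        if line.startswith("#") and "\t" not in line:
            continue  # metadata comment (assumed to contain no tab)
        fields = line.split("\t")
        if header is None:
            # The first remaining line is taken to be the header row.
            header = fields
            token_idx = header.index("TOKEN")
            label_idx = header.index(column)
            continue
        pairs.append((fields[token_idx], fields[label_idx]))
    return pairs


print(read_coarse_labels(io.StringIO(SAMPLE)))
# [('Joseph', 'B-pers'), ('Digiez', 'I-pers')]
```

To read the actual test file, pass an open file handle (for example `open(path, encoding="utf-8")`) instead of the `io.StringIO` wrapper.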

Overall results

Model	Overall P	Overall R	Overall F1	Loss	loc F1	org F1	pers F1	prod F1	time F1	Macro F1	Weighted F1	Epochs
`baseline`	0.7797	0.7543	0.7668	0.1132	0.87	0.61	0.68	0.66	0.40	0.65	0.76	5

Per-class results

Entity type	Precision	Recall	F1	Support
`loc`	0.87	0.87	0.87	797
`org`	0.66	0.57	0.61	128
`pers`	0.68	0.69	0.68	530
`prod`	0.77	0.58	0.66	59
`time`	0.53	0.32	0.40	53
micro avg	0.78	0.75	0.77	1567
macro avg	0.70	0.61	0.65	1567
weighted avg	0.78	0.75	0.76	1567
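
As a sanity check on how the averages relate, the snippet below recomputes them from the rounded figures in the two tables. It is only an arithmetic illustration, not the evaluation script; the variable names are hypothetical.

```python
# Per-class F1 and support, copied from the per-class table above.
per_class = {
    "loc": (0.87, 797),
    "org": (0.61, 128),
    "pers": (0.68, 530),
    "prod": (0.66, 59),
    "time": (0.40, 53),
}

# Overall (micro) precision and recall from the overall results table.
overall_p, overall_r = 0.7797, 0.7543

# Micro F1 is the harmonic mean of overall precision and recall.
micro_f1 = 2 * overall_p * overall_r / (overall_p + overall_r)

# Weighted F1 averages per-class F1 by support; macro F1 ignores support.
total_support = sum(n for _, n in per_class.values())
weighted_f1 = sum(f1 * n for f1, n in per_class.values()) / total_support
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

print(total_support, round(micro_f1, 4), round(weighted_f1, 2), round(macro_f1, 3))
# 1567 0.7668 0.76 0.644
```

The micro and weighted values match the table. The macro value recomputed from rounded per-class scores is 0.644, slightly below the reported 0.65, which is consistent with the reported figure having been computed from unrounded scores.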

How to use

Install the required packages:

pip install torch transformers

Then run:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "emanuelaboros/historical-ner-baseline"

# Load the fine-tuned tokenizer and token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans.
ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

text = "Charlotte née Bourgoin, femme de Joseph Digiez, fut admise par le Conseil."

predictions = ner(text)

# Each prediction is a dict with the entity type, score, text, and character offsets.
for entity in predictions:
    print(entity)

Example output format (scores are illustrative; start and end are character offsets into the input text):

[
    {
        "entity_group": "pers",
        "score": 0.98,
        "word": "Charlotte née Bourgoin",
        "start": 0,
        "end": 22,
    },
    {
        "entity_group": "pers",
        "score": 0.97,
        "word": "Joseph Digiez",
        "start": 33,
        "end": 46,
    },
    {
        "entity_group": "org",
        "score": 0.91,
        "word": "Conseil",
        "start": 66,
        "end": 73,
    },
]
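
A common follow-up step is to drop low-confidence spans and group the rest by entity type. The sketch below does this on a list shaped like the pipeline output; `group_by_type` and the `min_score` threshold are hypothetical, and the sample values are the illustrative ones from the example rather than real model output.

```python
from collections import defaultdict


def group_by_type(predictions, min_score=0.0):
    """Group entity strings by type, keeping spans scored at or above min_score."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


# Illustrative predictions in the pipeline's aggregated output format.
sample = [
    {"entity_group": "pers", "score": 0.98, "word": "Charlotte née Bourgoin"},
    {"entity_group": "pers", "score": 0.97, "word": "Joseph Digiez"},
    {"entity_group": "org", "score": 0.91, "word": "Conseil"},
]

print(group_by_type(sample, min_score=0.95))
# {'pers': ['Charlotte née Bourgoin', 'Joseph Digiez']}
```

With the default `min_score=0.0` every span is kept, so the org entry would also appear in the result.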