This model is a historical named entity recognition model trained on HIPE-style historical newspaper data.
It is based on dbmdz/bert-base-historic-multilingual-cased and fine-tuned for token classification using the NE-COARSE-LIT annotation column.
The model predicts coarse named entity labels for historical French text.
This is the baseline model from the historical-ner experiments.
| Model | Base encoder | Additional layers | Temporal encoding | Description |
|---|---|---|---|---|
| baseline | dbmdz/bert-base-historic-multilingual-cased | 0 | No | Historical BERT with a token-classification head |
The original data contained labels from slightly different annotation schemes. Labels were normalized before training.
Examples:
| Original label | Normalized label |
|---|---|
| B-PER | B-pers |
| I-PER | I-pers |
| B-LOC | B-loc |
| I-ORG | I-org |
| B-STREET | B-loc |
| B-BUILDING | B-loc |
| B-HumanProd | B-prod |
| B-object | B-prod |
| B-work | B-prod |
| B-date | B-time |
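The normalization can be sketched as a small lookup on the entity type that leaves the `B-`/`I-` prefix untouched. This is a minimal illustration, not the training code: `TYPE_MAP` and `normalize_label` are hypothetical names, the mapping contains only the examples listed above, and applying each type mapping to both prefixes is an assumption.

```python
# Hypothetical mapping built only from the examples listed above;
# the actual training code may cover more source types.
TYPE_MAP = {
    "PER": "pers",
    "LOC": "loc",
    "ORG": "org",
    "STREET": "loc",
    "BUILDING": "loc",
    "HumanProd": "prod",
    "object": "prod",
    "work": "prod",
    "date": "time",
}


def normalize_label(label: str) -> str:
    """Map a source IOB label to the coarse scheme, keeping its B-/I- prefix."""
    prefix, sep, entity_type = label.partition("-")
    if not sep:
        # "O" and other unprefixed labels pass through unchanged.
        return label
    # Assumption: types missing from TYPE_MAP are left as they are.
    return f"{prefix}-{TYPE_MAP.get(entity_type, entity_type)}"


print([normalize_label(l) for l in ["B-PER", "I-ORG", "B-STREET", "O"]])
```

Already normalized labels such as `B-pers` are returned unchanged, so the function can be applied to mixed input safely.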
Evaluation was performed on:
data/hipe2020/fr/HIPE-2022-v2.1-hipe2020-test-fr.tsv
| Model | Overall P | Overall R | Overall F1 | Loss | loc F1 | org F1 | pers F1 | prod F1 | time F1 | Macro F1 | Weighted F1 | Epochs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| baseline | 0.7797 | 0.7543 | 0.7668 | 0.1132 | 0.87 | 0.61 | 0.68 | 0.66 | 0.40 | 0.65 | 0.76 | 5 |
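As a quick sanity check, the overall F1 is consistent with the harmonic mean of the overall precision and recall reported above. The snippet below only re-derives that number from the two reported values; it is not the evaluation script.

```python
# Overall precision and recall as reported in the results table.
precision, recall = 0.7797, 0.7543

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7668
```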
| Entity type | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| loc | 0.87 | 0.87 | 0.87 | 797 |
| org | 0.66 | 0.57 | 0.61 | 128 |
| pers | 0.68 | 0.69 | 0.68 | 530 |
| prod | 0.77 | 0.58 | 0.66 | 59 |
| time | 0.53 | 0.32 | 0.40 | 53 |
| micro avg | 0.78 | 0.75 | 0.77 | 1567 |
| macro avg | 0.70 | 0.61 | 0.65 | 1567 |
| weighted avg | 0.78 | 0.75 | 0.76 | 1567 |
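The macro and weighted averages can be approximately re-derived from the per-type rows. The sketch below uses the rounded two-decimal values from the table, so its results can differ from the reported averages by about 0.01; the variable names are illustrative.

```python
# entity type -> (precision, recall, f1, support), copied from the table above
rows = {
    "loc": (0.87, 0.87, 0.87, 797),
    "org": (0.66, 0.57, 0.61, 128),
    "pers": (0.68, 0.69, 0.68, 530),
    "prod": (0.77, 0.58, 0.66, 59),
    "time": (0.53, 0.32, 0.40, 53),
}

total_support = sum(r[3] for r in rows.values())

# Macro average: every entity type counts equally.
macro_f1 = sum(r[2] for r in rows.values()) / len(rows)

# Weighted average: each type is weighted by its number of gold entities.
weighted_f1 = sum(r[2] * r[3] for r in rows.values()) / total_support

print(total_support, round(macro_f1, 3), round(weighted_f1, 3))
```

The gap between the macro and weighted scores reflects the class imbalance: `loc` and `pers` account for most of the support, while the weaker `time` and `org` types pull the macro average down.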
Install the required packages:
pip install torch transformers
Then run:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "emanuelaboros/historical-ner-baseline"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # group consecutive tokens into entity spans
)

text = "Charlotte née Bourgoin, femme de Joseph Digiez, fut admise par le Conseil."
predictions = ner(text)

for entity in predictions:
    print(entity)
Example output format:
[
    {
        "entity_group": "pers",
        "score": 0.98,
        "word": "Charlotte née Bourgoin",
        "start": 0,
        "end": 22,
    },
    {
        "entity_group": "pers",
        "score": 0.97,
        "word": "Joseph Digiez",
        "start": 33,
        "end": 46,
    },
    {
        "entity_group": "org",
        "score": 0.91,
        "word": "Conseil",
        "start": 66,
        "end": 73,
    },
]
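Downstream code often needs the entities grouped by type, with low-confidence predictions dropped. The sketch below shows one way to do that on a list in the format above. The sample list, the `group_entities` helper, and the `min_score` threshold of 0.95 are all illustrative assumptions, not part of the model or the pipeline API.

```python
from collections import defaultdict

# Sample predictions in the pipeline's output format (scores are illustrative;
# only the fields used below are included).
predictions = [
    {"entity_group": "pers", "score": 0.98, "word": "Charlotte née Bourgoin"},
    {"entity_group": "pers", "score": 0.97, "word": "Joseph Digiez"},
    {"entity_group": "org", "score": 0.91, "word": "Conseil"},
]


def group_entities(preds, min_score=0.95):
    """Collect entity strings per type, dropping predictions below min_score."""
    grouped = defaultdict(list)
    for pred in preds:
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


print(group_entities(predictions))
```

With the default threshold the `org` prediction is filtered out; lowering `min_score` to 0.9 keeps it. A suitable threshold depends on the application and is best chosen on held-out data.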