The Impresso NER model uses the stacked Transformer architecture published at CoNLL 2020 and is trained on the Impresso HIPE-2020 portion of the HIPE-2022 dataset. It recognizes entity types such as person, location, and organization, and supports the complete HIPE typology: coarse and fine-grained entity types as well as components such as names, titles, and roles. The model's backbone (dbmdz/bert-medium-historic-multilingual-cased) was pretrained on historical European text from the Europeana and British Library collections in German, French, English, Finnish, and Swedish. Because of this multilingual backbone, the NER model may also recognize entities in languages other than French and German.
How to use
You can use this model with the Transformers pipeline for NER.
# Import the tokenizer loader and the pipeline factory from the Transformers library
from transformers import AutoTokenizer, pipeline

# Name of the Impresso NER model, hosted at
# https://huggingface.co/impresso-project/ner-stacked-bert-multilingual
MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"

# Load the tokenizer that matches the model
ner_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Build the NER pipeline. "generic-ner" is not a built-in Transformers task, so
# trust_remote_code=True is needed to load the pipeline code from the model repository.
ner_pipeline = pipeline(
    "generic-ner",
    model=MODEL_NAME,
    tokenizer=ner_tokenizer,
    trust_remote_code=True,
    device="cpu",
)

sentence = "En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité. À la cour du roi Philippe VI, les murs du Louvre étaient animés par les rapports sombres venus de Paris et des villes environnantes. La peste ne montrait aucun signe de répit, et le chancelier Guillaume de Nogaret, le conseiller le plus fidèle du roi, portait le lourd fardeau de gérer la survie du royaume."
entities = ner_pipeline(sentence)
print(entities)
[
{'type': 'time', 'confidence_ner': 85.0, 'surface': "an 1348", 'lOffset': 0, 'rOffset': 12},
{'type': 'loc', 'confidence_ner': 90.75, 'surface': 'Europe', 'lOffset': 69, 'rOffset': 75},
{'type': 'loc', 'confidence_ner': 75.45, 'surface': 'Royaume de France', 'lOffset': 80, 'rOffset': 97},
{'type': 'pers', 'confidence_ner': 85.27, 'surface': 'roi Philippe VI', 'lOffset': 181, 'rOffset': 196, 'title': 'roi', 'name': 'roi Philippe VI'},
{'type': 'loc', 'confidence_ner': 30.59, 'surface': 'Louvre', 'lOffset': 210, 'rOffset': 216},
{'type': 'loc', 'confidence_ner': 94.46, 'surface': 'Paris', 'lOffset': 266, 'rOffset': 271},
{'type': 'pers', 'confidence_ner': 96.1, 'surface': 'chancelier Guillaume de Nogaret', 'lOffset': 350, 'rOffset': 381, 'title': 'chancelier', 'name': 'chancelier Guillaume de Nogaret'},
{'type': 'loc', 'confidence_ner': 49.35, 'surface': 'Royaume', 'lOffset': 80, 'rOffset': 87},
{'type': 'loc', 'confidence_ner': 24.18, 'surface': 'France', 'lOffset': 91, 'rOffset': 97}
]
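The output above contains low-confidence predictions (for example `Louvre` at 30.59) and nested spans (`Royaume` and `France` inside `Royaume de France`). The following is a minimal post-processing sketch, not part of the model or its pipeline: the helper `filter_entities` and the threshold of 40 are hypothetical, and it assumes `confidence_ner` is a score on a 0-100 scale and that `lOffset`/`rOffset` are character offsets, as the sample output suggests.

```python
# A few entries copied from the sample output above.
entities = [
    {"type": "loc", "confidence_ner": 75.45, "surface": "Royaume de France", "lOffset": 80, "rOffset": 97},
    {"type": "loc", "confidence_ner": 30.59, "surface": "Louvre", "lOffset": 210, "rOffset": 216},
    {"type": "loc", "confidence_ner": 94.46, "surface": "Paris", "lOffset": 266, "rOffset": 271},
    {"type": "loc", "confidence_ner": 49.35, "surface": "Royaume", "lOffset": 80, "rOffset": 87},
    {"type": "loc", "confidence_ner": 24.18, "surface": "France", "lOffset": 91, "rOffset": 97},
]


def filter_entities(entities, min_confidence=40.0):
    """Hypothetical helper: keep confident entities and drop nested spans.

    Assumes confidence_ner is on a 0-100 scale (an assumption based on the
    sample output, not a documented guarantee).
    """
    confident = [e for e in entities if e["confidence_ner"] >= min_confidence]
    # Visit the longest spans first so a nested mention is always checked
    # against the span that contains it.
    confident.sort(key=lambda e: e["rOffset"] - e["lOffset"], reverse=True)
    kept = []
    for ent in confident:
        nested = any(
            k["lOffset"] <= ent["lOffset"] and ent["rOffset"] <= k["rOffset"]
            for k in kept
        )
        if not nested:
            kept.append(ent)
    # Restore reading order.
    return sorted(kept, key=lambda e: e["lOffset"])


filtered = filter_entities(entities)
for ent in filtered:
    print(ent["type"], ent["surface"], ent["confidence_ner"])
```

With the sample entries, this keeps `Royaume de France` and `Paris`: `Louvre` and `France` fall below the threshold, and `Royaume` is dropped because it lies inside the longer span. Choose the threshold to suit your own precision and recall needs.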
BibTeX entry and citation info
@inproceedings{boros2020alleviating,
  title={Alleviating digitization errors in named entity recognition for historical documents},
  author={Boros, Emanuela and Hamdi, Ahmed and Pontes, Elvys Linhares and Cabrera-Diego, Luis-Adri{\'a}n and Moreno, Jose G and Sidere, Nicolas and Doucet, Antoine},
  booktitle={Proceedings of the 24th Conference on Computational Natural Language Learning},
  pages={431--441},
  year={2020}
}